I'll try to provide some more context for why running `dqlite_node_recover_ext` with the same configuration on several nodes in the cluster can lead to problems. Suppose that we have three nodes N1, N2, N3, all offline, and their Raft logs look like this:
```
N1: | A(term=1) | C(term=3) |
N2: | A(term=1) | C(term=3) |
N3: | A(term=1) | B(term=2) |
```
where A, B, C are distinct DB transactions. This means, for example:

- in N1, the entry at index 0 contains A and was appended during Raft term 1
- in N1, the entry at index 1 contains C and was appended during Raft term 3

Now we run `dqlite_node_recover_ext` on each node in the cluster. This means each node creates a log entry D containing the new configuration and appends it locally to its log. The current implementation always uses the term 1 for this new entry, so we get:
```
N1: | A(term=1) | C(term=3) | D(term=1) |
N2: | A(term=1) | C(term=3) | D(term=1) |
N3: | A(term=1) | B(term=2) | D(term=1) |
```
But now we have a problem, because this set of logs breaks one of the invariants from the Raft paper, the Log Matching Property (see page 5 of the paper):

> Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
If we bring up the cluster in this state, N3's database will be inconsistent with those of N1 and N2.
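To make the violation concrete, here is a toy model of the check in Go. This is only an illustration of the invariant; the `entry` type and `violatesLogMatching` function are mine, not dqlite's actual log format or recovery code.

```go
package main

import "fmt"

// entry is a toy stand-in for a Raft log entry.
type entry struct {
	txn  string // which transaction/configuration the entry carries
	term uint64 // the Raft term it was appended in
}

// violatesLogMatching reports whether two logs break the Log Matching
// Property: if two logs contain an entry with the same index and term,
// they must be identical in all entries up through that index.
func violatesLogMatching(a, b []entry) bool {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].term != b[i].term {
			continue
		}
		// Same index and term: the whole prefix must match.
		for j := 0; j <= i; j++ {
			if a[j] != b[j] {
				return true
			}
		}
	}
	return false
}

func main() {
	// Logs after running recovery on every node with the hard-coded term 1.
	n1 := []entry{{"A", 1}, {"C", 3}, {"D", 1}}
	n3 := []entry{{"A", 1}, {"B", 2}, {"D", 1}}
	// true: index 2 has term 1 in both logs, but the prefixes disagree
	// (C at index 1 on N1 vs B at index 1 on N3).
	fmt.Println(violatesLogMatching(n1, n3))
}
```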
This specific case would be fixed if `dqlite_node_recover_ext` were changed to copy the term number for the new entry from the previous entry in the log. But that breaks the Log Matching Property in a different case. Suppose the situation before `dqlite_node_recover_ext` is:
```
N1: | A(term=1) | B(term=2) | C(term=2) |
N2: | A(term=1) | B(term=2) |
N3: | A(term=1) | B(term=2) |
```
Then after `dqlite_node_recover_ext` on every node we get:
```
N1: | A(term=1) | B(term=2) | C(term=2) | D(term=2) |
N2: | A(term=1) | B(term=2) | D(term=2) |
N3: | A(term=1) | B(term=2) | D(term=2) |
```
So we have an inconsistency at log index 2: N1 stores C(term=2) there while N2 and N3 store D(term=2), so two logs contain entries with the same index and term but are not identical up through that index. I haven't been able to come up with a way of computing the term number in `dqlite_node_recover_ext` that doesn't cause this kind of problem in some situation when you run it on all nodes in the cluster.
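The same toy check flags this variant too. The sketch below reuses `entry` and `violatesLogMatching` from the sketch above (with `main` replaced), and `appendRecoveryEntry` is my hypothetical model of the copy-the-previous-term rule, not real dqlite code.

```go
// appendRecoveryEntry models the hypothetical variant where recovery
// copies the term of the last existing entry instead of hard-coding 1.
func appendRecoveryEntry(log []entry) []entry {
	prev := log[len(log)-1]
	return append(log, entry{"D", prev.term})
}

func main() {
	n1 := appendRecoveryEntry([]entry{{"A", 1}, {"B", 2}, {"C", 2}})
	n2 := appendRecoveryEntry([]entry{{"A", 1}, {"B", 2}})
	// n1[2] is C(term=2) while n2[2] is D(term=2): same index and term,
	// different entries, so the Log Matching Property is violated.
	fmt.Println(violatesLogMatching(n1, n2)) // true
}
```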
I've modified the cluster recovery docs to follow the procedure described in the dqlite docs. The procedure works great... when it's used without cluster member address changes. I overlooked this in my initial assessment.
During `lxd cluster edit`, we update the local database with the new configuration's raft nodes. `cluster edit` also creates a patch for the global database to update addresses there, although if I understand correctly this will be applied once the edited node is able to access the global DB, so it's a secondary concern.
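For a rough picture of the local-database side, here is a purely illustrative sketch. The `local.db` path and the `raft_nodes` schema (id, address, role) are assumptions based on my reading of the LXD source, and this is not something to run by hand against a live system.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
	// Assumed location of LXD's node-local database (snap install).
	db, err := sql.Open("sqlite3", "/var/snap/lxd/common/lxd/database/local.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Point an existing raft member at its new address, mirroring what
	// the edited YAML describes.
	if _, err := db.Exec(
		`UPDATE raft_nodes SET address = ? WHERE id = ?`,
		"10.0.0.11:8443", 1,
	); err != nil {
		log.Fatal(err)
	}
}
```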
I don't see a way to update the local database without doing something akin to what I'm proposing in microcluster, where we export a database tarball during the recovery procedure and then import it on daemon startup. The tarball should include the YAML the user submitted to `lxd cluster edit` for easy validation/import on the other members.
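A minimal sketch of what that export step could look like, bundling the dqlite data directory and the submitted YAML into one tarball. The paths, file names, and tarball layout here are illustrative choices of mine, not the actual LXD or microcluster implementation.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"log"
	"os"
	"path/filepath"
)

func exportRecoveryTarball(dbDir, yamlPath, out string) error {
	f, err := os.Create(out)
	if err != nil {
		return err
	}
	defer f.Close()

	gw := gzip.NewWriter(f)
	defer gw.Close()
	tw := tar.NewWriter(gw)
	defer tw.Close()

	// add writes one file or directory into the tarball under name.
	add := func(path, name string) error {
		info, err := os.Stat(path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = name
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		src, err := os.Open(path)
		if err != nil {
			return err
		}
		defer src.Close()
		_, err = io.Copy(tw, src)
		return err
	}

	// Include the edited configuration for validation on import.
	if err := add(yamlPath, "recovery.yaml"); err != nil {
		return err
	}
	// Walk the database directory and add everything under "database/".
	return filepath.Walk(dbDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(dbDir, path)
		if err != nil || rel == "." {
			return err
		}
		return add(path, filepath.Join("database", rel))
	})
}

func main() {
	if err := exportRecoveryTarball(
		"/var/snap/lxd/common/lxd/database", // dqlite DB dir
		"edited-cluster.yaml",               // what the user submitted
		"lxd_recovery_db.tar.gz",            // written to the CWD
	); err != nil {
		log.Fatal(err)
	}
}
```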
If I'm making code changes related to recovery, it's probably a good idea to include a warning for `cluster edit` akin to the one in microcluster, since we're still using `dqlite_node_recover_ext`.
`lxd recover-from-quorum-loss` currently calls the deprecated `Node.Recover` to reset the dqlite raft log with only the current node as a member of the cluster. While it should use `ReconfigureMembershipExt` instead, this usage is sound. However, the documentation seems to miss a step: `Node.Recover` and `ReconfigureMembershipExt` invoke `dqlite_node_recover_ext`. The comment block there indicates that the function should be called exactly once, after which the entire data directory for all remaining dqlite members should be completely replaced by the data dir from the member where `dqlite_node_recover_ext` was run.

The real issue here is that the docs for `lxd cluster edit` direct users to run the command on all remaining nodes. `lxc cluster edit` calls `ReconfigureMembershipExt` as well, which means that this violates the directions provided in the dqlite docs.

@cole-miller provided some additional context from the raft/dqlite side (please correct me if I'm wrong). I don't have a clear understanding of how dqlite handles appending & committing changes to the raft log; however, in the case where one or more cluster members have (differing) uncommitted changes in their logs, calling `dqlite_node_recover_ext` will cause them to commit those changes. This breaks Raft's Log Matching Property and means that the state of the database may not be consistent between cluster members.

The solution here is as described in the dqlite docs: `lxc cluster edit` may be called on one node, after which LXD's dqlite DB dir (`/var/snap/lxd/common/lxd/database`) needs to be duplicated to all other (remaining) cluster members.

The same procedure should be documented for `recover-from-quorum-loss` as well, in case it's used with more than one remaining cluster member.

I also think we can dramatically improve the user experience here by exporting the DB dir to a tarball in the user's CWD, and providing some kind of `lxd global-db-import` command to perform the copy; happy to open a separate issue if we want to track that separately from the docs changes needed here.
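For reference, a sketch of what the sound run-once-then-copy flow looks like from the go-dqlite side. I believe `ReconfigureMembershipExt(dir, cluster)` in the go-dqlite root package is the wrapper around `dqlite_node_recover_ext`, but the module path, the exact signature, and LXD keeping its global dqlite data under `database/global` should all be checked against the versions actually in use.

```go
// Run on exactly ONE surviving node, with all LXD daemons stopped.
package main

import (
	"log"

	dqlite "github.com/canonical/go-dqlite"
	"github.com/canonical/go-dqlite/client"
)

func main() {
	// The desired post-recovery membership, with any updated addresses.
	cluster := []client.NodeInfo{
		{ID: 1, Address: "10.0.0.11:8443", Role: client.Voter},
		{ID: 2, Address: "10.0.0.12:8443", Role: client.Voter},
	}

	// Rewrites this node's raft state so that the given nodes form the
	// cluster (via dqlite_node_recover_ext).
	dir := "/var/snap/lxd/common/lxd/database/global"
	if err := dqlite.ReconfigureMembershipExt(dir, cluster); err != nil {
		log.Fatal(err)
	}

	// Per the dqlite docs: do NOT run this on the other members. Instead,
	// copy this node's database directory over theirs before restarting.
	log.Println("recovery entry appended; now copy the database dir to the other members")
}
```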