
Cluster Recovery Process misuses dqlite #13524

Closed MggMuggins closed 4 weeks ago

MggMuggins commented 4 months ago

lxd recover-from-quorum-loss currently calls the deprecated Node.Recover to reset the dqlite raft log with only the current node as a member of the cluster. While it should use ReconfigureMembershipExt instead, this usage is sound. However, the documentation seems to be missing a step:

Node.Recover and ReconfigureMembershipExt invoke dqlite_node_recover_ext. The comment block there indicates that the function should be called exactly once, after which the entire data directory for all remaining dqlite members should be completely replaced by the data dir from the member where dqlite_node_recover_ext was run.
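
For context, here is a minimal sketch of what a single invocation of the recovery API could look like through go-dqlite's client package; the import path, the exact ReconfigureMembershipExt signature, and the data dir, node ID, and address values are assumptions for illustration, not LXD's actual recovery code:

```go
package main

import (
	"log"

	"github.com/canonical/go-dqlite/client"
)

func main() {
	// dqlite data directory on the surviving member; for snap installs the
	// global database is assumed to live under this path.
	dir := "/var/snap/lxd/common/lxd/database/global"

	// The new raft configuration: only the members meant to survive.
	// IDs and addresses are placeholders.
	cluster := []client.NodeInfo{
		{ID: 1, Address: "10.0.0.1:8443", Role: client.Voter},
	}

	// Rewrites the raft log/metadata on this node only. Per the dqlite docs,
	// this must be run exactly once, after which the resulting data directory
	// is copied verbatim to every other surviving member.
	if err := client.ReconfigureMembershipExt(dir, cluster); err != nil {
		log.Fatalf("recover dqlite membership: %v", err)
	}
}
```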

The real issue here is that the docs for lxd cluster edit direct users to run the command on all remaining nodes. lxd cluster edit calls ReconfigureMembershipExt as well, which means that following those directions violates the guidance in the dqlite docs.

@cole-miller provided some additional context from the raft/dqlite side (please correct me if I'm wrong). I don't have a clear understanding of how dqlite handles appending & committing changes to the raft log; however, in the case where one or more cluster members have (differing) uncommitted changes in their logs, calling dqlite_node_recover_ext will cause them to commit those changes. This breaks Raft's log matching property and means that the state of the database may not be consistent between cluster members.

The solution here is as described in the dqlite docs: lxd cluster edit may be called on one node only, after which lxd's dqlite DB dir (/var/snap/lxd/common/lxd/database) needs to be duplicated to all other (remaining) cluster members.

The same procedure should be documented for recover-from-quorum-loss as well, in the case that it's used with more than one remaining cluster member.

I also think we can dramatically improve the user experience here by exporting the DB dir to a tarball in the user's CWD, and providing some kind of lxd global-db-import to perform the copy; happy to open a separate issue if we want to track that separately from the needed docs changes here.
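
As a rough illustration of that idea, here is a minimal sketch of the export step, assuming the recovery code would simply walk /var/snap/lxd/common/lxd/database and write a gzipped tarball into the current working directory; the archive name and layout are hypothetical, and the proposed lxd global-db-import command does not exist today:

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"log"
	"os"
	"path/filepath"
)

func main() {
	dbDir := "/var/snap/lxd/common/lxd/database"

	out, err := os.Create("lxd_recovery_db.tar.gz") // hypothetical name, written to CWD
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	gz := gzip.NewWriter(out)
	defer gz.Close()
	tw := tar.NewWriter(gz)
	defer tw.Close()

	// Archive everything under the database dir, preserving relative paths.
	err = filepath.Walk(dbDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(dbDir, path)
		if err != nil || rel == "." {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```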

github-actions[bot] commented 4 months ago

Heads up @ru-fu - the "Documentation" label was applied to this issue.

cole-miller commented 4 months ago

I'll try to provide some more context for why running dqlite_node_recover_ext with the same configuration on several nodes in the cluster can lead to problems. Suppose that we have three nodes N1, N2, N3, all offline, and their Raft logs look like this:

N1: | A(term=1) | C(term=3) |
N2: | A(term=1) | C(term=3) |
N3: | A(term=1) | B(term=2) |

Where A, B, C are distinct DB transactions.

One way this can happen: `N3` won the election in term 2, created `B`, but crashed before replicating it to the rest of the cluster. Then `N1` won the election in term 3 (with a vote from `N2`) and created and replicated `C`.

Now we run dqlite_node_recover_ext on each node in the cluster. This means each node creates a log entry D containing the new configuration and appends it locally to its log. The current implementation always uses term 1 for this new entry, so we get:

N1: | A(term=1) | C(term=3) | D(term=1) |
N2: | A(term=1) | C(term=3) | D(term=1) |
N3: | A(term=1) | B(term=2) | D(term=1) |

But now we have a problem, because this set of logs breaks one of the invariants from the Raft paper, the Log Matching Property (see page 5 of the paper):

> Log Matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.

If we bring up the cluster in this state, N3's database will be inconsistent with those of N1 and N2.

This specific case would be fixed if dqlite_node_recover_ext were changed to copy the term number for the new entry from the previous entry in the log. But that breaks the Log Matching Property in a different case. Suppose the situation before dqlite_node_recover_ext is:

N1: | A(term=1) | B(term=2) | C(term=2) |
N2: | A(term=1) | B(term=2) |
N3: | A(term=1) | B(term=2) |

This could happen if `N1` was the leader in term 2, it successfully replicated `B` to everyone, then created `C` but crashed before replicating it.

Then after dqlite_node_recover_ext on every node we get

N1: | A(term=1) | B(term=2) | C(term=2) | D(term=2) |
N2: | A(term=1) | B(term=2) | D(term=2) |
N3: | A(term=1) | B(term=2) | D(term=2) |

So we have an inconsistency at the third log index: N1 has C(term=2) there while N2 and N3 have D(term=2), entries with the same index and term in logs that are not identical, which again breaks the Log Matching Property. I haven't been able to come up with a way of computing the term number in dqlite_node_recover_ext that doesn't cause this problem in some situation when you run it on all nodes in the cluster.
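
To make the invariant concrete, here is a small, self-contained Go sketch (illustration only, not dqlite code) that models the example logs above and checks a pair of logs for a Log Matching violation; it flags both post-recovery scenarios:

```go
package main

import "fmt"

// entry is an illustration-only model of a raft log entry: an identifier for
// the transaction or configuration it carries, and the term it was created in.
type entry struct {
	id   string
	term int
}

// violatesLogMatching reports whether two logs break the Log Matching
// Property: if two logs contain an entry with the same index and term, the
// logs must be identical in all entries up through that index.
func violatesLogMatching(a, b []entry) bool {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := 0; i < n; i++ {
		if a[i].term != b[i].term {
			continue // different terms at this index impose no constraint
		}
		for j := 0; j <= i; j++ {
			if a[j] != b[j] {
				return true
			}
		}
	}
	return false
}

func main() {
	// First scenario: recover_ext appends D with term 1 on every node.
	n1 := []entry{{"A", 1}, {"C", 3}, {"D", 1}}
	n3 := []entry{{"A", 1}, {"B", 2}, {"D", 1}}
	fmt.Println(violatesLogMatching(n1, n3)) // true

	// Second scenario: recover_ext copies the term of the previous entry.
	m1 := []entry{{"A", 1}, {"B", 2}, {"C", 2}, {"D", 2}}
	m2 := []entry{{"A", 1}, {"B", 2}, {"D", 2}}
	fmt.Println(violatesLogMatching(m1, m2)) // true
}
```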

MggMuggins commented 3 months ago

I've modified the cluster recovery docs to follow the procedure described in the dqlite docs. The procedure works great... when it's used without cluster member address changes. I overlooked this in my initial assessment.

During lxd cluster edit, we update the local database with the new configuration's raft nodes. cluster edit also creates a patch for the global database to update addresses there, although if I understand correctly this will be applied when the edited node is able to access the db, so it's a secondary concern.

I don't see a way to update the local database without doing something akin to what I'm proposing in microcluster, where we export a database tarball during the recovery procedure and then import it on daemon startup. The tarball should include the yaml the user submitted to lxd cluster edit for easy validation/import on the other members.
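
A minimal sketch of the import-on-startup half of that idea, assuming the daemon would check a well-known path for a recovery tarball before bringing up dqlite and, if one is present, unpack it over the database directory; the paths, file names, and function are hypothetical:

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"errors"
	"io"
	"log"
	"os"
	"path/filepath"
)

// maybeImportRecoveryTarball unpacks tarball into dbDir if it exists and does
// nothing otherwise. A real implementation would also validate the archive
// (including the recovery yaml) and guard against path traversal.
func maybeImportRecoveryTarball(dbDir, tarball string) error {
	f, err := os.Open(tarball)
	if errors.Is(err, os.ErrNotExist) {
		return nil // nothing to import; normal startup
	} else if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		} else if err != nil {
			return err
		}

		target := filepath.Join(dbDir, filepath.Clean(hdr.Name))
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, 0o700); err != nil {
				return err
			}
		case tar.TypeReg:
			out, err := os.Create(target)
			if err != nil {
				return err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				return err
			}
			out.Close()
		}
	}
	return nil
}

func main() {
	// Hypothetical locations, mirroring the snap layout mentioned above.
	err := maybeImportRecoveryTarball(
		"/var/snap/lxd/common/lxd/database",
		"/var/snap/lxd/common/lxd/recovery_db.tar.gz",
	)
	if err != nil {
		log.Fatal(err)
	}
}
```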

If I'm making code changes related to recovery, it's probably a good idea to include a warning for cluster edit akin to the one in microcluster, since we're still using dqlite_node_recover_ext.