PsiACE / riteraft

RiteRaft - A raft framework, for regular people
Apache License 2.0
318 stars 21 forks source link

fix: fix create snapshot #22

Closed PsiACE closed 1 year ago

PsiACE commented 1 year ago

thanks @jopemachine and @gengliqi

Fixes: #8

PsiACE commented 1 year ago

Detailed explanation written by @gengliqi

In create_snapshot function(https://github.com/ritelabs/riteraft/blob/898a24569f66421f633caf2774bbd345b1dd0479/src/storage.rs#L261-L278), the newly-generated snapshot uses the commit index and term from HardState as its metadata, which is wrong because it is likely not equal to the index/term of the last applied log. If it's not equal, the consistency between different peers is broken.

In main branch, the HardState is saved before handling committed entries in which create_snapshot may be called so the index of snapshot must be greater than or equal to the one of last applied log(That's why it looks like ok). However, in dev branch, the order is reversed so the index of snapshot must be less than the one of last applied log.

Let's see the code here. https://github.com/ritelabs/riteraft/blob/59d4c484a3972709e4fda64fad1842b136d5fcde/src/raft_node.rs#L463-L472 In the specific case of https://github.com/ritelabs/riteraft/issues/8, the index of configuration change log of adding node 3 is 4. Then last_applied is 3. commit of HardState is also 3. store.compact(3) will remove all raft logs whose indexes are less than or equal to 3 and store.create_snapshot(snapshot) will create a snapshot with index 3(should be 4).

What's next? Node 1 is leader. It sends a snapshot with index 3 to node 3. Node 3 will apply the snapshot and send the response to node 1. If everything is ok, then node 1 will send the log with index 4 to node 3. However, node 1 must check if the term of index 3 from the response is the same as itself before sending the next append message. See code here https://github.com/tikv/raft-rs/blob/82d704cdc3d93258be1f45efd715b95930764d7f/src/raft.rs#L822-L839. Unfortunately, the term(3) function of HeedStorage will return Error(raft::StorageError::Compacted). Then node 1 will send a snapshot with index 3 again to node 3. An infinite dead loop!

To sum up:

  1. snapshot's metadata must be equal to the one of last applied log, otherwise many bad things can happen.
  2. the term must be acquired from the term function of Storage interface for the index of new snapshot.

Same thanks to @jopemachine for their enthusiasm, as well as thanks to the riteraft-py project

Since the patch content is very simple, it seems we can apply this change to the riteraft. Here is the patch that fixes issue 8.

https://github.com/lablup/riteraft-py/commit/6de029c62c1cf11e747187a51baf52f4c3cb0e2d