Before stepping `MsgSnap` to raft, we bump its term to the receiver's `Term` (to force it through raft's term checks). This would have been fine (because the snapshot carries committed state and can always be handled), but raft assumes the snapshot was sent by the leader of `MsgSnap.Term`, and updates its state accordingly [1, 2]. Since the `MsgSnap` term could have been bumped arbitrarily, these transitions can be incorrect: we will falsely believe that the snapshot originator is the leader at a different term.
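
To make the failure mode concrete, here is a toy model of the term bump. It is a sketch only: `message`, `node`, `bumpTerm` and `stepSnap` are hypothetical names invented for illustration and are not the raft package's API. The only behavior it assumes is what is described above, namely that a snapshot passing the term check makes its sender the `lead` at that term.

```go
package main

// None marks "no known leader" in this toy model.
const None uint64 = 0

// message is a hypothetical, pared-down stand-in for a MsgSnap.
type message struct {
	From uint64 // sender node ID
	Term uint64 // term the sender stamped on the message
}

// node is a hypothetical, pared-down stand-in for the receiver's raft state.
type node struct {
	Term uint64
	Lead uint64
}

// bumpTerm models the pre-step hack: overwrite the message term with the
// receiver's term so that the term check in stepSnap cannot reject it.
func bumpTerm(n *node, m message) message {
	m.Term = n.Term
	return m
}

// stepSnap models the assumption under discussion: the sender of a snapshot
// that passes the term check is treated as the leader at the message term.
func stepSnap(n *node, m message) {
	if m.Term < n.Term {
		return // stale message: ignored, leadership untouched
	}
	n.Term = m.Term
	n.Lead = m.From
}
```

In this model, a receiver at term 7 that gets a snapshot stamped with term 5 from node 2 ignores it when stepped directly, but after `bumpTerm` it records node 2 as `Lead` at term 7, even though node 2 only claimed leadership at term 5.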

The `lead` field is used for a bunch of things:

- if `lead != None`, we can't vote
- proposals are forwarded to this `lead`, so they can travel some distance and then be dropped, or never reach the actual leader
- the `lead` field is surfaced through the `SoftState` to CRDB, and we probably have some logic relying on it being correct
- Upd: the `lead` field is now in `HardState` and has a role in the correctness-sensitive liveness / fortification / leader leases protocols

To fix this case, we need to remove this term bump and allow raft to handle snapshots at outdated terms. A snapshot can always be applied because it carries committed state, which is not reversible. The one exception is a snapshot carrying committed state that we already have, but that is easy to check.

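
One possible shape of the fix, as a sketch rather than the actual change: the snapshot keeps the term its sender stamped on it, leadership is updated only when that term is not behind ours, and the snapshot is applied unless we already have the committed state it carries. All names here (`snapshotMsg`, `node`, `handleSnapshot`) are hypothetical, and the sketch assumes, as raft does, that a `MsgSnap` is sent by the leader of its own term.

```go
package main

// None marks "no known leader" in this toy model.
const None uint64 = 0

// snapshotMsg is a hypothetical stand-in for a MsgSnap whose term was NOT
// bumped before stepping.
type snapshotMsg struct {
	From  uint64 // sender node ID
	Term  uint64 // term as stamped by the sender
	Index uint64 // last log index covered by the snapshot
}

// node is a hypothetical stand-in for the receiver's raft state.
type node struct {
	Term      uint64
	Lead      uint64
	Committed uint64
}

// handleSnapshot reports whether the snapshot was applied.
func handleSnapshot(n *node, m snapshotMsg) bool {
	// Only trust the sender as leader if its own term is not behind ours.
	// A snapshot at an outdated term leaves Term and Lead untouched.
	if m.Term >= n.Term {
		n.Term = m.Term
		n.Lead = m.From
	}
	// The snapshot carries committed state, so it is safe to apply at any
	// term, unless we already have everything it covers.
	if m.Index <= n.Committed {
		return false
	}
	n.Committed = m.Index
	return true
}
```

The design point is that the leadership transition and the application of committed state are decided independently: the first depends on the term, the second only on whether the snapshot advances the committed index.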
Jira issue: CRDB-40401
Epic CRDB-39898