pav-kv opened 14 hours ago
To motivate this change, it would be good to have a repro for the linked example.

Alternative/complementary approach: when a replica applies a snapshot, it should send a `MsgAppResp` to the currently known leader, instead of (or in addition to) replying to `MsgSnap.From`. Today, `From` == the initiator of the snapshot, which is not necessarily the leader. In a leader/leaseholder split situation, the `MsgAppResp` sent back to the originator is a no-op, and the leader remains unaware that the snapshot has been applied.

Downside of this approach: if this `MsgAppResp` is lost/dropped, the replication flow remains stuck while the leader is streaming its snapshot. So this approach could be combined with the `MsgApp` probing described below if we want to be resilient to this.
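A minimal sketch of this on the snapshot-applying replica, with illustrative stand-in types (the `onSnapshotApplied` hook and fields here are hypothetical, not the actual raft API):

```go
package raftsketch

// Illustrative stand-ins for raft message plumbing; in the real code this
// lives inside the raft state machine's MsgSnap handling.
type Message struct {
	Type  string // e.g. "MsgAppResp"
	To    uint64
	Index uint64 // last index now durable on this replica
}

const None uint64 = 0 // no known leader

type replica struct {
	lead      uint64 // currently known leader, None if unknown
	lastIndex uint64
	outbox    []Message
}

func (r *replica) send(m Message) { r.outbox = append(r.outbox, m) }

// onSnapshotApplied acknowledges an applied snapshot. Today raft replies only
// to the snapshot's sender (MsgSnap.From); this sketch additionally acks the
// currently known leader, which can differ from the initiator in a
// leader/leaseholder split.
func (r *replica) onSnapshotApplied(from uint64) {
	resp := Message{Type: "MsgAppResp", To: from, Index: r.lastIndex}
	r.send(resp)
	if r.lead != None && r.lead != from {
		resp.To = r.lead
		r.send(resp) // let the actual leader exit StateSnapshot for this peer
	}
}
```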
Currently, `raft` does not send `MsgApp` probes to a peer if its flow is in `StateSnapshot`. This stalls replication to this peer until the outstanding snapshot has been streamed.
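For reference, the pause check in the leader's progress tracker looks roughly like this (simplified from etcd-io/raft's `Progress.IsPaused`; field names approximate):

```go
package raftsketch

// Follower-flow states in the leader's progress tracker.
type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

type Progress struct {
	State            StateType
	MsgAppFlowPaused bool // a probe is already outstanding
	InflightsFull    bool // the replication window is exhausted
}

// IsPaused tells the leader whether to withhold MsgApp for this peer. Note
// that StateSnapshot pauses the flow unconditionally: no probes are sent
// until the snapshot is reported applied (or failed).
func (pr *Progress) IsPaused() bool {
	switch pr.State {
	case StateProbe:
		return pr.MsgAppFlowPaused
	case StateReplicate:
		return pr.InflightsFull
	case StateSnapshot:
		return true // replication to this peer is fully stalled
	}
	return false
}
```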
In CRDB, snapshots can be initiated by the `raft` leader, or by a leaseholder when it adds a learner replica. In leader/leaseholder split situations, a leaseholder-initiated snapshot can race with the raft-initiated snapshot, and the raft snapshot can be queued behind the learner snapshot [example]. This doubles the replication "stall" duration to 2x the time it takes to transfer a snapshot (e.g. ~16s in the linked example). If, in the meantime, this peer is promoted to a voter (as in the example above: once the learner snapshot is done, `ChangeReplicas` promotes the replica to voter), the replication stall negatively impacts availability/latency (especially if this voter is made the leaseholder, as in the linked example).

Another hypothetical situation: one leader starts streaming a snapshot to a follower, then a leader change happens, and the new leader starts streaming a snapshot too. The second snapshot is queued behind the first one, which similarly prolongs the replication stall.
To get replication unstuck, it would be beneficial for the leader to learn that the `StateSnapshot` peer actually ended up caught up while the snapshot was in flight. One way to achieve that: keep sending `MsgApp` probes while the flow is in `StateSnapshot`. If it happens that the peer is caught up (either by our snapshot, or by some other snapshot, or by a delayed `MsgApp`), the leader can restore the flow to `StateReplicate` early. In the above example, the leader would restore the `MsgApp` flow 8s earlier, as soon as the leaseholder-initiated learner snapshot completes.
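A rough sketch of this option, reusing the `Progress` shape from the snippet above plus the pending snapshot index the tracker already keeps; method and parameter names are illustrative, not the upstream API:

```go
// Amended pause check, replacing IsPaused above: treat StateSnapshot like
// StateProbe, i.e. allow a single outstanding MsgApp probe instead of
// pausing the flow unconditionally.
func (pr *Progress) IsPaused() bool {
	switch pr.State {
	case StateProbe, StateSnapshot:
		return pr.MsgAppFlowPaused
	case StateReplicate:
		return pr.InflightsFull
	}
	return false
}

// OnMsgAppResp runs on the leader when the probed peer acks an index. If a
// StateSnapshot peer turns out to be caught up to the pending snapshot (via
// our snapshot, some other snapshot, or a delayed MsgApp), restore the flow
// to StateReplicate early instead of waiting for our snapshot transfer.
func (pr *Progress) OnMsgAppResp(ackedIndex, pendingSnapshotIndex uint64) {
	if pr.State == StateSnapshot && ackedIndex >= pendingSnapshotIndex {
		pr.State = StateReplicate
		pr.MsgAppFlowPaused = false
	}
}
```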
Downside of this approach: extra `MsgApp` probe traffic when in `StateSnapshot`. In most cases, leader == leaseholder and this race does not occur. We could try disabling this probing conditionally (e.g. if we know "locally" that there is already another snapshot in flight; or if we know that this replica is still a learner, so latency does not matter).

Jira issue: CRDB-43989