pav-kv opened 14 hours ago
To motivate this change, it would be good to have a repro for the linked example.

Alternative/complementary approach: when a replica applies a snapshot, it should send a `MsgAppResp` to the currently known leader, instead of (or in addition to) replying to `MsgSnap.From`. Today, `From` == the initiator of the snapshot, which is not necessarily the leader. In a leader/leaseholder split situation, the `MsgAppResp` sent back to the originator is a no-op, and the leader remains unaware that the snapshot has been applied.

Downside of this approach: if this `MsgAppResp` is lost/dropped, the replication flow remains stuck while the leader is streaming its snapshot. So this approach could be combined with the `MsgApp` probing described below if we want to be resilient to this.
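A minimal sketch of this on the snapshot-applying replica, with illustrative stand-in types (the `onSnapshotApplied` hook and fields here are hypothetical, not the actual raft API):

```go
package raftsketch

// Illustrative stand-ins for raft message plumbing; in the real code this
// lives inside the raft state machine's MsgSnap handling.
type Message struct {
	Type  string // e.g. "MsgAppResp"
	To    uint64
	Index uint64 // last index now durable on this replica
}

const None uint64 = 0 // no known leader

type replica struct {
	lead      uint64 // currently known leader, None if unknown
	lastIndex uint64
	outbox    []Message
}

func (r *replica) send(m Message) { r.outbox = append(r.outbox, m) }

// onSnapshotApplied acknowledges an applied snapshot. Today raft replies only
// to the snapshot's sender (MsgSnap.From); this sketch additionally acks the
// currently known leader, which can differ from the initiator in a
// leader/leaseholder split.
func (r *replica) onSnapshotApplied(from uint64) {
	resp := Message{Type: "MsgAppResp", To: from, Index: r.lastIndex}
	r.send(resp)
	if r.lead != None && r.lead != from {
		resp.To = r.lead
		r.send(resp) // let the actual leader exit StateSnapshot for this peer
	}
}
```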
Currently, `raft` does not send `MsgApp` probes to a peer if its flow is in `StateSnapshot`. This stalls replication to this peer until the outstanding snapshot has been streamed.
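For reference, the pause check in the leader's progress tracker looks roughly like this (simplified from etcd-io/raft's `Progress.IsPaused`; field names approximate):

```go
package raftsketch

// Follower-flow states in the leader's progress tracker.
type StateType int

const (
	StateProbe StateType = iota
	StateReplicate
	StateSnapshot
)

type Progress struct {
	State            StateType
	MsgAppFlowPaused bool // a probe is already outstanding
	InflightsFull    bool // the replication window is exhausted
}

// IsPaused tells the leader whether to withhold MsgApp for this peer. Note
// that StateSnapshot pauses the flow unconditionally: no probes are sent
// until the snapshot is reported applied (or failed).
func (pr *Progress) IsPaused() bool {
	switch pr.State {
	case StateProbe:
		return pr.MsgAppFlowPaused
	case StateReplicate:
		return pr.InflightsFull
	case StateSnapshot:
		return true // replication to this peer is fully stalled
	}
	return false
}
```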
In CRDB, snapshots can be initiated by the `raft` leader, or by a leaseholder when it adds a learner replica. In leader/leaseholder split situations, a leaseholder-initiated snapshot can race with the raft-initiated snapshot, and the raft snapshot can be queued behind the learner snapshot [example]. This doubles the replication "stall" duration to 2x the time it takes to transfer a snapshot (e.g. ~16s in the linked example). If, in the meantime, this peer is promoted to a voter (as in the example above: once the learner snapshot is done, `ChangeReplicas` promotes the replica to voter), the replication stall negatively impacts availability/latency (especially if this voter is made the leaseholder, as in the linked example).

Another hypothetical situation: one leader starts streaming a snapshot to a follower, then a leader change happens, and the new leader starts streaming a snapshot too. The second snapshot is queued behind the first one, which similarly prolongs the replication stall.
To get replication unstuck, it would be beneficial for the leader to learn that the `StateSnapshot` peer actually ended up caught up while the snapshot was in flight. One way to achieve that: keep sending `MsgApp` probes while the flow is in `StateSnapshot`. If it happens that the peer is caught up (either by our snapshot, or by some other snapshot, or by a delayed `MsgApp`), the leader can restore the flow to `StateReplicate` early. In the above example, the leader would restore the `MsgApp` flow 8s earlier, as soon as the leaseholder-initiated learner snapshot completes.
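A rough sketch of this option, reusing the `Progress` shape from the snippet above plus the pending snapshot index the tracker already keeps; method and parameter names are illustrative, not the upstream API:

```go
// Amended pause check, replacing IsPaused above: treat StateSnapshot like
// StateProbe, i.e. allow a single outstanding MsgApp probe instead of
// pausing the flow unconditionally.
func (pr *Progress) IsPaused() bool {
	switch pr.State {
	case StateProbe, StateSnapshot:
		return pr.MsgAppFlowPaused
	case StateReplicate:
		return pr.InflightsFull
	}
	return false
}

// OnMsgAppResp runs on the leader when the probed peer acks an index. If a
// StateSnapshot peer turns out to be caught up to the pending snapshot (via
// our snapshot, some other snapshot, or a delayed MsgApp), restore the flow
// to StateReplicate early instead of waiting for our snapshot transfer.
func (pr *Progress) OnMsgAppResp(ackedIndex, pendingSnapshotIndex uint64) {
	if pr.State == StateSnapshot && ackedIndex >= pendingSnapshotIndex {
		pr.State = StateReplicate
		pr.MsgAppFlowPaused = false
	}
}
```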
Downside of this approach: extra `MsgApp` probe traffic when in `StateSnapshot`. In most cases, leader == leaseholder and this race does not occur. We could try disabling this probing conditionally (e.g. if we know "locally" that there is already another snapshot in flight; or if we know that this replica is still a learner, so latency does not matter).

Jira issue: CRDB-43989