Accept any snapshot that allows replication

erikgrinaker commented 11 months ago

This PR is adopted from #84.

A leader will not take into account snapshots reported by a follower unless they match or exceed the tracked PendingSnapshot index (which is the leader's last indexat the time of requesting the snapshot). This is too inflexible: the leader should take into account any snapshot that reconnects the follower to its log. This PR makes that change.

In doing so, it addresses long-standing problems that we've encountered in CockroachDB. Unless you create the snapshot immediately and locally when raft emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at the requested index. It is possible to get one above the requested index which raft always accepted, but CockroachDB delegates snapshots to followers who might be behind on applying the log, and it is awkward to have to wait for log application to send the snapshot just to satisfy an overly strict condition in raft. Additionally, CockroachDB also sends snapshots preemptively when adding a new replica since there are qualitative differences between an initial snapshot and one needed to reconnect to the log and one does not want to wait for raft to round-trip to the follower to realize that a snapshot is needed. In this case, the sent snapshot is commonly behind the PendingSnapshot since the leader transitions the follower into StateProbe when a snapshot is already in flight.

Touches https://github.com/cockroachdb/cockroach/issues/84242. Touches https://github.com/cockroachdb/cockroach/issues/87553. Touches https://github.com/cockroachdb/cockroach/pull/87554. Touches https://github.com/cockroachdb/cockroach/issues/97971. Touches https://github.com/cockroachdb/cockroach/issues/114349. See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker grinaker@cockroachlabs.com Signed-off-by: Tobias Grieger tobias.b.grieger@gmail.com

erikgrinaker commented 11 months ago

cc @pavelkalinnikov @ahrtr

ahrtr commented 11 months ago

CockroachDB delegates snapshots to followers who might be behind on applying the log

I am curious about the design. In raft, data only flows outwards from the leader to other servers. But it seems that data could also flow from a follower to another follower in CockroachDB?

erikgrinaker commented 11 months ago

I am curious about the design. In raft, data only flows outwards from the leader to other servers. But it seems that data could also flow from a follower to another follower in CockroachDB?

That's right, snapshots can be sent from any replica via delegated snapshots. For example, cross-region traffic is often much more expensive than inter-region traffic, so in multi-region clusters we prefer sending snapshots from other replicas in the same region where possible. This commonly happens when moving a replica from one CRDB node to a different node in the same region due to rebalancing.

erikgrinaker commented 10 months ago

I think this should be ready for another look now @pav-kv @ahrtr.

ahrtr commented 10 months ago

That's right, snapshots can be sent from any replica via delegated snapshots. For example, cross-region traffic is often much more expensive than inter-region traffic, so in multi-region clusters we prefer sending snapshots from other replicas in the same region where possible.

thx for the info. I understood the motivation now.

But the implementation in this PR is a little strange to me. The behaviour of delegating snapshot is completely outside of raft's control, instead it's just application's behaviour (CockroachDB in this case), but this PR modifies raft to adapt to the behaviour which is outside of raft.

Is it possible to reset the follower's state to StateProbe in such case, something like https://github.com/etcd-io/raft/pull/123, so that we can remove the option ResumeReplicateBelowPendingSnapshot?

One more not-directly-related comment, once the flag ResumeReplicateBelowPendingSnapshot is enabled, it seems that CockroachDB also needs to ignore the snapshot driven by raft, otherwise the application will send two snapshots in a short period? Won't it complicate the application's design?

pav-kv commented 10 months ago

@ahrtr

But the implementation in this PR is a little strange to me. The behaviour of delegating snapshot is completely outside of raft's control, instead it's just application's behaviour (CockroachDB in this case), but this PR modifies raft to adapt to the behaviour which is outside of raft.

This behaviour will ~~not~~ impact users who don't send snapshots outside of raft. Consider the caveat:

Raft code internally assumes that there is always at most one snapshot in flight. This only seems to be true. Even if raft is used the standard way, I'm moderately sure this property can be violated. Raft messages can be duplicated and reordered, and the process can be restarted. Consider the following scenario (which can be driven entirely by raft, without delegated snapshots):

Send a bunch of append messages at indices 100-150 to a follower.
Compact the log at index 150 (for whatever reason, e.g. realize that most followers are up-to-date).
Restart. Realize that the follower is behind, send snapshot at index 150.
Restart. Handle a bunch of appends at indices 150-200.
Realize that the follower is behind, send snapshot at index 200.
- // NB: now we have appends 100-200, snapshot@150 and snapshot@200 in flight.
The snapshot@150 from step (3) finally arrives at the follower, follower applies it and acks.
The leader whose pendingSnapshot is 200, receives an ack at index 150.
- It would be nice to switch to Replicate here. Moreover, indices 150-200 are known to be in flight. So we can even resume optimistic replication from index 200 rather than 150, under some conditions.

With this observation in mind, the scenarios similar to what this PR fixes, can actually happen for any user. This PR is therefore a strict improvement.

The question of what's raft- and what's application- specific is tricky. In my opinion, the Probe/Replicate/Snapshot state behaviours are not inherent to raft algorithm itself. These are "flow control" mechanisms of this specific raft implementation, and they don't impact "correctness" of raft. As such, these mechanisms should aim to reflect the needs of the package users (and best be flexible to fit any application).

Is it possible to reset the follower's state to StateProbe in such case, something like #123, so that we can remove the option ResumeReplicateBelowPendingSnapshot?

The only difference between this PR and #123 is the state to which we transition into. This PR translates to StateReplicate, while #123 transitions to StateProbe. The ResumeReplicateBelowPendingSnapshot flag is seemingly orthogonal to the choice of state here. I half-agree that the flag is not needed, if we make a clean and compatible code change.

Could you explain why choose transition to StateProbe though?

I'll explain why transitioning to StateReplicate makes more sense to me. Some reasoning can be found in an earlier comment. Each of StateReplicate, StateProbe, StateSnapshot should have an "invariant". While the invariant is true, the flow stays in this state. When transitioning between states, we should a) make sure the new state's invariant is true, b) if multiple states are eligible, pick the "best" one.

The invariant of the StateReplicate is that, according to the latest information (received via messages), the follower's log is "compatible" with the leader's log, i.e. Next >= firstIndex(); and for each "optimistic" index (from Match+1 to Next-1) there is an in-flight append message that isn't rejected.
The invariant of the StateProbe is that the follower's log is not compatible with the leader's log, or this compatibility has not been established yet. While in StateProbe, we don't optimistically chain appends.

From these definitions, it makes sense to transition to StateReplicate as soon as its invariant becomes true. That's what this PR does. #123 would unnecessarily hop to StateProbe, which we know will then transition to StateReplicate almost immediately.

As a side note, I think we need to clean up the flow state machine and make these invariants verbose.

pav-kv commented 10 months ago

For easing the mental model, think about a snapshot message as a MsgApp with all entries in [0, snapshotIndex]. From this standpoint, any set of in-flight MsgApp and snapshot messages is equivalent to just a set of in-flight MsgApp messages. Snapshots, just like all other messages, can be dropped, duplicated, reordered and delayed.

From this standpoint, we should treat snapshot message sends and replies just the same way we treat MsgApp. We should not assume there is only one such message in flight (except as best effort for avoiding sending too much), and we should gracefully handle acks at indices below the index of the last sent message. This PR is a step in that direction.

erikgrinaker commented 10 months ago

it seems that CockroachDB also needs to ignore the snapshot driven by raft, otherwise the application will send two snapshots in a short period? Won't it complicate the application's design?

CRDB drops the MsgSnap sent from Raft, and handles the snapshot sending itself.

https://github.com/cockroachdb/cockroach/blob/2f7bf071af146e1e238f985884b99b2a6636115a/pkg/kv/kvserver/replica_raft.go#L1799-L1803

The delegated snapshot is initiated when the leader sees the MsgSnap sent from Raft, where the leader can delegate the actual snapshot sending to a follower. However, the follower can't produce a snapshot at any arbitrary index, it will be produced at whatever applied index the follower happens to be at when it receives the snapshot delegation request. This index may lag the leader's PendingSnapshot index. Of course, we could pass the PendingSnapshot index to the follower, and have the follower either wait to catch up or reject it, but this seems unnecessary.

There are also other cases where this can happen -- for example:

We eagerly send snapshots when adding new replicas to a Raft group, which I think can race with Raft processing (although we addressed some of the more problematic race conditions here in cockroachdb/cockroach#106793).
On leader change, if the old leader was already sending a snapshot to a follower, the new leader will reject it because its own PendingSnapshot index is higher than the old leader's. I'll have to look at the details, but I think this is what happened in https://github.com/cockroachdb/cockroach/issues/114349.

Even if raft is used the standard way, I'm moderately sure this property can be violated.

Yeah, I think this can probably happen across leader changes too -- see above. I can write up a test case to confirm.

it makes sense to transition to StateReplicate as soon as its invariant becomes true. That's what this PR does. https://github.com/etcd-io/raft/pull/123 would unnecessarily hop to StateProbe, which we know will then transition to StateReplicate almost immediately.

I agree with this. We're processing an MsgAppResp here, so we already know the follower's state, and don't have to wait for another MsgApp probe -- we can step straight to StateReplicate.

remove the option ResumeReplicateBelowPendingSnapshot

I'm not particularly attached to this option. We could merge it as-is without the option, and consider a revert or option if we discover any problems with it. There shouldn't be any correctness or availability implications, just possibly additional work in rare cases.

erikgrinaker commented 10 months ago

Yeah, I think this can probably happen across leader changes too -- see above. I can write up a test case to confirm.

This isn't the case, because the follower will reject an MsgSnap from a past term. I think it can happen with CRDB though, because we send and apply snapshots out-of-band.

I think this is what happened in https://github.com/cockroachdb/cockroach/issues/114349

No, I think that was simply an eager initial snapshot that was sent before the leader transitioned to StateSnapshot.

pav-kv commented 10 months ago

Yeah, then the scenario in my comment can't happen either.

That's lucky, but unfortunate in some sense too. It doesn't matter which leader term sent a snapshot/append: as long as a snapshot/append "happens after" the follower's state and is recent, it should be ok to accept. E.g. if during a term change an append from previous term is slightly delayed, it still likely adheres to append-only properties etc, so with some safety checks it can be applied, and retries avoided. This gives some room for optimizations, and what enables it is moving safety checks down the stack and relaxing dependency on the "always reject non-matching term" behaviour.

Anyway, my point is still that this PR (and, in broader sense, formalizing and cleaning up the Probe/Replicate/etc states/transitions) benefits all users and shouldn't be considered CRDB specific.

ahrtr commented 10 months ago

The main concern is about adding public user-facing flag. Usually when we add a public flag for new features, e.g. AsyncStorageWrites, it brings new feature/benefit but with risk of introducing regression. It makes perfect sense to add a new public flag for such case, and it's super clear to users as well. "clear" means it's clear when/why to enable or disable it.

But this flag ResumeReplicateBelowPendingSnapshot, is it for a new feature? It seems that it's just safe-guard flag for a fix to a bug? It's not clear for users when/why ( and even how ) to enable/disable it.

I agree that the comment https://github.com/etcd-io/raft/pull/110#discussion_r1405420442 makes more sense. I like the idea of changing state based on invariants. If we make some enhance and add a flag for it, then it makes more sense.

From implementation perspective, I agree this PR is safe, because the behaviour will keep unchanged as long as users do not enable ResumeReplicateBelowPendingSnapshot. We could let it merged for now if you are planning to eventually remove it and resolve https://github.com/etcd-io/raft/pull/110#discussion_r1405420442.

Raft code internally assumes that there is always at most one snapshot in flight.

The concerning isn't about guaranteeing at most one snapshot in flight, it's about waste of bandwidth, because a snapshot may be huge in size, e.g a couple of GBs. We should try to avoid sending snapshot as much as possible. [If CockRoachDB enable ResumeReplicateBelowPendingSnapshot, and doesn't take any special action to drop the MsgSnap sent from Raft, it seems that it will almost always send two snapshots in a short period for cases handling a slow follower which lag too much?]

Could you explain why choose transition to StateProbe though?

Based on current raft protocol/implementation, when a follower is in StateSnapshot, when a leader received a not-out-of- date MsgAppResp from the follower, the condition pr.Match >= pr.PendingSnapshot should be always true. Otherwise, there must be something wrong for most users [For CRDB, it may be expected behaviour though due to the delegating snapshot feature], so we'd better to probe the follower's index again? This is just my immature thought.

erikgrinaker commented 10 months ago

The main concern is about adding public user-facing flag.

I've removed the option.

The concerning isn't about guaranteeing at most one snapshot in flight, it's about waste of bandwidth, because a snapshot may be huge in size, e.g a couple of GBs.

Yes, that's also the motivation for us. In CRDB, we have already sent a snapshot to the follower (below PendingSnapshot), but the leader rejects this snapshot and insists we send a new one above PendingSnapshot. We would like to avoid this.

Based on current raft protocol/implementation, when a follower is in StateSnapshot, when a leader received a not-out-of- date MsgAppResp from the follower, the condition pr.Match >= pr.PendingSnapshot should be always true. Otherwise, there must be something wrong for most users [For CRDB, it may be expected behaviour though due to the delegating snapshot feature], so we'd better to probe the follower's index again?

Well, from etcd/raft's point of view, even if we did receive an out-of-date MsgAppResp that was below PendingSnapshot but above the leader's first log index, then the follower must have advanced in the meanwhile (or the leader's log storage was temporarily unavailable). If the follower has advanced and recovered from needing a snapshot, it seems fine to move it back to StateReplicate.

With StateProbe, we'll eat the cost of an MsgApp roundtrip only to discover that Next: PendingSnapshot + 1 was incorrect, then resume replication at the correct index. Is the motivation for using StateProbe rather than StateReplicate to avoid replicating entries below PendingSnapshot that the follower may already have received from the snapshot in the case of an MsgAppResp race? Are there any other downsides to using StateReplicate?

To be clear, I'd be fine with StateProbe here -- it's also what we use with e.g. ReportSnapshot(). Just curious if there is a compelling reason to prefer it over StateReplicate.

erikgrinaker commented 10 months ago

Is the motivation for using StateProbe rather than StateReplicate to avoid replicating entries below PendingSnapshot that the follower may already have received from the snapshot in the case of an MsgAppResp race?

Actually, that can't really happen, because the snapshot does not carry the actual log entries, and may have truncated the log. It's possible that we'll replicate log entries that will be clobbered by the later snapshot, thus wasting work, but that can also happen in the StateProbe case.

ahrtr commented 10 months ago

Overall looks good, thx.

Note that I am planning to release raft 3.6.0, please see https://github.com/etcd-io/raft/issues/89. Please feel free to add comment under that issue on whether you are going to resolve https://github.com/etcd-io/raft/pull/110#discussion_r1405420442 in 3.6.0 or next release e.g. 4.0.

Just curious if there is a compelling reason to prefer it over StateReplicate.

No for now.

erikgrinaker commented 10 months ago

I think we'll have to do the PendingSnapshot removal and other improvements in 4.0, since they would be breaking changes to the public APIs. Wrote up a comment, and I'll write up a quick issue shortly.

erikgrinaker commented 10 months ago

Moved follow-up work to https://github.com/etcd-io/raft/issues/124

etcd-io / raft

Accept any snapshot that allows replication #110