cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.03k stars 3.79k forks source link

raft: allow ReportSnapshot to take the index of the snapshot #87583

Open tbg opened 2 years ago

tbg commented 2 years ago

Is your feature request related to a problem? Please describe.

This would provide a way to fix https://github.com/cockroachdb/cockroach/issues/87581.

Describe the solution you'd like

When raft requests a snapshot, it requests one at a particular index. Yet, we can't honor that and send a snapshot at some other index. If our index is below raft's, raft may end up thinking that the follower now "has" entries that it doesn't, in fact, have. This leads to violations of the crucial property that committed indexes are durably persisted on a quorum.

Describe alternatives you've considered

We could leave as is and fix #87581 via workarounds.

Additional context

If we do this we should take the opportunity and move followers into StateReplicate in SnapshotStatus. See https://github.com/cockroachdb/cockroach/issues/84242#issuecomment-1240405737.

Jira issue: CRDB-19426

Epic CRDB-39898

blathers-crl[bot] commented 2 years ago

cc @cockroachdb/replication

pav-kv commented 2 years ago

Yet, we can't honor that and send a snapshot at some other index. If our index is below raft's, ...

In which ways can the snapshot we send diverge from the requested, and what does it depend on?

I imagine that the semantics may differ, depending on the reason why the snapshot was requested: if it's not == x, sometimes we may need min >= x, sometimes max <= x. Is it anything like that now?

tbg commented 2 years ago

The way it works is that raft will ask for a snapshot to be sent by calling into this method (into our code):

https://github.com/cockroachdb/cockroach/blob/3fe8ffc7f854b6ff9689735dbb1e6ba6e661e027/pkg/kv/kvserver/replica_raftstorage.go#L399-L428

Now, if we actually want to be able to send that snapshot at that index, we need to grab a pebble snapshot right there and then, for once the applied index soldiers on, you can't get a historical snapshot. And you can see how delegated follower snaps are dead in the water right here, because now you need to "stop the world", make a follower catch up to exactly that index but not ahead of it, and then let it grab a pebble snapshot - not going to happen.

So instead we tell raft about this applied index, but then the world moves on. We just tell the raft snap queue that raft requested a snapshot to be sent. At some point in the future, it gets around to actually doing that. It then sends a snapshot at whatever the applied index is at that point. But because it's the same process, and because the applied index doesn't regress (except possibly on process restart), what we send at will be ahead of what raft requested. Unless it's a delegated snapshot, then currently all bets are off.

Where I'm going at with this issue is that raft should stop trying to force a particular index onto us. I think the way the original authors imagined this would work just isn't reasonable in practice. When raft requests a snapshot, it shouldn't be opinionated about what index that snapshot should be at, of course the snapshot should be in advance of the truncated index of the leader, but raft does not control log truncation so it might as well let userspace handle the selection of an appropriate index. When the snapshot goes go through, raft does need to know what the actual index was, so we should tell it. That is really all it needs.

tbg commented 1 year ago

https://github.com/cockroachdb/cockroach/pull/106793 paves the way for this, since the snapshot's index is now available - the MsgAppResp is getting returned to the caller synchronously and is stepped into raft which basically does what the issue above proposes but without needing to change the public API. I think that is better since having a caller pass in an index opens the door to a new class of bugs.

The one last thing we need - I think - to have snapshots always have the desired effect (moving the follower back to StateReplicate) is https://github.com/etcd-io/raft/pull/84.