cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.04k stars 3.8k forks source link

kvclient: do not evict range descriptors on stale RangeKeyMismatchError #105294

Open arulajmani opened 1 year ago

arulajmani commented 1 year ago

Describe the problem

When the DistSender receives a RangeKeyMismatchErrors it simply terminates its inner loop (sendToReplicas) and evicts the range descriptor. In doing so, it updates its range cache with the range descriptors in the error; see:

https://github.com/cockroachdb/cockroach/blob/5dd802945591fce48b1e357e4a9f91ec13f82057/pkg/kv/kvclient/kvcoord/dist_sender.go#L1895-L1908

Now, if the error is returned by a replica that's lagging (say it hasn't applied a split or a merge yet), we'll end up needlessly evicting the range descriptor. Worse yet, if on the subsequent retry the sender picks this replica[1] as the first one to route to again, we'll find ourselves in an infinite loop until this replica is caught up.

Proposed solution

We know when a RangeKeyMismatchError is originating from a replica that isn't caught up yet, by virtue of the descriptor generation in the first entry of its Ranges slice. Currently, sendToReplicas short circuits without peeking into this slice:

https://github.com/cockroachdb/cockroach/blob/5dd802945591fce48b1e357e4a9f91ec13f82057/pkg/kv/kvclient/kvcoord/dist_sender.go#L2533-L2536

Maybe it should. That is, it should only terminate its iteration if the RangeKeyMismatch error is coming from a replica that has a descriptor generation newer than what's in the range cache.

cc @kvoli @andrewbaptist


[1] This can happen if the sender believes this replica has the lease or if it has no lease information and the replica is closest to it.

Jira issue: CRDB-28967

blathers-crl[bot] commented 1 year ago

Hi @andrewbaptist, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

andrewbaptist commented 1 year ago

After looking through this some more and talking to @arulajmani it appears that this code has existed for a long time. This was likely exposed because we had follower pausing on for this cluster and saw follower pausing around the time of this issue.

This is a good "first project" in the range cache code as the fix is not overly complex but requires writing a good test to expose it.