Closed erikgrinaker closed 6 months ago
Redirecting KV requests through even just a single follower replica ("one-hop proxying") allows CRDB to make a reasonably easy-to-understand availability claim. Pending other availability work, ranges should always remain available if a majority of their replicas are connected. This proxying work ensures that load originating from a gateway will remain available as long as it can communicate with any replica in that majority. Since any two majorities overlap, this project will allow us to say that regardless of how a network is (partially) partitioned between replicas, a workload is guaranteed to remain available if it is routed to a gateway that can communicate with at least a majority of the replicas in the range that it is reading from/writing to. Of course, it may be more available than this in many cases, but this would be a useful worst-case guarantee to make. It starts to look like the availability requirement for a leaderless system.
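The "any two majorities overlap" step is just the pigeonhole principle: two majorities of an n-replica range together contain more than n members, so they must share at least one replica. A minimal sketch (`majoritySize` and `mustOverlap` are illustrative helpers, not CRDB code):

```go
package main

import "fmt"

// majoritySize returns the quorum size for a range with n replicas.
func majoritySize(n int) int { return n/2 + 1 }

// mustOverlap demonstrates why any two majorities intersect: their combined
// size exceeds the replica count, so by pigeonhole they share a replica.
func mustOverlap(n int) bool { return 2*majoritySize(n) > n }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("n=%d quorum=%d overlap guaranteed=%v\n",
			n, majoritySize(n), mustOverlap(n))
	}
}
```

This is why a gateway that can reach any majority is guaranteed to reach at least one replica of the majority that is keeping the range available.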
During partial network partitions, for example, a SQL gateway may be unable to reach a leaseholder directly even though other nodes can reach it just fine (see internal document). In these cases, requests currently stall in indefinite retry loops (although we plan to implement circuit breakers in #93501).
We stall because when the DistSender is unable to reach the current leaseholder, it tries one of the other replicas in the range, but these will simply return a `NotLeaseHolderError` pointing the DistSender right back to the unreachable leaseholder. Instead, when the DistSender has already tried the current leaseholder, it could signal this in the request, and the follower could try to process the request, e.g. via:

1. **Proxy the request to the leaseholder** and return the response to the client. The follower has already received a copy of the request, and the DistSender is already prepared to receive a full response from it, so we just need to try to forward it and fall back to a `NotLeaseHolderError` if we can't. In most cases range replicas are spread across independent failure domains, so it's likely that a follower will be able to do this.

2. **Attempt a follower read.** If the read timestamp is below the closed timestamp and the follower has data at this timestamp, it can serve the request as a consistent follower read. It could also wait for the closed timestamp to advance to the read timestamp if it can't otherwise reach the leaseholder.
This can be further optimized by e.g. sending the request to all followers in parallel, or keeping statistics about which followers are able to proxy to the leaseholder and how efficiently they can do so.
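The statistics-based variant could be as simple as tracking a per-follower moving average of proxy success and preferring the follower with the best recent record. A hypothetical sketch (the `proxyStats` type, the smoothing factor, and the neutral prior are all assumptions, not CRDB code):

```go
package main

import "fmt"

// proxyStats tracks, per follower, an exponentially weighted moving average
// of whether proxying through that follower succeeded, so the DistSender
// could prefer followers that have recently been able to reach the
// leaseholder.
type proxyStats struct {
	score map[string]float64 // follower -> EWMA of success (1.0) / failure (0.0)
}

const alpha = 0.3 // smoothing factor (assumed)

func newProxyStats() *proxyStats {
	return &proxyStats{score: map[string]float64{}}
}

// record folds one proxy attempt's outcome into the follower's score.
func (s *proxyStats) record(follower string, ok bool) {
	v := 0.0
	if ok {
		v = 1.0
	}
	old, seen := s.score[follower]
	if !seen {
		old = 0.5 // neutral prior for followers we haven't tried yet
	}
	s.score[follower] = alpha*v + (1-alpha)*old
}

// best returns the follower with the highest success score.
func (s *proxyStats) best(followers []string) string {
	bestF, bestV := "", -1.0
	for _, f := range followers {
		v, seen := s.score[f]
		if !seen {
			v = 0.5
		}
		if v > bestV {
			bestF, bestV = f, v
		}
	}
	return bestF
}

func main() {
	s := newProxyStats()
	s.record("n2", true)
	s.record("n3", false)
	fmt.Println(s.best([]string{"n2", "n3"})) // n2
}
```

Sending to all followers in parallel trades extra fan-out load for lower tail latency; the statistics approach avoids that cost once the partition topology is learned.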
Jira issue: CRDB-22370
Epic: CRDB-25200