cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
29.96k stars 3.79k forks source link

kvcoord: handle follower reads with DistSender circuit breakers #119923

Open erikgrinaker opened 7 months ago

erikgrinaker commented 7 months ago

The DistSender circuit breakers from https://github.com/cockroachdb/cockroach/pull/118943 should possibly handle follower reads better. The two main cases are:

There are two interactions to keep in mind:

Potential follower reads use RoutingPolicy_NEAREST (this includes meta range lookups). We should consider adding specialized handling for these. RoutingPolicy_LEASEHOLDER may also be served as follower reads if the timestamp falls below the replica's closed timestamp, but this typically implies that the replica is in fact receiving closed timestamp updates from the leader -- it will typically take 6 seconds to trip the breaker, and the replica must have been failing at the start of this interval, implying that 6 second old follower reads will fail.

It may be reasonable to assume that the follower read timestamp is in the recent past, and that a faulty replica will remain faulty for longer than the follower read lag. Given that, we can likely omit successful follower reads when considering recent error runs, and trip the breaker regardless of successful follower reads. This will incur a latency penalty for follower reads that now have to hit other replicas, but once the follower read lag moves above the closed timestamp (typically a few seconds later) it will incur that penalty anyway.

Alternatively, we can differentiate handling of follower reads and other requests, and e.g. use separate breakers and probes for them, but it isn't clear that this is worthwhile.

Jira issue: CRDB-36391

Epic CRDB-39897

erikgrinaker commented 7 months ago

cc @nvanbenschoten in case you have opinions on this.

erikgrinaker commented 6 months ago

We discussed this recently, and concluded that as a first step we should simply add a setting (enabled by default) which ignores RoutingPolicy_NEAREST entirely in the circuit breaker -- they will not be considered at all for replica activity tracking or error/stall detection, nor will they be rejected when the breaker is tripped.

blathers-crl[bot] commented 6 months ago

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this GA-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

erikgrinaker commented 6 months ago

Currently, the circuit breakers don't do anything in particular about follower reads. There is an initial PR in #120198 to ignore them during tracking, but it isn't clear that this is the right approach.

We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them at lease for now), but I'm tentatively marking this as a GA blocker so that the KV team can make a determination.

arulajmani commented 5 months ago

We may not need to or be able to do anything here for 24.1 (my inclination is to not have any special handling for them at lease for now)

Spoke to @andrewbaptist about this. We're fine not having any special case handling for them in 24.1. Removing the GA-blocker label.