cockroachdb / cockroach


kvflowcontrol,admission: use flow control during raft log catchup post node-restart #98710

Open irfansharif opened 1 year ago

irfansharif commented 1 year ago

Is your feature request related to a problem? Please describe.

We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in https://github.com/cockroachdb/cockroach/issues/96521 + https://github.com/cockroachdb/cockroach/issues/95159 -- those issues amounted to 23.1 changes that avoid immediately transferring leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But it's still possible for catchup appends to invert the LSM, which affects non-leaseholder (follower) traffic.

This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends and prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly restarted nodes without leaseholder impact, and also avoid latency impact on follower traffic. https://github.com/cockroachdb/cockroach/issues/80607 is slightly related -- we could apply flow tokens to raft snapshots too, to cover the general case of "catchup write traffic". A rough sketch of the pacing idea is included below.
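
To make the pacing concrete, here is a minimal Go sketch of the shape of the mechanism. The names (`catchupTokenGate`, `admit`, `release`) are hypothetical and not the actual kvflowcontrol interfaces; the point is only that catchup appends deduct flow tokens before being sent, and tokens are returned once the follower's store admits the work, so the append rate is bounded by the follower's admission rate instead of being dumped on it all at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// catchupTokenGate models a per-store pool of flow tokens, measured in bytes.
// Hypothetical stand-in for the kvflowcontrol token pool, for illustration only.
type catchupTokenGate struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
}

func newCatchupTokenGate(capacityBytes int64) *catchupTokenGate {
	g := &catchupTokenGate{available: capacityBytes}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// admit blocks until `bytes` worth of flow tokens are available. In the real
// mechanism this is where a catchup MsgApp would wait before being sent to a
// lagging follower.
func (g *catchupTokenGate) admit(bytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.available < bytes {
		g.cond.Wait()
	}
	g.available -= bytes
}

// release returns tokens once below-raft admission control on the follower
// store reports the entries as admitted.
func (g *catchupTokenGate) release(bytes int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.available += bytes
	g.cond.Broadcast()
}

func main() {
	gate := newCatchupTokenGate(16 << 20) // 16 MiB of elastic flow tokens.

	// Catch a follower up on 64 MiB of raft log in 4 MiB batches. Each batch
	// waits for tokens, so appends are paced by follower-side admission rather
	// than sent in one burst (the burst is what inverts the LSM today).
	const batchBytes = 4 << 20
	for i := 0; i < 16; i++ {
		gate.admit(batchBytes)
		fmt.Printf("sent catchup batch %d\n", i)
		go func() {
			// Simulate the follower admitting the batch after some delay.
			time.Sleep(50 * time.Millisecond)
			gate.release(batchBytes)
		}()
	}
}
```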

Describe the solution you'd like

https://github.com/cockroachdb/cockroach/blob/b84f10c0724f75eaf1ab00ebbcf60cd0d8607612/pkg/kv/kvserver/kvflowcontrol/doc.go#L310-L328

https://github.com/cockroachdb/cockroach/blob/b84f10c0724f75eaf1ab00ebbcf60cd0d8607612/pkg/kv/kvserver/kvflowcontrol/doc.go#L353-L375

This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.

Jira issue: CRDB-25460

irfansharif commented 1 year ago

https://github.com/cockroachdb/cockroach/issues/101321 describes how follower pausing on a freshly restarted node (due to L0 overload from raft log catchup), combined with a rolling restart of another node, could lead to range unavailability. If we used flow control during raft log catchup, follower pausing would not kick in on the first node. We'd still want to surface in our health check API that a node is actively being caught up using flow tokens, or that follower pausing is kicking in somewhere, so operators don't restart another node while ranges are under-replicated, lest they lose quorum. (If a follower is paused, or a node is being caught up via flow tokens, some ranges are "under-replicated" in the sense that not all replicas have all log entries.)
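
As an illustration of that health-check signal, here is a minimal Go sketch. The per-range fields (`CatchingUpViaFlowTokens`, `PausedFollowers`) and the helper are hypothetical, not an existing CockroachDB API; the idea is just that a node shouldn't be reported as safe to restart next while any range it participates in is still being caught up via flow tokens or has a paused follower.

```go
package main

import "fmt"

// rangeStatus is a hypothetical per-range summary used for the readiness signal.
type rangeStatus struct {
	RangeID                 int64
	CatchingUpViaFlowTokens bool // follower still receiving paced catchup appends
	PausedFollowers         int  // followers paused due to L0 overload
}

// safeToRestartAnotherNode returns false, along with the offending ranges, if
// any range is effectively under-replicated in the sense described above.
func safeToRestartAnotherNode(ranges []rangeStatus) (bool, []int64) {
	var unsafe []int64
	for _, r := range ranges {
		if r.CatchingUpViaFlowTokens || r.PausedFollowers > 0 {
			unsafe = append(unsafe, r.RangeID)
		}
	}
	return len(unsafe) == 0, unsafe
}

func main() {
	ranges := []rangeStatus{
		{RangeID: 1},
		{RangeID: 2, CatchingUpViaFlowTokens: true},
		{RangeID: 3, PausedFollowers: 1},
	}
	ok, unsafe := safeToRestartAnotherNode(ranges)
	fmt.Printf("safe to restart another node: %t (ranges still catching up: %v)\n", ok, unsafe)
}
```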

dshjoshi commented 1 month ago

@nicktrav @sumeerbhola Is this taken care of by RAC v2? If so, can we close this out?

nicktrav commented 1 month ago

Let's keep this open until that work completes and we can re-evaluate.