irfansharif opened 1 year ago
https://github.com/cockroachdb/cockroach/issues/101321 describes how follower pausing after a restart (due to L0 overload from raft log catchup), combined with a rolling restart of another node, could lead to range unavailability. If we used flow control during raft log catchup, follower pausing would not kick in on the first node. We'd still want to surface in our healthcheck API the fact that a node is actively being caught up using flow tokens, or that follower pausing is kicking in somewhere, so operators don't restart another node while under-replicated, lest they lose quorum. (If a follower is paused, or a node is being caught up via flow tokens, some ranges are "under-replicated" in that not all replicas have all log entries.)
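A minimal sketch of the kind of restart-safety signal described above; all names here are hypothetical for illustration, not the actual healthcheck API:

```go
// Hypothetical sketch (not the real CockroachDB healthcheck API): a
// node-level signal that reports whether raft log catchup is still in
// progress, so orchestration can hold off on restarting the next node.
package main

import "fmt"

// CatchupStatus summarizes per-node catchup state. Field names are
// illustrative only.
type CatchupStatus struct {
	RangesCatchingUp int // ranges still replaying raft log via flow tokens
	PausedFollowers  int // ranges with a paused follower somewhere
}

// SafeToRestartAnother returns false while any range is effectively
// under-replicated: either a follower is paused, or a node is still
// being caught up using flow tokens.
func SafeToRestartAnother(s CatchupStatus) bool {
	return s.RangesCatchingUp == 0 && s.PausedFollowers == 0
}

func main() {
	s := CatchupStatus{RangesCatchingUp: 12, PausedFollowers: 0}
	fmt.Printf("safe to restart another node: %t\n", SafeToRestartAnother(s))
}
```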
@nicktrav @sumeerbhola Is this taken care of by RAC v2? If so, can we close this out?
Let's keep this open until that work completes and we can re-evaluate.
Is your feature request related to a problem? Please describe.
We've seen in write-heavy workloads that node restarts can result in LSM inversion due to a rapid onset of raft log catchup appends. This problem was touched on recently in https://github.com/cockroachdb/cockroach/issues/96521 + https://github.com/cockroachdb/cockroach/issues/95159 -- those issues amounted to 23.1 changes that avoid immediately transferring leases to newly-restarted nodes until their LSM is healthier, in order to stave off latency impact for leaseholder traffic. But it's still possible to invert the LSM, which affects non-leaseholder (follower) traffic.
This issue proposes using the general flow control mechanism we're introducing in #95563 to pace the rate of catchup raft log appends and prevent LSM inversion entirely. With such a mechanism, we'd be able to transfer leases immediately to newly restarted nodes without leaseholder impact, and also avoid latency impact on follower traffic. https://github.com/cockroachdb/cockroach/issues/80607 is slightly related -- we could apply flow tokens to raft snapshots too, to cover the general case of "catchup write traffic".
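As a rough illustration of the pacing idea, here's a minimal sketch that throttles catchup appends with a byte-sized token bucket. It uses golang.org/x/time/rate as a stand-in for flow tokens, and all names are hypothetical rather than the actual kvflowcontrol API:

```go
// Hypothetical sketch of pacing catchup raft log appends with flow
// tokens, so a recovering follower's LSM isn't inverted by a burst of
// appends. The real mechanism lives in pkg/kv/kvserver/kvflowcontrol.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// catchupPacer throttles the bytes/sec of catchup appends sent to a
// recovering follower. Tokens are denominated in raft entry bytes.
type catchupPacer struct {
	limiter *rate.Limiter
}

func newCatchupPacer(bytesPerSec int) *catchupPacer {
	return &catchupPacer{
		limiter: rate.NewLimiter(rate.Limit(bytesPerSec), bytesPerSec),
	}
}

// admit blocks until enough flow tokens are available for an entry of
// the given size, returning early if the context is canceled.
func (p *catchupPacer) admit(ctx context.Context, entryBytes int) error {
	return p.limiter.WaitN(ctx, entryBytes)
}

func main() {
	p := newCatchupPacer(8 << 20) // pace catchup appends at 8 MiB/s
	for i := 0; i < 3; i++ {
		if err := p.admit(context.Background(), 1<<20); err != nil {
			fmt.Println("admit error:", err)
			return
		}
		fmt.Printf("sent catchup batch %d\n", i)
	}
}
```

The same admission point could in principle cover raft snapshots as well, which is the "general case of catchup write traffic" mentioned above.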
Describe the solution you'd like
https://github.com/cockroachdb/cockroach/blob/b84f10c0724f75eaf1ab00ebbcf60cd0d8607612/pkg/kv/kvserver/kvflowcontrol/doc.go#L310-L328
https://github.com/cockroachdb/cockroach/blob/b84f10c0724f75eaf1ab00ebbcf60cd0d8607612/pkg/kv/kvserver/kvflowcontrol/doc.go#L353-L375
This is also touched on here: https://reviewable.io/reviews/cockroachdb/cockroach/96642#-NOF0fl20XuBrK4Ywlzp.
Jira issue: CRDB-25460