kvserver: very slow replicaGC for subsumed replicas #102000

Open tbg opened 1 year ago

tbg commented 1 year ago

Describe the problem

From a 23.1 test cluster. We got a node into a state where it had >3k replicas that had all been merged away, but the node hadn't realized it (the node was down for a bit, which is likely related).

A replica that was subsumed needs to wait for its left neighbor to either disappear or execute the merge. I think we had a long cascade of replicas all waiting for their respective left neighbors:

r1 < r2 < r3 < ... < r3000

It was difficult to trigger replicaGC for this case, because you need to do it in the right order. Additionally, we can't use the SQL builtin, because these ranges by definition no longer exist in ranges_no_leases (they were merged away).

These ranges all have their circuit breakers tripped, which makes them a confounder for using that metric usefully. They could also block requests that erroneously get routed to them based on stale DistSender caches, though that should be a lesser concern: once the circuit breakers engage, they prevent these requests from hanging, so the caches get updated quickly.

To Reproduce

Unclear - the test cluster was running TPC-E on a 2h timer, so there was a lot of split/scatter activity going on, and the cluster was in a pretty bad state for several days.

Desired behavior

replicaGC is snappier. Ideally we'd break the dependency on the left neighbor, though that is likely tricky since merge correctness hinges on it. But we could say that replicaGC for range N, if it detects a merge, queues the adjacent local replica to its left for replicaGC; this then knocks out one replica with each replicaGC invocation, so the knot should loosen much faster. A sketch of this idea follows below.
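
To sketch what the left-neighbor enqueue could look like, here is a minimal, self-contained toy model (the replica, store, and tryGC names are invented for illustration and are not the real kvserver types): a GC attempt that is blocked on an unmerged left neighbor enqueues that neighbor and retries itself afterwards.

```go
package main

import "fmt"

// Toy stand-ins for the kvserver concepts; none of these are the real types.
type replica struct {
	id       int
	subsumed bool // merged away, waiting for replicaGC
}

type store struct {
	// replicas in key order; index i-1 holds the left neighbor of index i.
	replicas []*replica
}

// tryGC GC's replica i if that is safe, i.e. its left neighbor is gone or is
// not itself stuck waiting on a merge. On failure it returns the index of the
// blocking left neighbor so the caller can enqueue it.
func (s *store) tryGC(i int) (blockedOn int, ok bool) {
	if i > 0 && s.replicas[i-1].subsumed {
		return i - 1, false // "cannot safely GC range yet"
	}
	s.replicas[i].subsumed = false // stand-in for actually GC'ing the replica
	return -1, true
}

func main() {
	const n = 3000
	s := &store{}
	for i := 0; i < n; i++ {
		s.replicas = append(s.replicas, &replica{id: i, subsumed: true})
	}

	// Proposed behavior: a GC attempt that is blocked on the left neighbor
	// enqueues that neighbor (and retries itself afterwards), so a single
	// external trigger unwinds the whole chain in O(n) attempts.
	attempts := 0
	queue := []int{n - 1} // only the rightmost replica is queued externally
	for len(queue) > 0 {
		i := queue[len(queue)-1]
		queue = queue[:len(queue)-1]
		attempts++
		if blocked, ok := s.tryGC(i); !ok {
			queue = append(queue, i, blocked) // process the neighbor first
		}
	}
	fmt.Printf("cleared %d subsumed replicas in %d GC attempts\n", n, attempts)
}
```

With n = 3000 this clears the chain in 5999 attempts (roughly 2n) from a single external trigger, instead of needing one full scanner pass per replica.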

We could also make the replicaGC timings more aggressive, and check whether our existing heuristics are lacking in this scenario. My expectation would be that all such ranges would have replicaGC'ed themselves quickly, since they would be rejected by their former peers. Perhaps subsumed ranges don't get the benefit of receiving a ReplicaTooOldError (since the peers no longer have any replica of the range at all!); maybe RangeNotFound should also expedite replicaGC (sketch below).
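
For the RangeNotFound idea, the change might amount to widening the set of peer response errors that expedite a replicaGC check. A hedged sketch with stand-in sentinel errors (the real errors are structured types, and the plumbing lives in the transport/queue code, which looks different):

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in sentinel errors; in CockroachDB these would be the structured
// ReplicaTooOldError and RangeNotFoundError returned by peers.
var (
	errReplicaTooOld = errors.New("replica too old")
	errRangeNotFound = errors.New("range not found")
)

// shouldExpediteReplicaGC decides whether a peer's response error should get
// the local replica queued for replicaGC right away instead of waiting for
// the scanner's normal, much slower cadence.
func shouldExpediteReplicaGC(err error) bool {
	switch {
	case errors.Is(err, errReplicaTooOld):
		// Existing hint: the peer told us our replica was removed.
		return true
	case errors.Is(err, errRangeNotFound):
		// Proposed addition: the peer no longer has any replica of the
		// range at all, which is exactly what a merge leaves behind.
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldExpediteReplicaGC(errRangeNotFound)) // true under the proposal
}
```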

x-ref https://github.com/cockroachdb/cockroach/issues/101999

Jira issue: CRDB-27209

tbg commented 1 year ago

Nothing new, but some more analysis here: the problem persisted for days and led to load imbalance because n2 (the affected node) was underutilized. It also prevented draining the node cleanly.

Most replicaGC invocations fail with variants of this:

I230425 08:43:21.353615 963367 kv/kvserver/replica_gc_queue.go:350 ⋮ [T1,n2,replicaGC,s2,r170380/4:‹/Table/711/1/-83{7566…-5724…}›] 11 left neighbor r170403:‹/Table/711/1/-86{52094248358276208-33665932600324608}› [(n7,s7):1, (n2,s2):4, (n6,s6):3, next=5, gen=11502] not up-to-date with meta descriptor r127164:‹/Table/{522-15894}› [(n7,s7):1, (n1,s1):7, (n5,s5):3, (n4,s4):8, (n8,s8):10, next=12, gen=135785]; cannot safely GC range yet

This is O(n²) behavior: on each complete pass through the replicas, we get to delete just ~1. So we do another pass, etc.
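
To make the "one per pass" behavior concrete, here is a standalone toy count (not the actual queue code), assuming the scanner visits the chain right to left, which is the unlucky order:

```go
package main

import "fmt"

func main() {
	const n = 3000
	// waiting[i] == true means replica i is still subsumed and awaiting GC;
	// it may only be GC'ed once replica i-1 is gone (or if i == 0).
	waiting := make([]bool, n)
	for i := range waiting {
		waiting[i] = true
	}

	passes, attempts, remaining := 0, 0, n
	for remaining > 0 {
		passes++
		// Unlucky scanner order: right to left. Each pass can then only
		// clear the current head of the chain.
		for i := n - 1; i >= 0; i-- {
			if !waiting[i] {
				continue
			}
			attempts++
			if i == 0 || !waiting[i-1] {
				waiting[i] = false
				remaining--
			}
		}
	}
	fmt.Printf("n=%d: %d scanner passes, %d GC attempts\n", n, passes, attempts)
}
```

For n = 3000 this prints 3000 passes and about 4.5M attempts, i.e. quadratic work overall, versus a single pass if the scanner happened to visit the chain left to right.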


I tried some manual enqueueing, but it suffers from the same problem: you need to compute the right order of things, and it's cumbersome.

We ended up decommissioning and re-adding the node.

williamkulju commented 10 months ago

hit this again while scale testing 23.2. Thread is here.