kvcoord: After changing from multi-store to single-store nodes, abandoned replicas can get stuck in the GC queue

andrewbaptist commented 1 year ago

Describe the problem

In a customer case, after changing from a multi-store to a single-store environment, the system appeared to work correctly. A few days later, however, it experienced a number of routing problems in DistSender. The underlying issue was that the abandoned replicas were retried forever, which eventually prevented requests from being routed to other replicas.

To Reproduce

1) Convert a single-store system to multi-store.
2) Let it sit for a little bit to rebalance.
3) Create and delete some tables and data while it is in this configuration.
4) Temporarily disable the replicaGC queue.
5) Convert it from multi-store back to single-store by removing the disks gradually, one at a time, and letting the system rebalance data.
6) Turn the replicaGC queue back on.
7) Notice the errors in the logs on the nodes. They will look like:

kv/kvclient/kvcoord/dist_sender.go:1696  slow range RPC: have been waiting 60.96s (65 attempts) for RPC PushTxn(...) to r22298: [(n7,s29):51, (n4,s22):35, (n1,s17):53, next=54, gen=2595, sticky=1688401984.449023734,0]; resp:  failed to send RPC: sending to all replicas failed; last error: store 22 was not found

Expected behavior

DistSender should not get stuck on a replica that returns a StoreNotFoundError. Instead, it should fail the request quickly if all stores return that error.

Jira issue: CRDB-30166

andrewbaptist commented 1 year ago

This is related to https://github.com/cockroachdb/cockroach/issues/74503 also.

erikgrinaker commented 1 year ago

I think this can happen any time a store is removed, regardless of how many stores we end up with?

kvcoord: After changing from multi-store to single-store nodes, abandoned replicas can get stuck in the GC queue #107699