Describe the problem
In a customer case, after a cluster was changed from a multi-store to a single-store configuration, the system initially appeared to work correctly. A few days later it experienced a number of routing problems in DistSender. The underlying issue was that requests to the abandoned replicas were retried forever, which eventually prevented requests from being routed to other replicas.
To Reproduce
1) Convert a single-store system to multi-store.
2) Let it sit for a short while so that it can rebalance.
3) Create and delete some tables and data while it is in this configuration.
4) Temporarily disable the replicaGC queue.
5) Convert it from multi-store back to single-store by removing the disks one at a time and letting the system rebalance data after each removal.
6) Turn the replicaGC queue back on.
7) Look for errors in the node logs. They will look like:
kv/kvclient/kvcoord/dist_sender.go:1696 slow range RPC: have been waiting 60.96s (65 attempts) for RPC PushTxn(...) to r22298: [(n7,s29):51, (n4,s22):35, (n1,s17):53, next=54, gen=2595, sticky=1688401984.449023734,0]; resp: failed to send RPC: sending to all replicas failed; last error: store 22 was not found
Expected behavior
DistSender should not get stuck on a replica that returns a StoreNotFoundError. Instead, it should fail the request quickly if every store it tries returns that error.
Jira issue: CRDB-30166