Describe the problem

When running a rolling restart of nodes in a large cluster (v22.2.7), we noticed that recently restarted nodes tended to have many paused followers due to IO overload. The IO overload is due to the restarting node receiving a large amount of Raft catch-up traffic.
If restarts occur back-to-back, temporary range unavailability can result because (1) the last restarted node is IO overloaded (its replicas are paused) and (2) the node currently restarting is offline for a short period between stopping and starting.
This is most likely to be problematic when there is a small window between restarts on a larger cluster with moderate-to-heavy write traffic.
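A rough, untested way to watch both symptoms at once while a restart is in progress is to poll the store metrics. $PGURL is a placeholder connection string, and admission.raft.paused_replicas is my best guess at the 22.2 follower-pausing metric name, so double-check both:

```sh
# Untested sketch: poll unavailable-range and paused-replica counts during the
# rolling restart. admission.raft.paused_replicas is an assumed metric name;
# adjust it to whatever the 22.2 follower-pausing gauge is actually called.
while true; do
  cockroach sql --url "$PGURL" -e "
    SELECT store_id, name, value
    FROM crdb_internal.node_metrics
    WHERE name IN ('ranges.unavailable', 'admission.raft.paused_replicas')
      AND value > 0;"
  sleep 10
done
```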
To Reproduce
This was observed on a 100-node CC cluster during a rolling restart. I haven't tried to reproduce it in a smaller roachprod cluster yet; however, the steps I'd take are (see the sketch after this list):
1. Set up a 9-node CRDB cluster running v22.2.7.
2. Run the indexes workload against the cluster.
3. Enable follower pausing with a 1.0 threshold.
4. Send a terminate signal to the CRDB processes in a rolling fashion and restart them.
5. Observe whether there are log messages indicating raft unavailability.
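A rough sketch of those steps on roachprod is below. It is untested; the roachprod flags, the workload invocation, and the follower-pausing setting name (I believe it is admission.kv.pause_replication_io_threshold) are from memory and should be double-checked:

```sh
# Untested repro sketch for the steps above. Flag and setting names are my
# best recollection and may need adjusting.
CLUSTER=$USER-pause-repro

roachprod create "$CLUSTER" -n 9
roachprod stage "$CLUSTER" release v22.2.7
roachprod start "$CLUSTER"

# Run the indexes workload against the cluster.
roachprod run "$CLUSTER":1 -- "./cockroach workload init indexes"
roachprod run "$CLUSTER":1 -- "./cockroach workload run indexes --duration=1h" &

# Enable follower pausing with a 1.0 threshold (assumed setting name).
roachprod sql "$CLUSTER":1 -- -e \
  "SET CLUSTER SETTING admission.kv.pause_replication_io_threshold = 1.0"

# Rolling restart with a deliberately small window between restarts.
for i in $(seq 1 9); do
  roachprod stop "$CLUSTER":$i --sig 15 --wait   # terminate signal, wait for exit
  roachprod start "$CLUSTER":$i
  sleep 30
done

# Look for raft unavailability in the logs.
roachprod run "$CLUSTER" -- "grep -iR 'unavailable' logs/ | head"
```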
Expected behavior
Follower pausing and rolling restarts should play nicely together, so that ranges don't become transiently unavailable during a restart.
Environment:
CRDB v22.2.7 on a 100-node CC cluster.
Additional context
The workload hard-stalled: QPS dropped from 100k+ to near 0.
Related issue from same cluster: https://github.com/cockroachdb/cockroach/issues/101315
Jira issue: CRDB-26900
Epic: CRDB-39900