Describe the problem

When running a rolling restart of nodes in a large cluster (v22.2.7), we noticed that recently restarted nodes tended to have many paused followers due to IO overload. The IO overload is due to the restarting node receiving a large amount of Raft catch-up traffic.
If restarts occur back-to-back, temporary range unavailability can result because (1) the last restarted node is IO overloaded (its replicas are paused) and (2) the node currently restarting is offline for a short period between stopping and starting.
This is most likely to be problematic when there is a small window between restarts on a larger cluster with moderate-to-heavy write traffic.
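A rough, untested way to watch both symptoms at once while a restart is in progress is to poll the store metrics. $PGURL is a placeholder connection string, and admission.raft.paused_replicas is my best guess at the 22.2 follower-pausing metric name, so double-check both:

```sh
# Untested sketch: poll unavailable-range and paused-replica counts during the
# rolling restart. admission.raft.paused_replicas is an assumed metric name;
# adjust it to whatever the 22.2 follower-pausing gauge is actually called.
while true; do
  cockroach sql --url "$PGURL" -e "
    SELECT store_id, name, value
    FROM crdb_internal.node_metrics
    WHERE name IN ('ranges.unavailable', 'admission.raft.paused_replicas')
      AND value > 0;"
  sleep 10
done
```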
To Reproduce
This was observed on a 100-node CC cluster during a rolling restart. I haven't tried to reproduce it in a smaller roachprod cluster yet; however, the steps I'd take are (see the sketch after this list):
1. Set up a 9-node CRDB cluster running v22.2.7.
2. Run the indexes workload against the cluster.
3. Enable follower pausing with a 1.0 threshold.
4. Send a terminate signal to the CRDB processes in a rolling fashion and restart them.
5. Observe whether there are log messages indicating raft unavailability.
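A rough sketch of those steps on roachprod is below. It is untested; the roachprod flags, the workload invocation, and the follower-pausing setting name (I believe it is admission.kv.pause_replication_io_threshold) are from memory and should be double-checked:

```sh
# Untested repro sketch for the steps above. Flag and setting names are my
# best recollection and may need adjusting.
CLUSTER=$USER-pause-repro

roachprod create "$CLUSTER" -n 9
roachprod stage "$CLUSTER" release v22.2.7
roachprod start "$CLUSTER"

# Run the indexes workload against the cluster.
roachprod run "$CLUSTER":1 -- "./cockroach workload init indexes"
roachprod run "$CLUSTER":1 -- "./cockroach workload run indexes --duration=1h" &

# Enable follower pausing with a 1.0 threshold (assumed setting name).
roachprod sql "$CLUSTER":1 -- -e \
  "SET CLUSTER SETTING admission.kv.pause_replication_io_threshold = 1.0"

# Rolling restart with a deliberately small window between restarts.
for i in $(seq 1 9); do
  roachprod stop "$CLUSTER":$i --sig 15 --wait   # terminate signal, wait for exit
  roachprod start "$CLUSTER":$i
  sleep 30
done

# Look for raft unavailability in the logs.
roachprod run "$CLUSTER" -- "grep -iR 'unavailable' logs/ | head"
```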
Expected behavior
Follower pausing and rolling restarts should play nicely together, so that ranges don't become transiently unavailable during a restart.
Environment:
CRDB v22.2.7 on a 100-node CC cluster.
Additional context
The workload hard-stalled: QPS dropped from 100k+ to near 0.
Related issue from same cluster: https://github.com/cockroachdb/cockroach/issues/101315
Jira issue: CRDB-26900
Epic: CRDB-39900