Open aayushshah15 opened 2 years ago
This check exists because we want the replicate queue to avoid processing ranges that are unavailable, lest it stay blocked on that one range until its processing timeout is hit.
I wonder how much of this caution is necessary after @tbg's circuit breaker work.
Even with circuit breakers, proposing a write on a range that has yet to detect unavailability is going to block for 60 seconds. I don’t know that we want to open the floodgates for the replicate queue to wait 60s on each unavailable range, indefinitely.
Perhaps we should be drawing a distinction between DEAD nodes and UNAVAILABLE nodes (in the node liveness sense). If a node has been offline for 5+ minutes, there's very little chance that it will contribute to a timely consensus vote. However, if it's been offline for less than `server.time_until_store_dead`, it could be flapping. By drawing this distinction in `liveAndDeadReplicas` (which currently ignores `storeStatusUnknown`), we could cap the degree of disruption that a dead node could cause to the `replicateQueue` (up to 5 minutes of blocking) while ensuring that it still attempts to remove replicas from flapping nodes.
Concretely, I'm wondering if we want the following change:
```diff
diff --git a/pkg/kv/kvserver/store_pool.go b/pkg/kv/kvserver/store_pool.go
index dc6e6b6c11..bba34c2be1 100644
--- a/pkg/kv/kvserver/store_pool.go
+++ b/pkg/kv/kvserver/store_pool.go
@@ -707,6 +707,9 @@ func (sp *StorePool) liveAndDeadReplicas(
 			// and should be used for up-replication if necessary.
 			liveReplicas = append(liveReplicas, repl)
 		case storeStatusUnknown:
+			if includeSuspectAndDrainingStores {
+				liveReplicas = append(liveReplicas, repl)
+			}
 			// No-op.
 		case storeStatusSuspect, storeStatusDraining:
 			if includeSuspectAndDrainingStores {
```
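To see the effect of the patch, here is a simplified stand-in for the bucketing that `liveAndDeadReplicas` performs (hypothetical types and function name, not the real implementation): with the change applied, a `storeStatusUnknown` replica is counted as live when `includeSuspectAndDrainingStores` is set, instead of being dropped from both buckets.

```go
package main

import "fmt"

type storeStatus int

const (
	storeStatusAvailable storeStatus = iota
	storeStatusUnknown
	storeStatusSuspect
	storeStatusDraining
	storeStatusDead
)

// bucketReplicas is a toy version of the switch in the diff above: it counts
// live and dead replicas given each replica's store status.
func bucketReplicas(statuses []storeStatus, includeSuspectAndDrainingStores bool) (live, dead int) {
	for _, st := range statuses {
		switch st {
		case storeStatusAvailable:
			live++
		case storeStatusUnknown:
			// The proposed change: treat unknown (possibly flapping) stores
			// as live when the caller opts in.
			if includeSuspectAndDrainingStores {
				live++
			}
		case storeStatusSuspect, storeStatusDraining:
			if includeSuspectAndDrainingStores {
				live++
			}
		case storeStatusDead:
			dead++
		}
	}
	return live, dead
}

func main() {
	statuses := []storeStatus{storeStatusAvailable, storeStatusUnknown, storeStatusDead}
	live, dead := bucketReplicas(statuses, true)
	fmt.Println(live, dead) // the unknown replica now counts as live
}
```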
> By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking)
That makes sense. I'm onboard with the proposal.
> Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness.
Do we expect this to be a problem in v22.1+, with liveness requests bypassing all admission queuing and admission control throttling all other operations? (Modulo those not yet fully integrated, like follower writes/snapshot ingestions.) In any case, the patch suggested above seems like a reasonable defense-in-depth thing to have.
@AlexTalks Do you think we should address this during stability? It seems like a small enough fix that's not too risky.
When trying to determine what action to take on a range, the allocator first determines whether the range can currently achieve quorum. Ranges with a majority of their replicas on non-live nodes (based on the current state of their node liveness records) are deemed unavailable and are not immediately acted upon (they are re-considered after a full scanner interval, which is 10 minutes by default). This check exists because we want the replicate queue to avoid processing ranges that are unavailable, lest it stay blocked on that one range until its processing timeout is hit.
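The quorum check described above reduces to a simple majority count. A minimal sketch (illustrative only; the real logic lives in the allocator, and `canAchieveQuorum` is a hypothetical name):

```go
package main

import "fmt"

// canAchieveQuorum reports whether a range with numReplicas voting replicas,
// of which numLive are on live nodes, can still reach majority quorum.
func canAchieveQuorum(numReplicas, numLive int) bool {
	quorum := numReplicas/2 + 1
	return numLive >= quorum
}

func main() {
	fmt.Println(canAchieveQuorum(3, 2)) // 2 of 3 live: quorum still reachable
	fmt.Println(canAchieveQuorum(5, 2)) // majority on non-live nodes: unavailable
}
```

A range failing this check is what the replicate queue skips, rather than blocking on it until its processing timeout.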
Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness. Furthermore, the presence of such nodes is often cause for manual intervention, where operators try to decommission them out of the cluster to avoid instability.
Effectively, this means that operations like decommissioning these struggling nodes often stall or have sporadic slowdowns depending on these nodes' liveness records at the time their ranges are checked by the replicate queue for processing.
Jira issue: CRDB-14665
Epic: CRDB-39952