cockroachdb / cockroach


kvserver: nodes flapping on their liveness can stall cluster recovery operations #79266

Open aayushshah15 opened 2 years ago

aayushshah15 commented 2 years ago

When trying to determine what action to take on a range, the allocator first checks whether the range can currently achieve quorum. Ranges with a majority of their replicas on non-live nodes (based on the current state of their node liveness records) are considered unavailable and are not acted upon immediately; they are reconsidered after a full scanner interval, which is 10 minutes by default. This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.
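
As an illustration, here is a minimal, self-contained sketch of the kind of quorum check described above. It is not the actual allocator code; the names `replicaInfo` and `rangeCanMakeProgress` are hypothetical.

```go
package main

import "fmt"

// replicaInfo is a stand-in for a replica descriptor plus the liveness
// status of the node it lives on.
type replicaInfo struct {
	nodeID int
	live   bool // derived from the node's liveness record
}

// rangeCanMakeProgress reports whether a majority of the range's replicas
// are on nodes currently considered live.
func rangeCanMakeProgress(replicas []replicaInfo) bool {
	quorum := len(replicas)/2 + 1
	liveCount := 0
	for _, r := range replicas {
		if r.live {
			liveCount++
		}
	}
	return liveCount >= quorum
}

func main() {
	// A 3x-replicated range with two replicas on non-live nodes is treated
	// as unavailable and skipped until the next scanner interval (~10 min).
	replicas := []replicaInfo{{1, true}, {2, false}, {3, false}}
	fmt.Println(rangeCanMakeProgress(replicas)) // false
}
```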

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness. Furthermore, the presence of such nodes often calls for manual intervention, where we try to decommission them out of the cluster to avoid instability.

Effectively, this means that operations like decommissioning these struggling nodes can stall or slow down sporadically, depending on the state of the nodes' liveness records at the time their ranges are checked by the replicate queue for processing.

Jira issue: CRDB-14665

Epic CRDB-39952

aayushshah15 commented 2 years ago

This check exists because we want the replicate queue to avoid processing unavailable ranges, lest it stay blocked on one such range until its processing timeout is hit.

I wonder how much of this caution is necessary after @tbg's circuit breaker work.

nvanbenschoten commented 2 years ago

Even with circuit breakers, proposing a write on a range that has yet to detect unavailability is going to block for 60 seconds. I don’t know that we want to open the floodgates for the replicate queue to wait 60s on each unavailable range, indefinitely.

Perhaps we should be drawing a distinction between DEAD nodes and UNAVAILABLE nodes (in the node liveness sense). If a node has been offline for 5+ minutes, there's very little chance that it will contribute to a timely consensus vote. However, if it's been offline for less than server.time_until_store_dead, it could be flapping. By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking) while ensuring that it still attempts to remove replicas from flapping nodes.

Concretely, I'm wondering if we want the following change:

```diff
diff --git a/pkg/kv/kvserver/store_pool.go b/pkg/kv/kvserver/store_pool.go
index dc6e6b6c11..bba34c2be1 100644
--- a/pkg/kv/kvserver/store_pool.go
+++ b/pkg/kv/kvserver/store_pool.go
@@ -707,6 +707,9 @@ func (sp *StorePool) liveAndDeadReplicas(
 			// and should be used for up-replication if necessary.
 			liveReplicas = append(liveReplicas, repl)
 		case storeStatusUnknown:
+			if includeSuspectAndDrainingStores {
+				liveReplicas = append(liveReplicas, repl)
+			}
 		// No-op.
 		case storeStatusSuspect, storeStatusDraining:
 			if includeSuspectAndDrainingStores {
```
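
For context on the DEAD-vs-UNAVAILABLE distinction above, here is a rough, hypothetical sketch (not the real StorePool logic) of how a store could be bucketed by how long its liveness record has been expired, assuming the default 5m server.time_until_store_dead:

```go
package main

import (
	"fmt"
	"time"
)

type storeStatus string

const (
	statusLive    storeStatus = "live"
	statusUnknown storeStatus = "unknown" // liveness expired, but not long enough to be declared dead
	statusDead    storeStatus = "dead"    // liveness expired for longer than time_until_store_dead
)

// classify buckets a store by how long its liveness record has been expired.
func classify(sinceExpiration, timeUntilStoreDead time.Duration) storeStatus {
	switch {
	case sinceExpiration <= 0:
		return statusLive
	case sinceExpiration < timeUntilStoreDead:
		// Possibly just a flapping or oversaturated node; under the proposal
		// above, the replicate queue could still act on its replicas.
		return statusUnknown
	default:
		return statusDead
	}
}

func main() {
	const timeUntilStoreDead = 5 * time.Minute // default server.time_until_store_dead
	fmt.Println(classify(30*time.Second, timeUntilStoreDead)) // unknown
	fmt.Println(classify(10*time.Minute, timeUntilStoreDead)) // dead
}
```
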
aayushshah15 commented 2 years ago

By drawing this distinction in liveAndDeadReplicas (which currently ignores storeStatusUnknown), we could cap the degree of disruption that a dead node could cause to the replicateQueue (up to 5 minutes of blocking)

That makes sense. I'm on board with the proposal.

irfansharif commented 2 years ago

Unfortunately, a non-live node liveness record doesn't always mean that the node is actually dead. It could simply be a symptom of an oversaturated node struggling to heartbeat its liveness.

Do we expect this to be a problem in v22.1+, with liveness requests bypassing all admission queuing and admission control throttling all other operations? (Modulo those not yet fully integrated, like follower writes/snapshot ingestions.) In any case, the patch suggested above seems like a reasonable defense-in-depth measure to have.

aayushshah15 commented 2 years ago

@AlexTalks Do you think we should address this during stability? It seems like a small enough fix that's not too risky.