cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

Exponential backoff for nodes in a panic loop #61886

Open glennfawcett opened 3 years ago

glennfawcett commented 3 years ago

Is your feature request related to a problem? Please describe.

When a node is sick and stuck in a panic loop, that single node can destabilize the whole cluster.

Describe the solution you'd like

It would be nice to have an exponential backoff on sick nodes in the cluster. If the recycle time and frequency were recorded and referenced when a node joins the cluster, some heuristics could be added to pause before the node accepts ranges, leases, and connections. Basically, wait some set amount of time before a node becomes a full member, something like a "PID controller" for cluster admission.
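To make the idea concrete, here is a minimal sketch of the backoff heuristic in Python. It is not CockroachDB code: `RestartHistory`, `admission_delay`, the 10-minute window, the 5-second base delay, and the 5-minute cap are all hypothetical names and values picked for illustration. The assumption is that each restart timestamp is recorded somewhere that survives the crash, and that the delay is consulted before the node is allowed to accept ranges, leases, and connections.

```python
from dataclasses import dataclass, field


@dataclass
class RestartHistory:
    """Hypothetical per-node record of recent restart times, in seconds."""
    window_s: float = 600.0  # assumed look-back window (10 minutes)
    restarts: list = field(default_factory=list)

    def record(self, now_s: float) -> None:
        """Record a restart and drop entries older than the window."""
        self.restarts.append(now_s)
        self.restarts = [t for t in self.restarts if now_s - t <= self.window_s]


def admission_delay(history: RestartHistory, now_s: float,
                    base_s: float = 5.0, cap_s: float = 300.0) -> float:
    """Seconds to wait before the node becomes a full cluster member.

    A single restart in the window is treated as normal (no delay).
    Each additional restart doubles the delay, up to cap_s.
    """
    recent = [t for t in history.restarts if now_s - t <= history.window_s]
    if len(recent) <= 1:
        return 0.0
    return min(cap_s, base_s * 2 ** (len(recent) - 2))


if __name__ == "__main__":
    history = RestartHistory()
    for t in (0.0, 30.0, 60.0, 90.0):  # a node restarting every 30 seconds
        history.record(t)
        print(f"restart at t={t:.0f}s, wait {admission_delay(history, t):.0f}s")
```

The cap keeps a node that has since been fixed from being locked out for long, and the sliding window lets the delay fall back to zero once the node stops restarting.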

Additional context

This was seen with https://github.com/cockroachdb/cockroach/pull/61818 . When the new binary was deployed to just the sickest node, the cluster became stable pretty quickly.

[Images: sick_node_228, sick_node_stabilization]

Jira issue: CRDB-2718

glennfawcett commented 3 years ago

To expand upon the concept of node health, we could include performance metrics as part of the acceptance criteria. For instance, if the P99 read latency for a given node rises far above that of the other nodes, it should begin shedding leases and replicas. This would give clusters a 'least latency lease policy' that would help keep unhealthy nodes from impacting the overall health of the cluster from an application's point of view. As an acceptance test, a node's resources (CPU, network, IO) should be saturated so as to impact the P99 latency, and the resulting lease transfers should be observed. Basically, the goal is to refuse bad actors entry to the cluster and to reduce workload pressure when latency is high on specific nodes. Load is already taken into account by load-based range splitting, but this should be expanded to include performance criteria.
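A rough sketch of the latency check, again in Python and again hypothetical: `nodes_to_shed`, the 3x factor, and the three-node minimum are assumptions for illustration, not existing allocator settings. It compares each node's P99 read latency against the cluster median and flags outliers as candidates for shedding leases.

```python
from statistics import median


def nodes_to_shed(p99_ms_by_node: dict, factor: float = 3.0,
                  min_nodes: int = 3) -> list:
    """Return node IDs whose P99 latency exceeds factor x the cluster median.

    With fewer than min_nodes there is no meaningful baseline, so no
    node is flagged.
    """
    if len(p99_ms_by_node) < min_nodes:
        return []
    baseline = median(p99_ms_by_node.values())
    return sorted(node for node, p99 in p99_ms_by_node.items()
                  if p99 > factor * baseline)


if __name__ == "__main__":
    sample = {"n1": 10.0, "n2": 12.0, "n3": 11.0, "n4": 95.0}
    print(nodes_to_shed(sample))
```

Using the median rather than the mean keeps one very slow node from dragging the baseline up and hiding itself. A real policy would also need hysteresis so that leases do not bounce back and forth between nodes.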

knz commented 3 years ago

@lunevalex I disagree with your triage action. It looks to me like the request is for the replica allocator to avoid flapping nodes. That's a replication project, not server.

lunevalex commented 3 years ago

Discussed with @knz: there are multiple things that could be done here, in both the KV and Server components.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!