Open glennfawcett opened 3 years ago
To expand upon the concept of node health, we could include performance metrics as part of the acceptance criteria. For instance, if the P99 read latency for a given node rises far above that of other nodes, it should begin shedding leases and replicas. This would provide clusters with a 'least latency lease policy' that would help keep unhealthy nodes from impacting the overall health of the cluster from an application's point of view. As an acceptance test, a node's resources (CPU, network, IO) should be saturated so as to impact its P99 latency, and the resulting lease transfers observed. Basically, the goal is to refuse entry of bad actors to the cluster and to reduce workload pressure when latency is high on specific nodes. This is already done with load-based range splitting, but should be expanded to include performance criteria.
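A rough sketch of what the latency check could look like, purely to make the idea concrete. The function name, the median baseline, and the `factor` and `min_ms` thresholds are all hypothetical; none of this reflects how the allocator works today.

```python
from statistics import median


def nodes_to_shed(p99_ms: dict[int, float],
                  factor: float = 3.0,
                  min_ms: float = 10.0) -> list[int]:
    """Return IDs of nodes whose P99 read latency is far above their peers.

    p99_ms maps node ID -> P99 read latency in milliseconds.
    factor and min_ms are made-up tuning knobs for illustration.
    """
    # With fewer than three nodes there is no meaningful peer baseline.
    if len(p99_ms) < 3:
        return []
    # The median is used as the baseline so one slow node cannot skew it.
    baseline = median(p99_ms.values())
    # The floor (min_ms) avoids flagging nodes when the whole cluster is fast.
    threshold = max(min_ms, factor * baseline)
    return sorted(n for n, lat in p99_ms.items() if lat > threshold)
```

For example, with latencies `{1: 4.0, 2: 5.0, 3: 6.0, 4: 80.0}` only node 4 exceeds the threshold, so it would be the candidate to start shedding leases and replicas.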
@lunevalex I disagree with your triage action. It looks to me like the request is for the replica allocator to avoid flapping nodes. That's a replication project, not server.
Discussed with @knz: there are multiple things that could be done here, in both the KV and Server components.
We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
Is your feature request related to a problem? Please describe.
When a node is sick and in a panic loop, that single node can destabilize the whole cluster.
Describe the solution you'd like
It would be nice to have an exponential backoff on sick nodes in the cluster. If the recycle time and frequency were recorded and referenced when a node rejoins the cluster, some heuristics could be added to pause before it accepts ranges, leases, and connections. Basically, wait some set amount of time before a node becomes a full member... something like a "PID controller" for cluster admission.
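To illustrate, here is a minimal sketch of the backoff heuristic, assuming restart timestamps are recorded per node. `RestartHistory`, `admission_delay`, and the `window`, `base`, and `cap` defaults are hypothetical names and values, not existing CockroachDB APIs or settings. It is a plain exponential backoff, not a real PID controller.

```python
from dataclasses import dataclass, field


@dataclass
class RestartHistory:
    """Hypothetical per-node record of restart timestamps, in seconds."""
    restarts: list[float] = field(default_factory=list)

    def record(self, now: float) -> None:
        self.restarts.append(now)

    def recent_count(self, now: float, window: float) -> int:
        # Count restarts that happened within the last `window` seconds.
        return sum(1 for t in self.restarts if now - t <= window)


def admission_delay(history: RestartHistory, now: float,
                    window: float = 600.0,
                    base: float = 5.0,
                    cap: float = 300.0) -> float:
    """Seconds to wait before the node accepts ranges, leases, and connections.

    A single recent restart is admitted immediately; each additional
    restart inside the window doubles the delay, up to `cap`.
    """
    n = history.recent_count(now, window)
    if n <= 1:
        return 0.0
    return min(cap, base * 2 ** (n - 2))
```

A node that restarted three times in the last ten minutes would wait 10 seconds under these defaults, while a node in a tight panic loop would quickly hit the 300 second cap and stay out of the way of the rest of the cluster.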
Additional context
This was seen with https://github.com/cockroachdb/cockroach/pull/61818. When the new binary was deployed to just the sickest node, the cluster became stable pretty quickly.
Jira issue: CRDB-2718