DuskEagle opened 1 year ago
The critical nodes endpoint currently reports false positives: for example, it always reports that all nodes in a 3-node cluster are critical, because the default replication factor for system ranges is 5 (even though elsewhere CockroachDB makes allowances for the fact that a 3-node cluster only has an effective replication factor of 3).
I would think this is a bug that could be fixed and closed quickly. The under-replicated ranges endpoint doesn't appear to suffer from this problem, so the code needed for a fix already exists in CRDB.
**Is your feature request related to a problem? Please describe.**
It should be possible to determine whether a node can be shut down without resulting in unavailable ranges in the cluster. This is needed to make rolling restarts of a cluster safer, e.g. when upgrading the CRDB version on each node.
**Describe the solution you'd like**
The ideal experience for us would be that `cockroach node drain` should not complete successfully if shutting down the node would result in unavailable ranges. This behavior could optionally be gated behind a `--ensure-safe-shutdown` flag if backwards compatibility is a concern. With this, we could attempt to drain a node, and if the drain doesn't finish successfully, we could hold off on restarting the node until the issue preventing it from completing is resolved.
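To make the desired workflow concrete, here is a rough sketch (in Go, standard library only) of how our orchestration would consume that behavior: shell out to `cockroach node drain` and only proceed with the restart if the drain exits successfully. The host address and certs directory are placeholders, and the safety check itself is the behavior proposed above, not something drain does today.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// drainNode shells out to `cockroach node drain`. Under this proposal, the
// command would exit non-zero if shutting the node down would leave ranges
// unavailable (possibly only when the proposed --ensure-safe-shutdown flag is
// passed), so a failed drain tells the orchestrator to hold off on the restart.
func drainNode(host, certsDir string) error {
	cmd := exec.Command("cockroach", "node", "drain",
		"--host", host, // placeholder address of the node being drained
		"--certs-dir", certsDir, // placeholder certificates directory
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	if err := drainNode("localhost:26257", "certs"); err != nil {
		// Drain did not complete successfully: skip the restart and retry (or
		// page an operator) once whatever blocked the drain is resolved.
		log.Fatalf("drain failed, not restarting node: %v", err)
	}
	log.Println("drain succeeded; proceeding with node restart")
	// ... stop the cockroach process and start the new version here ...
}
```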
**Describe alternatives you've considered**
We have considered using the critical nodes HTTP endpoint to determine if a node is safe to remove. This falls short today for one key reason and a couple of more minor ones:

- The critical nodes endpoint suffers from false positives, as described at the top of this issue.
- There is a window between checking the critical nodes endpoint and running `cockroach node drain`. Many things could happen inside CockroachDB during that time (e.g. replicas moving to new nodes). CockroachDB is better equipped to detect whether a node is safe to shut down at a given time than an orchestrator outside of CockroachDB is, since the orchestrator has to issue separate requests to CRDB to check whether a node is safe to shut down before beginning the shutdown process.
- The `critical_nodes` endpoint is only available via HTTP, while `cockroach node drain` is only available via the CLI.

Still, we would be open to using the critical nodes HTTP endpoint as an alternative to the altered drain behavior, if the false positive issues were resolved.
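For reference, here is a sketch of what the endpoint-based check looks like from an orchestrator. It assumes the endpoint is served at `/_status/critical_nodes` on the node's HTTP address and that the JSON response contains a `critical_nodes` array whose entries have a `node_id` field; both assumptions should be verified against the CRDB version in use, and on a secure cluster the request would also need an authenticated session, which is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// criticalNodesResponse models only the fields this sketch reads; the real
// response shape should be checked against the CockroachDB version in use.
type criticalNodesResponse struct {
	CriticalNodes []struct {
		NodeID int `json:"node_id"`
	} `json:"critical_nodes"`
}

// isNodeCritical asks the (assumed) /_status/critical_nodes endpoint whether
// nodeID is currently critical. Note the raciness described above: the answer
// can already be stale by the time `cockroach node drain` runs.
func isNodeCritical(httpAddr string, nodeID int) (bool, error) {
	resp, err := http.Post(httpAddr+"/_status/critical_nodes", "application/json", nil)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	var body criticalNodesResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	for _, n := range body.CriticalNodes {
		if n.NodeID == nodeID {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	critical, err := isNodeCritical("http://localhost:8080", 2)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("node 2 critical: %t\n", critical)
}
```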
**Additional context**
Of course, an unrelated node could crash in between when `cockroach node drain` finishes and when we shut down the target CRDB node. An unrelated node could also crash while the target CRDB node is shut down. These are edge cases which we can accept on clusters where the replication factor is set to 3, as long as we can detect the more common case that a cluster is already in an under-replicated state before we shut down a node.

https://cockroachlabs.atlassian.net/browse/SREOPS-6650 is just one example of a case where this feature request would've helped us. In that case, a cluster was scaled from one node to three. We then performed a second rolling restart to change the underlying VM type, before the cluster had finished upreplicating its ranges to all three nodes. This caused temporary unavailability of the cluster. With this feature, we would've delayed the rolling restart until it was safe to perform.
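As a stopgap for detecting that more common case today, an orchestrator could scrape a node's Prometheus endpoint and delay the rolling restart while any ranges are reported as under-replicated. The sketch below assumes the `/_status/vars` endpoint and the `ranges_underreplicated` gauge exposed there; it is an orchestrator-side heuristic, not a substitute for the drain-side check requested above.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
)

// underreplicatedRanges scrapes a node's Prometheus endpoint (assumed to be
// /_status/vars) and sums every sample of the ranges_underreplicated gauge.
func underreplicatedRanges(httpAddr string) (float64, error) {
	resp, err := http.Get(httpAddr + "/_status/vars")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Skip comments and unrelated metrics.
		if !strings.HasPrefix(line, "ranges_underreplicated") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		total += v
	}
	return total, scanner.Err()
}

func main() {
	n, err := underreplicatedRanges("http://localhost:8080")
	if err != nil {
		log.Fatal(err)
	}
	if n > 0 {
		log.Fatalf("%v under-replicated ranges reported; delaying rolling restart", n)
	}
	fmt.Println("no under-replicated ranges reported on this node; safe to proceed")
}
```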
Jira issue: CRDB-28257
Epic CRDB-39899