
Clear signal node is safe to shutdown #103910

Open DuskEagle opened 1 year ago

DuskEagle commented 1 year ago

Is your feature request related to a problem? Please describe. It should be possible to determine whether a node can be shut down without resulting in unavailable ranges in the cluster. This is needed to make rolling restarts of a cluster safer, e.g. when upgrading the CRDB version on each node.

Describe the solution you'd like The ideal experience for us would be for cockroach node drain to not complete successfully if shutting down the node would result in unavailable ranges. This behavior could optionally be gated behind an --ensure-safe-shutdown flag if backwards compatibility is a concern.

With this, we could attempt to drain a node, and if the drain doesn't finish successfully, we could hold off on restarting the node until the issue preventing it from completing is resolved.
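A minimal sketch of that orchestration loop, in Go, assuming the proposed --ensure-safe-shutdown flag existed (it does not today; the flag, host, and retry policy below are all illustrative):

```go
// Sketch of the "drain, and hold off on restart until drain succeeds" loop
// described above. --ensure-safe-shutdown is the hypothetical flag proposed
// in this issue, not an existing cockroach CLI flag.
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func drainNode(host string) error {
	// "cockroach node drain" is a real command; the safety flag is proposed.
	cmd := exec.Command("cockroach", "node", "drain",
		"--ensure-safe-shutdown", "--host", host, "--certs-dir", "certs")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("drain failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	const host = "crdb-0.internal:26257" // hypothetical node address
	// If the drain doesn't finish successfully, wait and retry rather than
	// restarting the node while ranges would become unavailable.
	for {
		if err := drainNode(host); err == nil {
			break
		}
		time.Sleep(30 * time.Second)
	}
	fmt.Println("node drained; safe (by the proposed contract) to restart")
}
```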

Describe alternatives you've considered We have considered using the critical nodes HTTP endpoint to determine whether a node is safe to remove. This falls short today for one key reason and a couple of minor ones:

  1. Too many false positives. For example, the critical nodes endpoint always reports that all nodes in a 3-node cluster are critical, because the default replication factor for system ranges is 5 (though elsewhere CockroachDB makes allowances that a 3-node cluster only has an effective replication factor of 3).
  2. There's an unavoidable window between when we query the critical nodes endpoint and when we would issue cockroach node drain. Many things could happen inside CockroachDB during that time (e.g. replicas moving to new nodes). CockroachDB is better equipped to detect whether a node is safe to shut down at a given moment than an external orchestrator, which has to issue separate requests to CRDB to check that a node is safe to shut down before beginning the shutdown process.
  3. It's slightly annoying that the critical_nodes endpoint is only available via HTTP, while cockroach node drain is only available via the CLI.

Still, we would be open to using the critical nodes HTTP endpoint as an alternative to the altered drain behavior, if the false positive issues were resolved.
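For reference, a sketch of the pre-drain check this alternative implies, assuming the endpoint is served at POST /_status/critical_nodes on the admin port and returns a JSON criticalNodes list of node descriptors (the field names and addresses below are illustrative):

```go
// Minimal pre-drain check against the critical nodes endpoint. A node that
// appears in the report is one whose loss could leave some range without
// quorum.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type criticalNodesResponse struct {
	CriticalNodes []struct {
		NodeID int32 `json:"nodeId"`
	} `json:"criticalNodes"`
}

// isNodeCritical reports whether nodeID appears in the cluster's critical
// nodes report.
func isNodeCritical(adminURL string, nodeID int32) (bool, error) {
	resp, err := http.Post(adminURL+"/_status/critical_nodes", "application/json", nil)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	var body criticalNodesResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	for _, n := range body.CriticalNodes {
		if n.NodeID == nodeID {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	critical, err := isNodeCritical("https://crdb-0.internal:8080", 1) // hypothetical admin URL
	if err != nil {
		panic(err)
	}
	fmt.Println("node 1 critical:", critical)
}
```

Even when this check returns false, the race from point 2 remains: the answer can be stale by the time the drain is actually issued.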

Additional context Of course, an unrelated node could crash between when cockroach node drain finishes and when we shut down the target CRDB node. An unrelated node could also crash while the target CRDB node is shut down. These are edge cases that we can accept on clusters where the replication factor is set to 3, as long as we can detect the more common case that a cluster is already in an underreplicated state before we shut down a node.

https://cockroachlabs.atlassian.net/browse/SREOPS-6650 is just one example of a case where this feature request would've helped us. In this case, a cluster was scaled from one node to three. We then performed a second rolling restart to change the underlying VM type, before the cluster had finished upreplicating its ranges to all three nodes. This caused temporary unavailability of the cluster. With this feature, we would've delayed the rolling restart until it was safe to perform.

Jira issue: CRDB-28257

Epic CRDB-39899

joshimhoff commented 1 year ago

For example, the critical nodes endpoint always reports that all nodes in a 3-node cluster are critical, because the default replication factor for system ranges is 5 (though elsewhere CockroachDB makes allowances that a 3-node cluster only has an effective replication factor of 3).

I would think this is a bug that could be quickly closed. I don't think the under-replicated ranges endpoint suffers from this problem, so we have the code needed to fix the bug in CRDB already.
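For comparison, a sketch of that under-replication check, scraping the ranges_underreplicated gauge from a node's Prometheus endpoint at /_status/vars (the metric and path exist in CRDB; the URL and the deliberately simple line parsing here are illustrative). Note this is a cluster-health signal rather than a per-node criticality signal, which is why it avoids the 3-node false positive:

```go
// Sum the ranges_underreplicated gauge from one node's /_status/vars
// endpoint. A nonzero total suggests a drain should wait for upreplication.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

func underreplicatedRanges(adminURL string) (int64, error) {
	resp, err := http.Get(adminURL + "/_status/vars")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var total int64
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Skip "# HELP"/"# TYPE" comment lines; match metric samples like
		// `ranges_underreplicated{store="1"} 0`.
		if !strings.HasPrefix(line, "ranges_underreplicated") {
			continue
		}
		fields := strings.Fields(line)
		v, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			continue
		}
		total += int64(v)
	}
	return total, scanner.Err()
}

func main() {
	n, err := underreplicatedRanges("https://crdb-0.internal:8080") // hypothetical URL
	if err != nil {
		panic(err)
	}
	fmt.Printf("under-replicated ranges: %d\n", n)
}
```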