cockroachdb / cockroach


Document / implement a procedure for *safely* removing a single problematic CRDB node with MTTR < 1 hour #73763

Closed joshimhoff closed 1 year ago

joshimhoff commented 2 years ago

Is your feature request related to a problem? Please describe. Sometimes a single CRDB node has a store-local problem that greatly hurts SQL reliability. As an example consider an "inverted" LSM. This problem often affects a single CRDB node. The node may still hold leases and thus stand in the critical path of the user. Oftentimes the LSM issues also lead to liveness problems and difficulty holding leases. The node flapping between live & not live leads to high SQL tail latencies (tho not so much anymore on master due to https://github.com/cockroachdb/cockroach/pull/63999). The list of possible problems goes on: IIUC, a node with an "inverted" LSM that holds no leases but does hold replicas can lag behind other replicas and thus backpressure writes, even if cockroach node drain has been run. The key thing is that there is a single CRDB node that has bad behavior; the cluster would perform better if it simply wasn't running at all!

What we need is a fast and safe way for operators like SREs to fix a single CRDB node issue like what is described above.

Existing options are:

  1. Run an offline compaction. We reject this as it is too slow. It can take >= 4 hours on a 900GB x 4 node cluster, and also cockroach compact doesn't report progress.
  2. Decommission the node and then replace it with a new one (can flip the order). We reject this as it is too slow. It can take ~ 1 day to decommission a single CRDB node with a 900 GB disk.

Describe the solution you'd like The basic idea is that to fix a problem like this, we can just remove the problem node permanently from the cluster. Data is already replicated.

Here is the strawman procedure (a rough command sketch follows the list):

  1. cockroach node drain the leases out of the problem node.
  2. Check that no ranges are under-replicated.
  3. Stop the CRDB process on the problem node. ------- NO MORE CUSTOMER IMPACT MODULO LOSS OF CAPACITY --------------
  4. Wait for the node to be marked as not live. Wait for no under-replicated ranges. Check for no kvprober errors or high latency.
  5. Decommission the dead node (which since dead should just do clean up, NOT any replica rebalancing).
  6. Remove the problem node permanently.
  7. Spin up a replacement node. (This can also be done earlier, tho either way you lose capacity for a while, as it takes time to move replicas over. This can be mitigated by vertically scaling the cluster before running the op.)
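
For concreteness, here is a rough shell sketch of what steps 1-5 might look like on a secure cluster. The hostnames, node ID, certs directory, admin UI port, and the use of systemd are all placeholder assumptions, not values from a real deployment:

```sh
# Assumed placeholders: problem node = n2 at problem-node:26257, any healthy node at
# healthy-node:26257, client certs in ./certs, admin UI on 8080, CRDB managed by systemd.

# 1. Drain leases (and SQL clients) off the problem node.
cockroach node drain --certs-dir=certs --host=problem-node:26257

# 2. Confirm no ranges are under-replicated (ranges_underreplicated is exported
#    on every node's Prometheus endpoint).
curl -sk https://healthy-node:8080/_status/vars | grep '^ranges_underreplicated'

# 3. Stop the CRDB process on the problem node.
ssh problem-node sudo systemctl stop cockroach

# 4. Wait for the node to show as not live, then for under-replication to clear again.
cockroach node status --certs-dir=certs --host=healthy-node:26257
curl -sk https://healthy-node:8080/_status/vars | grep '^ranges_underreplicated'

# 5. Decommission the now-dead node; since its replicas have already been re-replicated
#    from the surviving copies, this should mostly be metadata cleanup.
cockroach node decommission 2 --certs-dir=certs --host=healthy-node:26257
```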

The strawman op should be quick to execute on. Seems feasible to me to execute on the whole thing in < 1 hour! Note that once step 3 is executed on, there should be no customer impact. But we want to keep the overall run time < 1 hour, both to keep SRE ops load down, and to reduce the time in the lower capacity state (tho note that capacity remains reduced until replicas rebalance onto the new node, which takes much longer than 1 hour in many real world scenarios).

Questions for DB folks:

  1. Overall feelings about this procedure?
  2. Any suggested changes to the procedure?
  3. Any worries about the safety of the strawman op? To close this ticket, we must feel it is safe enough that an SRE can execute on it without DB eng involvement. I worry a bit about step 4, as both the under-replicated ranges metric & the kvprober metrics lag problem start time. It would be better to have a more robust signal that it is safe to remove a node without decommissioning.

Note that I opened this as SRE considered doing this to a recent cloud cluster but weren't sure enough safety-wise to do this without talking to multiple DB eng teams (KV & storage).

Describe alternatives you've considered See "existing options" up above.

Additional context This leads to both a high MTTR and a high ops load affecting SREs in a cloud context.

Jira issue: CRDB-11734

lunevalex commented 2 years ago

This is not really true anymore after #63999: "The node flapping between live & not live leads to high SQL tail latencies." Doing this procedure would result in a potentially precarious scenario, where you would be underreplicated for the duration of this incident. If another node fails at the same time you would end up with LOQ with RF=3, with reasonable certainty. To remove a node permanently from a cluster you have to decommission it at some point. If you do it after it's dead it should result in just some cleanup.

There is a proposal here that would make this type of situation better #68586

joshimhoff commented 2 years ago

Doing this procedure would result in a potentially precarious scenario, where you would be underreplicated for the duration of this incident. If another node fails at the same time you would end up with LOQ with RF=3, with reasonable certainty.

This is true but are we sure that the other approaches are actually safer than the one suggested here? I'm not! Also they def take way longer, which is a bad thing for both reliability and ops load.

In the offline compaction case, the node must be taken offline and must stay offline for quite some time (>= 6 hours on a recent cloud cluster). Loss of another node can lead to unavailable ranges. What's the concrete difference in terms of risk?

In the decommission case, the node stays online, so no quorum-y risk. The big problem here is that until all replicas have been replicated away, which can take many, many hours (read: > 1 day), the node may backpressure writes, even if cockroach node drain has been called already. Do I understand that correctly? If I do, I again think that in terms of aggregate risk, the approach described here may be better, despite the risks associated with following it. Do you see it differently, @lunevalex? I guess I do see how durability-wise, the decommissioning approach is safer. But availability-wise, I am not convinced!

There is a proposal here that would make this type of situation better #68586

I'm not sure I see how. IIUC, that issue suggests a new API that does rebalancing just the way decommissioning does. The issue with blocking on rebalancing / decommissioning is then fixing the problem takes too long.

This is not really true anymore after #63999: "The node flapping between live & not live leads to high SQL tail latencies."

Nice!!

To remove a node permanently from a cluster you have to decommission it at some point. If you do it after it's dead it should result in just some cleanup.

I fixed the strawman procedure up; thx!

lunevalex commented 2 years ago

You are correct: with #68586 you still have to wait for data to copy. It just allows you to deal with this without having to add a new node, which may be operationally simpler.

Realistically there is no good answer here. Your choices are to keep the node online and wait for data to copy, potentially having it impact availability, or to kill it, up-replicate, and hope that this does not result in an adverse LOQ event. Given the state of our LOQ tooling, if you end up in that scenario it will be a lot more painful to deal with than the initial problem. So doing anything you can to avoid LOQ in the current state is probably the safest approach, at least until 22.1 when the tooling will improve, but even then there is a long way to go to make that tooling bulletproof.

The question is how often do you see backpressure become a problem when you are doing a decommission to fix a bad LSM?

knz commented 2 years ago

In #72754, we're proposing that decommissioning should only be possible after a node has been shut down. In light of this, how is your suggested process (1 - shut node down 2 - wait for liveness expiry 3 - decommission) different from our (future) recommended decommissioning process?

bdarnell commented 2 years ago
  1. cockroach node drain the leases out of the problem node.
  2. Check that no ranges are under-replicated.
  3. Stop the CRDB process on the problem node. ------- NO MORE CUSTOMER IMPACT MODULO LOSS OF CAPACITY --------------
  4. Wait for the node to be marked as not live. Wait for no under-replicated ranges. Check for no kvprober errors or high latency.
  5. Decommission the dead node (which since dead should just do clean up, NOT any replica rebalancing).
  6. Remove the problem node permanently.
  7. Spin up a replacement node. (This can also be done earlier, tho either way you lose capacity for a while, as it takes time to move replicas over. This can be mitigated by vertically scaling the cluster before running the op.)

The strawman op should be quick to execute on. Seems feasible to me to execute on the whole thing in < 1 hour!

The problem is waiting for all the underreplicated ranges to recover. On paper, this should take just as long as decommissioning, because decommissioning and recovering from node failure are more or less the exact same process. The only difference is that because decommissioning leaves the node up, you aren't consuming any of your fault tolerance budget in the process. If decommissioning is taking significantly longer than the MTTR of a simple node failure, that's a bug that should be addressed.

Decommissioning is (intended to be) the fastest way to remove a node from the cluster without consuming any of your fault tolerance budget (which is just a single node if you have any RF=3 ranges). But if a node is so unhealthy that it's dragging the rest of the cluster down, it's reasonable to deem it failed and kill it without decommissioning (but with draining if possible). You'll be under-replicated for a time, but you'd be in exactly the same state if the node failed in a more abrupt way, so you could just chalk this up against your failure budget and proceed. And in the worst case of a second node failure during this recovery, you still have the data on disk so you could restart the faulty node instead of going through a loss-of-quorum recovery.

Overall I endorse this procedure when one node is causing problems for the rest of the cluster. It's not worth keeping a flaky node around in this state just to avoid preemptively consuming failure budget.

However, @knz said:

In light of this, how is your suggested process (1 - shut node down 2 - wait for liveness expiry 3 - decommission) different from our (future) recommended decommissioning process?

So you intend to cause ranges to be underreplicated any time a node gets decommissioned? That seems problematic to me. The entire purpose of decommissioning is to have a non-failure way to remove all ranges from a live healthy node. We need to continue to supply that (maybe we decouple the "remove replicas" process from the final "permanent removal" step, but we do need a process that doesn't begin with shutting the node down).

More comments on reported timings:

Run an offline compaction. We reject this as it is too slow. It can take >= 4 hours on a 900GB x 4 node cluster, and also cockroach compact doesn't report progress.

The lack of progress reporting can surely be improved, but I agree that this is not a very attractive option. Offline compactions are only helpful for certain kinds of problems ("inverted LSM"). They also scale poorly - as storage density increases, compactions will take longer and there's not much we can do to speed them up.

Decommission the node and then replace it with a new one (can flip the order). We reject this as it is too slow. It can take ~ 1 day to decommission a single CRDB node with a 900 GB disk.

This is surprising and worrying. How does this compare to the time to recover from an actual node failure? If node failures recover faster, there's some sort of bug in decommissioning. If they don't, then there's a bigger problem in that we have a substantial risk of a second failure before the first recovery can complete. We must be able to re-replicate a node's data in a fairly short period of time (using the aggregate bandwidth of the whole cluster).

The scaling math of recovery is complex: as storage density increases, recovery slows down because there's more data to recover. But as the number of nodes in the cluster increases, it speeds up because there is more aggregate bandwidth. Maybe there's an argument analogous to "Why RAID 5 stops working" that high storage densities require either a high node count (to speed recovery) or a high replication factor (so that a failure concurrent with a recovery doesn't lead to critical data loss).

knz commented 2 years ago

In light of this, how is your suggested process (1 - shut node down 2 - wait for liveness expiry 3 - decommission) different from our (future) recommended decommissioning process?

So you intend to cause ranges to be underreplicated any time a node gets decommissioned? That seems problematic to me.

hmm :thinking: this is input that would have been valuable in the discussion for #72754. I'll push it there.

joshimhoff commented 2 years ago

I'll respond in more detail later but...

This is surprising and worrying. How does this compare to the time to recover from an actual node failure? If node failures recover faster, there's some sort of bug in decommissioning. If they don't, then there's a bigger problem in that we have a substantial risk of a second failure before the first recovery can complete. We must be able to re-replicate a node's data in a fairly short period of time (using the aggregate bandwidth of the whole cluster).

[Screenshot taken 2021-12-14, 1:13 PM]

It took 21 hours to replicate away all data from n2.

The default value for both kv.snapshot_rebalance.max_rate and kv.snapshot_recovery.max_rate is 8 MiB/s.

There was around 500GB of logical bytes on n2.

500 GB / 8.0 MiB per second -> 16.5568458 hours.

Perhaps the remaining 5 hours can be chalked up to n2 slowness...

Actually tho, we doubled kv.snapshot_rebalance.max_rate to 16 MiB/s. So it should have finished in 8 hours according to my math.
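
(For reference, bumping those settings looks roughly like the following; the host and certs directory are placeholders, and bumping the recovery rate alongside the rebalance rate is an assumption here, not something stated above:)

```sh
# Both settings default to 8 MiB/s; doubling them to the value mentioned above.
cockroach sql --certs-dir=certs --host=healthy-node:26257 --execute="
  SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '16 MiB';
  SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '16 MiB';"
```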

I don't have a detailed mental model (e.g. are the above rate limits distributed? is logical bytes the right graph to consider?). So I expect the above back-of-envelope calculation may have conceptual errors. Please correct me. How fast do we expect decommissioning to happen?

One other thing I'd add is that not a single time in cloud history has a CRDB node been permanently lost. Loss of availability -> yes, many times. Permanent loss of disk -> never. Note we do not depend on actual local disks attached to the VMs!!

bdarnell commented 2 years ago

500 GB / 8.0 MiB per second -> 16.5568458 hours.

It's more complicated than that: the snapshot rate limits apply per node, so it depends on the size of the cluster (or more precisely the number of eligible peer nodes - in a multi-region cluster only the nodes in the same region count for region-restricted data). So in a 16-node cluster the back of the envelope math would say you could recover 500GB in a little over an hour (with the default of 8 MiB/s). In a cluster that's limping along with only two surviving nodes, it'd take 8 hours.
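
To make that back-of-the-envelope math concrete (illustrative numbers only; real recovery time also depends on disk write throughput, CPU headroom, and range sizes):

```sh
# Rough model: time ≈ bytes to re-replicate / (per-node snapshot rate × eligible peer nodes).
echo "500 * 10^9 / (8 * 1024^2 * 15) / 3600" | bc -l   # 16-node cluster, 15 peers: ~1.1 hours
echo "500 * 10^9 / (8 * 1024^2 * 2)  / 3600" | bc -l   # 2 surviving peers:         ~8.3 hours
echo "500 * 10^9 / (8 * 1024^2 * 1)  / 3600" | bc -l   # a single 8 MiB/s stream:   ~16.6 hours
```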

Here of course you're seeing much worse performance than this math implies. The snapshot rate limits aren't the only factor here - after the snapshot is transferred it must be written to disk before another snapshot can be accepted, for example. But my mental model is that the snapshot rate limit captures a majority of the time taken here, so there's something unknown to me slowing things down.

One other thing I'd add is that not a single time in cloud history has a CRDB node been permanently lost.

Good point. So slow recovery times aren't as catastrophic for CC as they are for self-hosted deployments with less disk redundancy.

As data density recommendations increase we should also make sure we're increasing the snapshot rate limits when we can. Now that we have better admission control, can we safely increase the current defaults of 8M? I know that we have at least one major customer who turns these settings way up (to the point that it causes other problems, but those are solvable).

lunevalex commented 2 years ago

#71814 bumped snapshot rates to 32 MB/s by default.

joshimhoff commented 2 years ago

Thx for explaining, @bdarnell! Cluster was running hot CPU wise. Maybe that explains the extra time. We could certainly gather more data on decommission times in CC.

Overall I endorse this procedure when one node is causing problems for the rest of the cluster. It's not worth keeping a flaky node around in this state just to avoid preemptively consuming failure budget.

@lunevalex what do you make of ^^? If SRE starts executing on the procedure strawmanned here, and if we do lose another disk before the under-replicated ranges become fully replicated again, we will certainly escalate to KV for help with recovering from loss of quorum. So I want someone from KV + SRE to agree on a standard operating procedure. Maybe other KV engineers like @tbg have an opinion?

Perhaps the key Q is the one you asked:

The question is how often do you see backpressure become a problem when you are doing a decommission to fix a bad LSM?

I don't know how to check for backpressure tho. If it's a 21.2 cluster, we can check kvprober, but the recent outage was on a 21.1 cluster, where kvprober only does reads. Is there a metric I can check?

But if there is no backpressure, so long as the node has been cockroach node drained, I don't think I see any downside to just letting the decommissioning process run for a long time. Perhaps our procedure should be (roughly sketched below):

  1. Drain.
  2. Decommission.
  3. Check for backpressure.
  4. Take down node if backpressure causes impact.
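
A rough sketch of steps 1-2 plus a way to watch progress (the node ID, hosts, and certs directory are placeholders):

```sh
# Drain, then start the decommission without blocking the terminal.
cockroach node drain --certs-dir=certs --host=problem-node:26257
cockroach node decommission 2 --wait=none --certs-dir=certs --host=healthy-node:26257

# Replica counts for the node should fall toward zero as data moves off it.
watch -n 60 'cockroach node status --decommission --certs-dir=certs --host=healthy-node:26257'
```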

I think the other key thing is the lower risk of loss of quorum in a cloud context (tho goes without saying it'll happen to us eventually) as per below:

One other thing I'd add is that not a single time in cloud history has a CRDB node been permanently lost.

Good point. So slow recovery times aren't as catastrophic for CC as they are for self-hosted deployments with less disk redundancy.

tbg commented 2 years ago

@lunevalex what do you make of ^^? If SRE starts executing on the procedure strawmanned here, and if we do lose another disk before the under-replicated ranges become fully replicated again, we will certainly escalate to KV for help with recovering from loss of quorum. So I want someone from KV + SRE to agree on a standard operating procedure. Maybe other KV engineers like @tbg have an opinion?

I agree that browned out nodes are more trouble than they're worth and temporarily taking them offline should usually help. If there is a separate disk failure, there is always the option of rebooting the limping node.

joshimhoff commented 2 years ago

Thanks! I'm going to write a playbook that makes clear this is an option for SRE, unless I hear objections soon. Then I will close this issue, unless someone thinks it should stay open for some reason. As a reminder, here is the procedure:

  1. cockroach node drain the leases out of the problem node.
  2. Check that no ranges are under-replicated.
  3. Stop the CRDB process on the problem node. ------- NO MORE CUSTOMER IMPACT MODULO LOSS OF CAPACITY --------------
  4. Wait for the node to be marked as not live. Wait for no under-replicated ranges. Check for no kvprober errors or high latency.
  5. Decommission the dead node (which since dead should just do clean up, NOT any replica rebalancing).
  6. Remove the problem node permanently.
  7. Spin up a replacement node. (This can also be done earlier, tho either way you lose capacity for a while, as it takes time to move replicas over. This can be mitigated by vertically scaling the cluster before running the op.)

If executing on above does go wrong and we have a LOQ event, we will learn from that, I am sure.

tbg commented 2 years ago

Not sure what the LOQ worry is about. We're not destroying a node prematurely here. There shouldn't be (permanent) LOQ as a consequence of the above since you can either restart the down node (thus being in sort of the same state as before doing anything), or the node was no longer necessary for any quorum (i.e. fully decommissioned).

joshimhoff commented 2 years ago

👍

irfansharif commented 2 years ago

See https://github.com/cockroachdb/cockroach/issues/77251 for some more discussion on this topic.

github-actions[bot] commented 1 year ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!