Enable bounding recovery time of LSM inversion to a configurable number of minutes, like N=60 minutes

cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.

https://www.cockroachlabs.com

Other

29.97k stars 3.79k forks source link

Enable bounding recovery time of LSM inversion to a configurable number of minutes, like N=60 minutes #86784

Open joshimhoff opened 2 years ago

joshimhoff commented 2 years ago

Is your feature request related to a problem? Please describe. A question from @jreut: Can we enable bounding recovery time of LSM inversion to a configurable number of minutes, like N=60 minutes?

Current behavior: Allow an LSM to invert indefinitely. (I believe (?) there might actually be some upper limit in pebble on number of files in L0, but experience shows that it is set so high that it doesn't bound time to fix to a reasonable number like this ticket suggests.)

The good: By allowing an LSM to invert indefinitely, perf degrades gracefully. In particular, read perf degrades further & further as the LSM inverts more & more. In theory (not 100% sure this is true), write perf shouldn't even be negatively effected, and there should never be no unavailability. In practice, since LSM inverting indicates a gap in AC, we often actually see hardware overload leading to degradations in write perf & even hard unavailability due to node liveness failures.

The bad: By allowing an LSM to invert indefinitely, the LSM gets more & more inverted. This increases time to fix of the poor read perf to a greater & greater number. Time to fix of outages of this sort can range from 4-12 hours in CC contexts! We need to resolve CC incidents faster than that, both because long outages lead to churn & because managing long outages has significant human costs.

Describe the solution you'd like Proposed new behavior:

Pebble itself should fail all writes when an LSM has gotten inverted enough that time to fix is estimated at N=60 minutes.
The actual threshold should prob be in terms of L0 sublevels or similar, but we would set it in CC based on a time to fix goal.
The pebble fail writes threshold should be greater than the AC queue threshold, since AC queueing is a more graceful failure mode than pebble failing writes.
(maybe) Don't fail critical writes such as to the node liveness data.
(maybe) Need hysteresis on that threshold.

Typical behavior in case of storage overload: AC does queueing. Nothing changes from the current behavior.

Behavior in case the AC control fails to keep an LSM from inverting:

LSM inverts but once number of L0 sublevels hits the threshold all writes to pebble fail. It is simply NOT possible for the LSM to have more sublevels than the limit.
SRE gets paged.
SRE + etc. look for the inverting workload & either stop it or close the AC gap with DB eng. LSM un-inverts, with a reasonable time to fix, e.g. N=60 minutes.
If eliminating the inverting workload is taking too much time, and SRE or customer would prefer a much longer read perf degradation to the current outage, SRE or customer can up the pebble fail at this many sublevels in L0 setting, or even turn it off completely! It is always possible for a human in the loop to revert to the current behavior, which improves the customer experience at outage time in exchange for a longer recovery time.

Additional context Internal link discussing CC data plane reliability including outages of this kind: https://docs.google.com/document/d/10dpBiAug-BIJZcLxWitWkmamNVu5fZYCeIJ0_ukR3n4/edit

Jira issue: CRDB-18932

bdarnell commented 2 years ago

I agree with the overall sentiment here of "when you're in a hole, stop digging", and that includes taking fairly drastic measures as a last resort when less disruptive safeguards (admission control) have failed.

Is blocking all writes to pebble the right drastic measure? We don't do a good job of distinguishing read and write operations today; I think it's likely that a node that can't write to disk is as good as totally down, so rather than introduce a new state of "up, but can't write to pebble" it's probably better to just shut the node down (and then immediately begin a "manual" compaction?).

I think an inverted LSM is similar in its effect to a disk stall, and we should consider handling them the same way. Shut down the node with as much grace as we can manage under the circumstances: try to drain leases away and maybe try to decommission ourselves (if the inverted LSM doesn't impede that process too much).

jbowens commented 2 years ago

it's probably better to just shut the node down (and then immediately begin a "manual" compaction?).

I agree.

@joshimhoff do you have any information on the origin of those LSM inversions in Cloud? Which resources do we see exhausted (CPU? disk bandwidth? or no hardware resources, just compaction concurrency?). We didn't get to dynamically adjusting compaction concurrency (cockroachdb/pebble#1329) for the 22.2 release, and I'm wondering if it's worth backporting a very crude, conservative heuristic to scale compaction concurrency with read amplification even in the absence of any knowledge of resource utilization.

joshimhoff commented 2 years ago

I agree with the overall sentiment here of "when you're in a hole, stop digging", and that includes taking fairly drastic measures as a last resort when less disruptive safeguards (admission control) have failed.

@jbowens, do you agree with this sentiment?

it's probably better to just shut the node down (and then immediately begin a "manual" compaction?).

I'm digesting this still, but I feel more strongly that we would benefit from a hard no L0 sublevels over this number threshold (or similar) at a low layer in the stack like storage than I do with the exact action CRDB takes in case that threshold is hit!

do you have any information on the origin of those LSM inversions in Cloud?

Here is an internal-only writeup with additional links: https://docs.google.com/document/d/10dpBiAug-BIJZcLxWitWkmamNVu5fZYCeIJ0_ukR3n4/edit. The "cause" is always AC gaps, including ones that still exist, such as the no AC on followers gap (some progress has been made in 22.2 but with the progress feature flagged off).

Re: that heuristic, seems like an interesting idea!

jbowens commented 2 years ago

do you agree with this sentiment?

Yeah, there exists some threshold where read amplification is too high for a node to be effective. When that threshold is crossed, there's no point to continuing to pile on compaction debt when the node is already effectively unavailable.

I'm digesting this still, but I feel more strongly that we would benefit from a hard no L0 sublevels over this number threshold (or similar) at a low layer in the stack like storage than I do with the exact action CRDB takes in case that threshold is hit!

Makes sense. I don't think there's much we can do besides bring down the node, at least at the storage layer. Rejecting writes would probably only exacerbate any cluster instability as a result of the overload. Ben's comparison to a disk stall is apt.

The "cause" is always AC gaps

But is there a trend to what resource we exhaust as a result? The AC gap allows more writes than it should given the resources available, which exhausts some resource limit and causes compactions to fall behind. But some resource needs to be exhausted to cause the inversion, whether it be disk, CPU or more artificial, self-imposed resource limits like compaction concurrency.

joshimhoff commented 2 years ago

Yeah, there exists some threshold where read amplification is too high for a node to be effective.

Nice! The more important thing at an SRE level is to set the threshold based on the worst case outage length we accept. That is, we might want to not allow so many sublevels in L0 that expected recovery takes longer than an hour. It might NOT be the case that read amp at that point is not so high the node is totally ineffective. OTOH maybe in practice these conditions lead to roughly the same threshold on sublevels in L0.

But some resource needs to be exhausted to cause the inversion, whether it be disk, CPU or more artificial, self-imposed resource limits like compaction concurrency.

Makes sense! I don't have great sense of which resource has been exhausted. We'd def both vertically scaled disks including iops / bandwidth AND uped compaction concurrency.

jbowens commented 2 years ago

We revisited this during storage triage. Given that the only reasonable action that we could take is to force the node to shutdown, maybe this issue doesn't require Cockroach-level code changes. If required, can SREs adjust either tooling or runbooks to shutdown nodes when read amplification exceeds a certain level?

We should also chat about some recent improvements made to reduce recovery time. Maybe some of these improvements are enough to avoid needing to force shutdown nodes.

joshimhoff commented 2 years ago

If required, can SREs adjust either tooling or runbooks to shutdown nodes when read amplification exceeds a certain level?

SRE time from page to keyboard is 50m. Might go down to 30m but it's 50m right now. Either way, I think that is too slow! Also, I think we should automate what we can automate, and this seems "easy" to automate.

Then the Q is where is the right place for the "automation" to live. I'd argue inside of CRDB. We have no watchdog component that exists outside the DB to take the shutdown op. We could add a watchdog process, but I think it's a lot of complexity that isn't motivated sufficiently by this single ticket.

The next Q might be why not add this control to CRDB, especially if by default the control is not enabled? Is it complex to implement? Are there open Qs on storage team's mind about the suggested control?

We should also chat about some recent improvements made to reduce recovery time.

Sounds great!

Maybe some of these improvements are enough to avoid needing to force shutdown nodes.

I think the fact that we are reducing the likelihood of LSM inversion doesn't change the soundness of the principle articulated by Ben here:

I agree with the overall sentiment here of "when you're in a hole, stop digging", and that includes taking fairly drastic measures as a last resort when less disruptive safeguards (admission control) have failed.

Maybe we can 1:1 about this ticket (& other stuff as per your above note)? If you are up for it, I'll find some time.

jbowens commented 2 years ago

I think the fact that we are reducing the likelihood of LSM inversion doesn't change the soundness of the principle articulated by Ben here:

I agree with the principle, but I also think we need to adhere to a principle of "do no harm," and intentionally limiting availability has the potential to do much harm. It risks turning degraded performance into a full outage.

Maybe we can 1:1 about this ticket (& other stuff as per your above note)? If you are up for it, I'll find some time.

Sounds good, I think @mwang1026 is also interested.

joshimhoff commented 2 years ago

I agree with the principle, but I also think we need to adhere to a principle of "do no harm," and intentionally limiting availability has the potential to do much harm. It risks turning degraded performance into a full outage.

Yes, I agree that principle is very important too. The inherent tension between these two principles is the crux of the trickiness of this ticket. Despite the trickiness, my view is that heavily degraded perf for N=12 hours is worse for the user than harder unavailability for 30m. I think perhaps to sell folks on this argument, I should run some experiments demonstrating in terms of a known workload exactly what the tradeoff will look like. As suggested up above, I think whether my claim above is right depends on what we mean concretely by "heavily degraded". A minor degradation in perf can be handled for N=12 hours, but I think the degradation that happens in practice is too severe for a N=12 hours to be acceptable.

Will schedule some time.

bdarnell commented 2 years ago

Then the Q is where is the right place for the "automation" to live. I'd argue inside of CRDB. We have no watchdog component that exists outside the DB to take the shutdown op. We could add a watchdog process, but I think it's a lot of complexity that isn't motivated sufficiently by this single ticket.

I'll take the other side of this: putting it in CRDB will at a minimum slow time-to-delivery since a change would need to be built and backported to every supported version and then rolled out. A CC monitoring component with the ability to restart nodes could be used for this and potentially other things too. Furthermore, a monitoring/orchestration approach naturally lends itself to a cluster-wide view and could avoid taking out too many nodes at a time (which is probably the biggest risk with building any sort of self-termination logic into the DB code).

joshimhoff commented 2 years ago

Thanks for taking the the other side, @bdarnell!

I'm very interested in the idea of this new component, mainly because as per internal link I observe that the majority of the tricky "architectural" issues in KV we have today, which lead to persistent outage in case of poor conditions in network & similar, can be resolved with a low time to fix just by restarting CRDB nodes. But I also have a lot of questions about it. I wonder if you can give some input.

What is the scope of this watchdog component? How do we decide when to put a control in it vs. in the DB itself like we already have done with the disk scale detector?
Who owns this watchdog component? If it is SRE (I'd argue we should at least be a co-developer), how do we test the component together with the DB? How do we collaborate on the component with DB eng? The component can do a lot of damage to the data plane, so I think these Qs are important.
Is the watchdog component CC only? Why?

And then more tactically:

Furthermore, a monitoring/orchestration approach naturally lends itself to a cluster-wide view and could avoid taking out too many nodes at a time (which is probably the biggest risk with building any sort of self-termination logic into the DB code).

Is this idea relevant in case of the control mentioned in this ticket? The idea is that at a certain level of LSM inversion, it is better to fail hard, so additional LSM inversion doesn't happen, else we can have an arbitrarily long outage, as in practice we do have 6+ hour LSM inversion outages. I feel like we should fail hard in case of too much LSM inversion regardless of the state of the other nodes in the cluster based on this argument.

andrewbaptist commented 2 years ago

One interesting point to consider with any automated "shutdown" solution is the impact on the cluster-wide level. Specifically, consider an operation that causes LSM inversion on every node in the system that would take >60 minutes to recover from. If this caused a system-level shutdown of nodes, and they all fail to restart because of too much LSM inversion this would be very bad.

One advantage of an external watchdog is it could grow to be a little smarter over time. For instance, if it sees one node with LSM inversion and the rest of the nodes close to the limit, it could take a different action than if one node has LSM inversion and the rest are completely healthy. That said, it is easy to consider cases where the watchdog makes things worse rather than better. An example could be a moving hot-key that causes LSM inversion.

I personally lean more towards nodes keeping themselves healthy in all cases. What this means in practice is a system limit on LSM inversion. The 60-minute repair goal mentioned above seems reasonable. After the limit is hit, the store begins altering its behavior to lead to recovery. A full shutdown of writes should not be required, but instead something like Golang GCAssist. An example of how this could work is that after a new L0 sublevel is created if it determines that there are "too many" - it immediately runs additional compactions. A lot of hand-waviness about exactly where this should be injected, but the goal would be that this is only hit in extreme cases and is the least bad way to recover. The alternatives are bleak (either we get into a situation that takes support hours to dig out of, or we shutdown the node and affect system availability).

The goal would be that we never hit this limit (because of AC protection and allocator protection), but having this failsafe built into cockroachdb will ensure it is there in all deployments. I agree with @bdarnell point that this will mean it takes longer to get to all existing systems, and so it makes sense to have a CC monitoring system short term, but eventually that would ideally be deprecated.

jbowens commented 2 years ago

I should run some experiments demonstrating in terms of a known workload exactly what the tradeoff will look like.

I'd be interested in these results. I think there should be a very high bar to intentionally taking a hard outage at the cluster level, so I think Ben's point about taking into account the wider view of the cluster is relevant here. I'm unsure how necessary a stop-node failsafe will be if we lift the compaction concurrency limit with a simple heuristic based on L0 sublevels. Increasing the compaction concurrency helps compactions reshape the LSM faster, but it can also slow the rate of incoming writes by consuming disk bandwidth and CPU, leaving fewer resources for memtable flushes.

I'm a little wary of relying on the Cockroach process's cooperation in the general class of issues that lead us to restart a node. We rely on the Cockroach process 's cooperation in exiting when it encounters a disk stall, but we've had issues with gaps in detection (cockroachlabs/support#1520) and bugs in actually terminating the process (eg, blocking on a log file write during a Fatal call #81025).

joshimhoff commented 2 years ago

@andrewbaptist & @jbowens, that all makes sense! Especially this point:

Increasing the compaction concurrency helps compactions reshape the LSM faster, but it can also slow the rate of incoming writes by consuming disk bandwidth and CPU, leaving fewer resources for memtable flushes.

I wonder if we can take this idea & turn into a stronger guarantee of max L0 sublevels / time to un-invert.

bdarnell commented 2 years ago

What is the scope of this watchdog component? How do we decide when to put a control in it vs. in the DB itself like we already have done with the disk scale detector?

The most important consideration is our confidence that the metric represents a state that is A) bad enough to risk disruptive measures (i.e. how badly is performance degraded?) and B) likely to be specific to one node (as opposed to something likely to affect the whole cluster or all replicas of a certain range). Both of these things should be true for a node to consider terminating itself; if not it's better to locate the decision in a component with a more holistic view with rules like "do not terminate more than one node per hour for an inverted LSM".

A secondary consideration is whether the metric is already observable or whether we need to roll out new patches anyway just to get visibility.

Who owns this watchdog component? Is the watchdog component CC only?

I think this is a good fit for the k8s operator, right?

how do we test the component together with the DB? How do we collaborate on the component with DB eng?

I think the process is going to vary on a case-by-case basis but it would start with what you do manually - how do you work with the DB teams to decide that killing a node is the right response to an inverted LSM? And then you'd extend that engagement to figuring out what the right leading indicators are that warrant an automated restart.

joshimhoff commented 2 years ago

Thanks, Ben! Those are helpful comments.

github-actions[bot] commented 7 months ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!