cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: Improve reliability of node liveness #19699

Closed a-robinson closed 3 years ago

a-robinson commented 6 years ago

Opening a tracking/organizational issue for the work behind trying to make node liveness more reliable in clusters with very heavy workloads (e.g. #15332). More thoughts/ideas very welcome.

Problem definition

Node liveness heartbeats time out when a cluster is overloaded. This typically makes things even worse in the cluster, since a node losing its liveness prevents pretty much all other work from completing. Slow node liveness heartbeats are particularly common/problematic during bulk I/O jobs like imports or non-rate-limited restores. Slow heartbeats have also become a problem due to GC queue badness in at least one case.
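For context, here is a minimal sketch (in Go, not CockroachDB's actual implementation; the record and write function names are hypothetical) of why a missed heartbeat cascades: each node must rewrite its liveness record before the previous one expires, and if that write stalls behind other I/O, the node is treated as dead and its leases stop being usable.

```go
// Illustrative only: a toy liveness heartbeat loop. livenessRecord and
// writeLivenessRecord are hypothetical stand-ins, not CockroachDB code.
package main

import (
	"context"
	"fmt"
	"time"
)

type livenessRecord struct {
	NodeID     int
	Epoch      int64
	Expiration time.Time
}

// writeLivenessRecord stands in for the replicated KV write that a real
// heartbeat performs; under heavy load this write is what times out.
func writeLivenessRecord(ctx context.Context, rec livenessRecord) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the write took 50ms
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func heartbeatLoop(nodeID int, interval, ttl time.Duration) {
	rec := livenessRecord{NodeID: nodeID, Epoch: 1}
	for {
		rec.Expiration = time.Now().Add(ttl)
		// The heartbeat must finish well before the previous record expires;
		// if it does not, other nodes consider this node dead and its leases
		// become unusable, which is the cascade described above.
		ctx, cancel := context.WithTimeout(context.Background(), interval)
		if err := writeLivenessRecord(ctx, rec); err != nil {
			fmt.Printf("n%d: heartbeat failed: %v (node will appear dead)\n", nodeID, err)
		} else {
			fmt.Printf("n%d: heartbeat ok, liveness extended to %v\n", nodeID, rec.Expiration)
		}
		cancel()
		time.Sleep(interval)
	}
}

func main() {
	go heartbeatLoop(1, 4500*time.Millisecond, 9*time.Second)
	time.Sleep(10 * time.Second)
}
```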

Options

Experiments/TODOs

a-robinson commented 5 years ago

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

bdarnell commented 5 years ago

I think there’s probably something to be done with prioritizing node liveness batches. Would Andy K’s failure detection not be subject to the same problem where maxing out disk bandwidth causes heartbeats to be missed?

One difficulty with prioritizing node liveness batches is that they're currently handled exactly the same way as regular traffic, so prioritizing them seems to require hacky checks on the keys/ranges involved. Andy K's failure detector would at least move any disk IO involved into a separate subsystem so the boundaries for prioritization would be more clear. (I'm not sure whether his scheme even requires disk access for heartbeats or only on failure).
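One way to picture the "hacky checks on the keys/ranges involved" is a request router that inspects each batch's keys and sends anything touching the liveness key range onto a higher-priority path. The key prefix, request, and scheduler types below are hypothetical, made up for illustration rather than taken from CockroachDB's request path.

```go
// Illustrative sketch of key-based prioritization; the constants and queue
// types are assumptions, not CockroachDB's actual request handling.
package main

import (
	"bytes"
	"fmt"
)

// Hypothetical stand-in for the node liveness key prefix.
var livenessKeyPrefix = []byte("\x00liveness-")

type request struct {
	Key []byte
}

type scheduler struct {
	highPriority chan request // heartbeats and other liveness writes
	normal       chan request // everything else
}

// route is the "hacky check": it special-cases requests by key prefix because
// liveness traffic is otherwise indistinguishable from regular KV traffic.
func (s *scheduler) route(r request) {
	if bytes.HasPrefix(r.Key, livenessKeyPrefix) {
		s.highPriority <- r
	} else {
		s.normal <- r
	}
}

func main() {
	s := &scheduler{
		highPriority: make(chan request, 16),
		normal:       make(chan request, 16),
	}
	s.route(request{Key: append(livenessKeyPrefix, '1')})
	s.route(request{Key: []byte("table/52/1/42")})
	fmt.Println("liveness queue:", len(s.highPriority), "normal queue:", len(s.normal))
}
```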

Another problem with the current scheme is that there are two critical ranges: both range 1 and the liveness range must be up for the system to work. I believe Andy's plan would make failure detection independent of range 1 (and of any single range).

benesch commented 5 years ago

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

There are six proposed experiments in the issue description that don't seem to be associated with a more specific issue. It'd be a shame to lose track of those!

tbg commented 5 years ago

I would also leave it open with the goal of (in the 2.2 time frame) examining/prototyping Andy's failure detector and coming to a decision on whether to try to implement it in the foreseeable future.

awoods187 commented 5 years ago

@ajwerner I know you made progress here recently. Have you updated the checklist at the top/reviewed if it is still relevant?

ajwerner commented 5 years ago

Many of the mitigations proposed in this issue deal with I/O starvation. We have yet to make progress on that front. In recent conversations with @dt it seems like ideas such as

Prototype a throttler that adjusts bulk I/O throughput in reaction to the latency of disk syncs

are as relevant as ever.
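As a sketch of what such a throttler could look like: an AIMD-style controller that halves the allowed bulk-write rate whenever observed disk-sync latency exceeds a target and slowly recovers it otherwise. The thresholds, field names, and structure are assumptions for illustration, not a proposed or existing CockroachDB implementation.

```go
// Illustrative AIMD throttler keyed off disk-sync latency; all numbers and
// types here are assumptions for the sketch.
package main

import (
	"fmt"
	"math"
	"time"
)

type bulkThrottler struct {
	rateMBps      float64 // current allowed bulk-write rate
	minMBps       float64
	maxMBps       float64
	latencyTarget time.Duration
}

// observeSync is called with the latency of each disk sync. When syncs get
// slow (a sign that foreground writes, including liveness heartbeats, are
// queuing behind bulk I/O), the bulk rate is cut multiplicatively; otherwise
// it is raised additively back toward the maximum.
func (t *bulkThrottler) observeSync(latency time.Duration) {
	if latency > t.latencyTarget {
		t.rateMBps = math.Max(t.minMBps, t.rateMBps/2)
	} else {
		t.rateMBps = math.Min(t.maxMBps, t.rateMBps+5)
	}
}

func main() {
	t := &bulkThrottler{rateMBps: 200, minMBps: 10, maxMBps: 200, latencyTarget: 100 * time.Millisecond}
	syncs := []time.Duration{20 * time.Millisecond, 250 * time.Millisecond, 400 * time.Millisecond, 30 * time.Millisecond}
	for _, l := range syncs {
		t.observeSync(l)
		fmt.Printf("sync=%v -> bulk rate %.0f MB/s\n", l, t.rateMBps)
	}
}
```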

That being said, #39172 separates the network connection for range 1 and node liveness and seems to be effective at protecting node liveness in the face of CPU overload, which has proven to be a bigger problem than this issue anticipated. It’s not clear that this is the “prioritization” mechanism envisioned by this issue.
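The general idea behind separating the connection can be pictured as system-critical RPCs dialing their own connection so they never share a transport stream (and its head-of-line blocking and buffers) with bulk traffic. The connection-class names and dialer below are hypothetical, not the actual rpc package change in #39172.

```go
// Illustrative sketch of per-class connections; class names and the lazy
// dialer are assumptions, not CockroachDB's rpc package.
package main

import (
	"fmt"
	"net"
)

type connClass int

const (
	defaultClass connClass = iota // regular KV traffic
	systemClass                   // range 1 and node liveness traffic
)

type peer struct {
	addr  string
	conns map[connClass]net.Conn
}

// connFor returns a dedicated connection for the given class, dialing lazily.
// Keeping system traffic on its own connection means a saturated default
// connection cannot delay liveness heartbeats at the transport level.
func (p *peer) connFor(class connClass) (net.Conn, error) {
	if c, ok := p.conns[class]; ok {
		return c, nil
	}
	c, err := net.Dial("tcp", p.addr)
	if err != nil {
		return nil, err
	}
	p.conns[class] = c
	return c, nil
}

func main() {
	p := &peer{addr: "localhost:26257", conns: map[connClass]net.Conn{}}
	if _, err := p.connFor(systemClass); err != nil {
		fmt.Println("dial failed (expected if no server is listening):", err)
	}
}
```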

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

nvanbenschoten commented 3 years ago

Closing. We made improvements in this area over the past few years. Most notably, we've isolated network resources for node liveness traffic in https://github.com/cockroachdb/cockroach/pull/39172 and improved the behavior of the Raft scheduler and its handling of the node liveness range under CPU starvation in https://github.com/cockroachdb/cockroach/pull/56860.
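The shape of the scheduler improvement can be pictured as giving the liveness range work that cannot queue behind every other range, for example via a reserved worker. The sketch below is a simplified assumption of that idea, not the change actually made in #56860.

```go
// Simplified illustration: a reserved scheduler worker for the liveness range
// so it cannot be starved behind a long general queue. Details (range IDs,
// channel sizes) are assumptions.
package main

import (
	"fmt"
	"sync"
)

type rangeID int

const livenessRangeID rangeID = 2 // hypothetical ID for the liveness range

type raftScheduler struct {
	general  chan rangeID // shared workers process most ranges
	reserved chan rangeID // dedicated worker for the liveness range
	wg       sync.WaitGroup
}

func (s *raftScheduler) enqueue(id rangeID) {
	if id == livenessRangeID {
		s.reserved <- id // never waits behind the general queue
	} else {
		s.general <- id
	}
}

func (s *raftScheduler) start(process func(rangeID)) {
	worker := func(ch chan rangeID) {
		defer s.wg.Done()
		for id := range ch {
			process(id)
		}
	}
	s.wg.Add(2)
	go worker(s.general)
	go worker(s.reserved)
}

func main() {
	s := &raftScheduler{general: make(chan rangeID, 128), reserved: make(chan rangeID, 8)}
	s.start(func(id rangeID) { fmt.Println("processing range", id) })
	for _, id := range []rangeID{5, 7, livenessRangeID, 9} {
		s.enqueue(id)
	}
	close(s.general)
	close(s.reserved)
	s.wg.Wait()
}
```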