cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

storage: Improve reliability of node liveness #19699

Closed a-robinson closed 3 years ago

a-robinson commented 6 years ago

Opening a tracking/organizational issue for the work behind trying to make node liveness more reliable in clusters with very heavy workloads (e.g. #15332). More thoughts/ideas very welcome.

Problem definition

Node liveness heartbeats time out when a cluster is overloaded. This typically makes things even worse in the cluster, since a node losing its liveness prevents pretty much all other work from completing. Slow node liveness heartbeats are particularly common/problematic during bulk I/O jobs like imports or non-rate-limited restores. Slow heartbeats have also become a problem due to GC queue badness in at least one case.
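For context, here is a minimal sketch (in Go, not CockroachDB's actual implementation; the record and write function names are hypothetical) of why a missed heartbeat cascades: each node must rewrite its liveness record before the previous one expires, and if that write stalls behind other I/O, the node is treated as dead and its leases stop being usable.

```go
// Illustrative only: a toy liveness heartbeat loop. livenessRecord and
// writeLivenessRecord are hypothetical stand-ins, not CockroachDB code.
package main

import (
	"context"
	"fmt"
	"time"
)

type livenessRecord struct {
	NodeID     int
	Epoch      int64
	Expiration time.Time
}

// writeLivenessRecord stands in for the replicated KV write that a real
// heartbeat performs; under heavy load this write is what times out.
func writeLivenessRecord(ctx context.Context, rec livenessRecord) error {
	select {
	case <-time.After(50 * time.Millisecond): // pretend the write took 50ms
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func heartbeatLoop(nodeID int, interval, ttl time.Duration) {
	rec := livenessRecord{NodeID: nodeID, Epoch: 1}
	for {
		rec.Expiration = time.Now().Add(ttl)
		// The heartbeat must finish well before the previous record expires;
		// if it does not, other nodes consider this node dead and its leases
		// become unusable, which is the cascade described above.
		ctx, cancel := context.WithTimeout(context.Background(), interval)
		if err := writeLivenessRecord(ctx, rec); err != nil {
			fmt.Printf("n%d: heartbeat failed: %v (node will appear dead)\n", nodeID, err)
		} else {
			fmt.Printf("n%d: heartbeat ok, liveness extended to %v\n", nodeID, rec.Expiration)
		}
		cancel()
		time.Sleep(interval)
	}
}

func main() {
	go heartbeatLoop(1, 4500*time.Millisecond, 9*time.Second)
	time.Sleep(10 * time.Second)
}
```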

Options

Experiments/TODOs

a-robinson commented 5 years ago

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

bdarnell commented 5 years ago

I think there’s probably something to be done with prioritizing node liveness batches. Would Andy K’s failure detection not be subject to the same problem where maxing out disk bandwidth causes heartbeats to be missed?

One difficulty with prioritizing node liveness batches is that they're currently handled exactly the same way as regular traffic, so prioritizing them seems to require hacky checks on the keys/ranges involved. Andy K's failure detector would at least move any disk IO involved into a separate subsystem so the boundaries for prioritization would be more clear. (I'm not sure whether his scheme even requires disk access for heartbeats or only on failure).
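One way to picture the "hacky checks on the keys/ranges involved" is a request router that inspects each batch's keys and sends anything touching the liveness key range onto a higher-priority path. The key prefix, request, and scheduler types below are hypothetical, made up for illustration rather than taken from CockroachDB's request path.

```go
// Illustrative sketch of key-based prioritization; the constants and queue
// types are assumptions, not CockroachDB's actual request handling.
package main

import (
	"bytes"
	"fmt"
)

// Hypothetical stand-in for the node liveness key prefix.
var livenessKeyPrefix = []byte("\x00liveness-")

type request struct {
	Key []byte
}

type scheduler struct {
	highPriority chan request // heartbeats and other liveness writes
	normal       chan request // everything else
}

// route is the "hacky check": it special-cases requests by key prefix because
// liveness traffic is otherwise indistinguishable from regular KV traffic.
func (s *scheduler) route(r request) {
	if bytes.HasPrefix(r.Key, livenessKeyPrefix) {
		s.highPriority <- r
	} else {
		s.normal <- r
	}
}

func main() {
	s := &scheduler{
		highPriority: make(chan request, 16),
		normal:       make(chan request, 16),
	}
	s.route(request{Key: append(livenessKeyPrefix, '1')})
	s.route(request{Key: []byte("table/52/1/42")})
	fmt.Println("liveness queue:", len(s.highPriority), "normal queue:", len(s.normal))
}
```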

Another problem with the current scheme is that there are two critical ranges: both range 1 and the liveness range must be up for the system to work. I believe Andy's plan would make failure detection independent of range 1 (and of any single range).

benesch commented 5 years ago

At this point this is an issue tracking a broad concern with the system rather than a specific action we want to take. Node liveness still needs to be improved further over time, but I don't care whether we keep this issue open or just open more specific ones when we decide to do more work here.

There are six proposed experiments in the issue description that don't seem to be associated with a more specific issue. It'd be a shame to lose track of those!

tbg commented 5 years ago

I would also leave it open with the goal of (in the 2.2 time frame) examining/prototyping Andy's failure detector and coming to a decision on whether to try to implement it in the foreseeable future.

awoods187 commented 5 years ago

@ajwerner I know you made progress here recently. Have you updated the checklist at the top/reviewed if it is still relevant?

ajwerner commented 5 years ago

Many of the mitigations proposed in this issue deal with I/O starvation. We have yet to make progress on that front. In recent conversations with @dt it seems like ideas such as

Prototype a throttler that adjusts bulk I/O throughput in reaction to the latency of disk syncs

are as relevant as ever.
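As a sketch of what such a throttler could look like: an AIMD-style controller that halves the allowed bulk-write rate whenever observed disk-sync latency exceeds a target and slowly recovers it otherwise. The thresholds, field names, and structure are assumptions for illustration, not a proposed or existing CockroachDB implementation.

```go
// Illustrative AIMD throttler keyed off disk-sync latency; all numbers and
// types here are assumptions for the sketch.
package main

import (
	"fmt"
	"math"
	"time"
)

type bulkThrottler struct {
	rateMBps      float64 // current allowed bulk-write rate
	minMBps       float64
	maxMBps       float64
	latencyTarget time.Duration
}

// observeSync is called with the latency of each disk sync. When syncs get
// slow (a sign that foreground writes, including liveness heartbeats, are
// queuing behind bulk I/O), the bulk rate is cut multiplicatively; otherwise
// it is raised additively back toward the maximum.
func (t *bulkThrottler) observeSync(latency time.Duration) {
	if latency > t.latencyTarget {
		t.rateMBps = math.Max(t.minMBps, t.rateMBps/2)
	} else {
		t.rateMBps = math.Min(t.maxMBps, t.rateMBps+5)
	}
}

func main() {
	t := &bulkThrottler{rateMBps: 200, minMBps: 10, maxMBps: 200, latencyTarget: 100 * time.Millisecond}
	syncs := []time.Duration{20 * time.Millisecond, 250 * time.Millisecond, 400 * time.Millisecond, 30 * time.Millisecond}
	for _, l := range syncs {
		t.observeSync(l)
		fmt.Printf("sync=%v -> bulk rate %.0f MB/s\n", l, t.rateMBps)
	}
}
```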

That being said, #39172 separates the network connection for range 1 and node liveness and seems to be effective at protecting node liveness in the face of CPU overload, which has proven to be a bigger problem than this issue anticipated. It’s not clear that this is the “prioritization” mechanism envisioned by this issue.
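The general idea behind separating the connection can be pictured as system-critical RPCs dialing their own connection so they never share a transport stream (and its head-of-line blocking and buffers) with bulk traffic. The connection-class names and dialer below are hypothetical, not the actual rpc package change in #39172.

```go
// Illustrative sketch of per-class connections; class names and the lazy
// dialer are assumptions, not CockroachDB's rpc package.
package main

import (
	"fmt"
	"net"
)

type connClass int

const (
	defaultClass connClass = iota // regular KV traffic
	systemClass                   // range 1 and node liveness traffic
)

type peer struct {
	addr  string
	conns map[connClass]net.Conn
}

// connFor returns a dedicated connection for the given class, dialing lazily.
// Keeping system traffic on its own connection means a saturated default
// connection cannot delay liveness heartbeats at the transport level.
func (p *peer) connFor(class connClass) (net.Conn, error) {
	if c, ok := p.conns[class]; ok {
		return c, nil
	}
	c, err := net.Dial("tcp", p.addr)
	if err != nil {
		return nil, err
	}
	p.conns[class] = c
	return c, nil
}

func main() {
	p := &peer{addr: "localhost:26257", conns: map[connClass]net.Conn{}}
	if _, err := p.connFor(systemClass); err != nil {
		fmt.Println("dial failed (expected if no server is listening):", err)
	}
}
```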

github-actions[bot] commented 3 years ago

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

nvanbenschoten commented 3 years ago

Closing. We made improvements in this area over the past few years. Most notably, we've isolated network resources for node liveness traffic in https://github.com/cockroachdb/cockroach/pull/39172 and improved the behavior of the Raft scheduler and its handling of the node liveness range under CPU starvation in https://github.com/cockroachdb/cockroach/pull/56860.
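The shape of the scheduler improvement can be pictured as giving the liveness range work that cannot queue behind every other range, for example via a reserved worker. The sketch below is a simplified assumption of that idea, not the change actually made in #56860.

```go
// Simplified illustration: a reserved scheduler worker for the liveness range
// so it cannot be starved behind a long general queue. Details (range IDs,
// channel sizes) are assumptions.
package main

import (
	"fmt"
	"sync"
)

type rangeID int

const livenessRangeID rangeID = 2 // hypothetical ID for the liveness range

type raftScheduler struct {
	general  chan rangeID // shared workers process most ranges
	reserved chan rangeID // dedicated worker for the liveness range
	wg       sync.WaitGroup
}

func (s *raftScheduler) enqueue(id rangeID) {
	if id == livenessRangeID {
		s.reserved <- id // never waits behind the general queue
	} else {
		s.general <- id
	}
}

func (s *raftScheduler) start(process func(rangeID)) {
	worker := func(ch chan rangeID) {
		defer s.wg.Done()
		for id := range ch {
			process(id)
		}
	}
	s.wg.Add(2)
	go worker(s.general)
	go worker(s.reserved)
}

func main() {
	s := &raftScheduler{general: make(chan rangeID, 128), reserved: make(chan rangeID, 8)}
	s.start(func(id rangeID) { fmt.Println("processing range", id) })
	for _, id := range []rangeID{5, 7, livenessRangeID, 9} {
		s.enqueue(id)
	}
	close(s.general)
	close(s.reserved)
	s.wg.Wait()
}
```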