storage: Improve reliability of node liveness

a-robinson commented 6 years ago

Opening a tracking/organizational issue for the work behind trying to make node liveness more reliable in clusters with very heavy workloads (e.g. #15332). More thoughts/ideas very welcome.

Problem definition

Node liveness heartbeats time out when a cluster is overloaded. This typically makes things even worse in the cluster, since nodes losing their liveness prevents pretty much all other work from completing. Slow node liveness heartbeats are particularly common/problematic during bulk I/O jobs like imports or non-rate-limited restores. Slow heartbeats have also become a problem due to GC queue badness in at least one case

Options

Detect bad disks on startup and just refuse to run if they’re bad
- This wouldn’t fix the IMPORT issues we’ve been seeing on GCE - GCE’s disks aren’t that bad
Implement prioritization of different kinds of work (e.g. liveness > normal KV operations > bulk I/O)
- Would just doing this for disk I/O be enough? Or would we need network prioritization as well?
- Likely a pretty large project to propagate priority info around everywhere, and it looks like the only idea of priority that RocksDB has is a low-priority option
Move node liveness in-memory
- Would have the broadest benefits, but we’d need to be very careful, since node liveness is intimately tied to correctness. I’ve spent some time thinking about this but haven’t been able to convince myself of a way to make it safe. We may be able to do something that allows heartbeats to work in memory and only put epoch increments to disk, but the liveness range would need a lot more special handling. Details TBD.
- Even if we did this, bulk I/O still might slow down OLTP workloads more than is acceptable, necessitating other work to keep it under control.
Reduce GC threshold for node liveness range
- Seems like an easy, but very small, win. Unlikely to fix much by itself.
Learn how to detect and avoid I/O throttling / slowdown
- Something along the lines of TCP’s flow control, e.g. start slow, measure latency, speed up writes until latency starts increasing beyond acceptable levels, then back off to stabler I/O rates.
Tweak RocksDB and/or filesystem settings to reduce impact of bulk I/O
- Peter has already found a couple sysctl settings that help, and there may be others. This wouldn’t really improve node liveness in general, but might be able to avoid the worst issues during imports/restores.

Experiments/TODOs

[x] Don’t increment another node’s epoch if the new proposed leaseholder is also not live - insufficient (https://github.com/cockroachdb/cockroach/pull/19436#issuecomment-340057144)
[x] Cancel node liveness heartbeat contexts less aggressively - insufficient (https://github.com/cockroachdb/cockroach/pull/19436#issuecomment-340057144)
[ ] Optimistically gossip node liveness in advance of heartbeats succeeding - initially unsuccessful (https://github.com/cockroachdb/cockroach/pull/19436#issuecomment-339206066), but may merit additional consideration
[x] Default the node liveness range’s GC threshold much lower (#17628)
[ ] Use low_pri RocksDB WriteOption? Would need to bump our RocksDB version
[ ] Disable RocksDB WAL on the temp store? It supposedly hurt performance, but might reduce I/O
[ ] Double check that the “blocking” of writes caused by IngestExternalFile isn’t causing problems
[x] Additional miscellaneous RocksDB/sysctl setting tweaks (https://github.com/cockroachdb/cockroach/pull/20442)
[ ] Prototype a throttler that adjusts bulk I/O throughput in reaction to the latency of disk syncs
[ ] Work out details and write RFC for doing node liveness via an in-memory mechanism
[ ] Work out details and write RFC for some sort of priority mechanism
[ ] Make command queue work LIFO instead of FIFO for liveness records and intents (https://github.com/cockroachdb/cockroach/issues/19699#issuecomment-340946319)
[ ] Move node liveness to non-versioned keys (https://github.com/cockroachdb/cockroach/issues/20399)

bdarnell commented 6 years ago

I think there's an opportunity to add more upstream-of-raft deduping (for this and certain other operations including intent resolution). When lots of heartbeat updates occur, we queue them up in the command queue and apply them in order. If we have trouble keeping up, this queuing pushes us further and further behind. We could be more intelligent about this: for liveness updates, we would generally prefer LIFO rather than FIFO behavior in the "queue".

a-robinson commented 6 years ago

Yeah, that's a good point. I'll add it to the list.

However, I think it would only help us after nodes had already started having heartbeats time out and losing their liveness, since otherwise each node just sends one heartbeat at a time.

bdarnell commented 6 years ago

Reduce GC threshold for node liveness range

Seems like an easy, but very small, win. Unlikely to fix much by itself.

GC is expensive too, which has been known to exacerbate problems here. A better solution would be to migrate liveness to non-versioned keys. This would enable various rocksdb optimizations when keys are overwritten (as opposed to new keys written and old ones deleted). The reason we have versioned liveness keys is just so we can (ab)use the commit trigger mechanism; in order to move to non-versioned keys we'd need to introduce some sort of trigger that can be used for normal KV ops. The migration will probably be the hardest part of making this change. I think this would be a very big win, though.

Move node liveness in-memory

Would have the broadest benefits, but we’d need to be very careful, since node liveness is intimately tied to correctness. I’ve spent some time thinking about this but haven’t been able to convince myself of a way to make it safe. We may be able to do something that allows heartbeats to work in memory and only put epoch increments to disk, but the liveness range would need a lot more special handling. Details TBD.

Even if we still need heartbeats to go to disk (we need something to go to disk, so that a new lease holder of the liveness range doesn't allow epoch increments that contradict a heartbeat acked by the previous lease holder), we could use batching/buffering of heartbeats. Heartbeats are never latency-sensitive (as long as they complete within the expiration time, which is relatively high), so the owner of the liveness range could run a new heartbeat service that would sit above raft and collect heartbeats to write as a batch.

tbg commented 6 years ago

There are never any intents on the liveness range, so the only GC foe is removing old versions, which I believe is now optimized enough to not present a problem (given the default GC settings) on that range.

At the end of the day, the biggest culprit seems to be disk i/o or more specifically large sync latencies, right? I don't feel we (or at least I) understand this well enough. @petermattis has made some progress in understanding these large latencies as long as contending processes write to the file system (which is now the case since we have multiple RocksDB instances thanks to DistSQL, and also sstable ingestion which writes directly to disk), but that likely doesn't explain everything. Do we feel we understand well enough why on a decent GCE disk an import still causes liveness failures (unless sufficiently i/o restricted?)

It'll be near impossible to have things working smoothly in the presence of 40s+ commit latencies, so while addressing "compound errors" as things go wrong seems important, it doesn't seem that we can make things "work" unless we also increase the liveness duration well above the sync latency.

a-robinson commented 6 years ago

Yeah, the disk overload issue is what I'm currently experimenting with. Pulling out the CSV sst-writing code and experimenting with it in isolation has already revealed some pretty pathological behavior (seconds of blocking by RocksDB) even when it's the only thing running on a machine.

petermattis commented 6 years ago

Does the sstable writing that occurs outside of RocksDB "trickle" fsyncs? RocksDB contains a bytes_per_sync option which we currently have disabled. The idea is to limit the amount of dirty data. Probably worthwhile to experiment with enabling this and adding something similar to all of our sstable writing.

a-robinson commented 6 years ago

I can look at that as well, but I'm not even getting to that part yet. Just pushing lots of Puts into RocksDB really fast is enough to get multi-second stalls.

a-robinson commented 6 years ago

The low_pri Rocksdb write option from the 5.6 release is very promising. Without it, doing Puts in a tight loop on a non-local GCE SSD regularly gets blocked for seconds, often for more than 10 seconds. After enabling it, a 3 minute run only got blocked for 1 second once, with no other complete gaps. It hurts throughput noticeably, but that's for the best if maxing out throughput is killing latencies.

To provide more detail, I'm running this experimental code, which mimics the writing done by the import code. Here's the output from running a couple different configurations for 3 minutes:

disableWal=true,low_pri=false on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-57.txt

disableWal=true,low_pri=true on top of https://github.com/cockroachdb/rocksdb/tree/crl-release-5.7.4: batch-16-disablewal-lowpri.txt

I still need to test how it'll effect a second RocksDB instance given that our temp store is separate from our main store, but it's looking promising enough to be worth bumping RocksDB and using for temporary store writes.

tbg commented 6 years ago

The low_pri Rocksdb write option from the 5.6 release is very promising. Without it, doing Puts in a tight loop on a non-local GCE SSD regularly gets blocked for seconds, often for more than 10 seconds.

Wow, really? That strikes me as completely unexpected. I'm not familiar with the RocksDBMultiMap used, but essentially this is just doing batched writes in a single goroutine? Does this all reproduce if you go really basic and just use a rocksdb.Batch()? I'd assume yes.

I also assume this is specific to the cloud VMs and you don't see it locally?

The low_pri option looks promising, but I'm curious to understand better why it's so bad when you don't use it in the first place.

petermattis commented 6 years ago

To provide more detail, I'm running this experimental code, which mimics the writing done by the import code.

Do you see any problems running on your laptop? Or does this only occur on GCE local SSD? Also, I leave having a small reproducible case of badness like this. We'll track down what is going on here very soon.

a-robinson commented 6 years ago

I don't see the same issues on my laptop. The performance takes some extended dips to half its original throughput, but never all the way down to 0.

Also, I tried experimenting with using your debug synctest (https://github.com/cockroachdb/cockroach/pull/19618) command at the same time as my disk contention command (on GCE), and while setting low_pri on my code helps a little bit compared to not doing so, it's far from preventing long periods of zero sync throughput.

petermattis commented 6 years ago

Have you set the vm.dirty_* sysctls with your diskcontention test?

petermattis commented 6 years ago

@a-robinson What GCE machine type are you using? I just tried to reproduce the badness you're seeing using an n1-highcpu-8 machine and saw a few dips, but no periods of 0 throughput.

a-robinson commented 6 years ago

Have you set the vm.dirty_* sysctls with your diskcontention test?

It looks better than with the defaults, but still has sizable stalls (with the longest being 11 seconds in my few minutes of testing).

a-robinson commented 6 years ago

@a-robinson What GCE machine type are you using? I just tried to reproduce the badness you're seeing using an n1-highcpu-8 machine and saw a few dips, but no periods of 0 throughput.

My gceworker -- i.e. 24 vCPUs, 32 GB memory, 100GB non-local SSD.

petermattis commented 6 years ago

My gceworker -- i.e. 24 vCPUs, 32 GB memory, 100GB non-local SSD.

Ah, I've seen problems with throttling when using Persistent SSD. Never investigated where they are coming from. I recall reading that Persistent SSD throttles based on the size of the disk. Can you test on Local SSD and verify you're not seeing a problem?

a-robinson commented 6 years ago

Yeah, using local SSD avoids the extended zero QPS periods. It's pretty frustrating that the throttling is so bad given how inconvenient local SSDs are to use and how things like our Kubernetes config completely rely on remote disks. I wonder whether the GCE PD team cares about the effects of the throttling behavior.

petermattis commented 6 years ago

Have the IMPORT tests that have been experiencing problems been using PD SSD or Local SSD?

I wonder whether the GCE PD team cares about the effects of the throttling behavior.

Doesn't hurt to file a bug. Especially if we can narrow down the workload that exhibits badness.

a-robinson commented 6 years ago

The import tests I was running were all on local SSD.

a-robinson commented 6 years ago

Alright, so I have a simple program that tickles the disk badness even on local SSD. It basically just runs what I had before in parallel with synctest and only prints the sync throughput/latencies, although I slightly tweaked what was there before to even more closely mimic the import sstWriter distsql flow.

The low_pri and disableWAL options don't appear sufficient to help, at least not enough to matter. Setting the vm.dirty_bytes and vm.dirty_background_bytes settings does, keeping latencies bound to no more than a few hundred ms. However, how realistic is it for us to expect users to modify these settings? It seems like quite a stretch.

I'm going to try writing a basic adaptive rate-limiter that limits writes to the temp store in response to the primary store's sync latencies and see how it affects things. Another avenue to explore is just changing the import code to restrict its own parallelism. Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Also, as suggested for a similar situation recently, there's a good chance that https://github.com/cockroachdb/cockroach/pull/19229 would help with node liveness in the case that all nodes' disks are overloaded, since it'd switch us from waiting for the <leader's sync> + <fastest follower's sync> time to waiting for max(<leader's sync>, <fastest follower's sync>) time.

a-robinson commented 6 years ago

Two updates:

The sysctl settings have made for a successful distributed scale-factor-10 IMPORT https://github.com/cockroachdb/cockroach/pull/19436#issuecomment-340597768
It looks like investigating different RocksDB settings may be more fruitful than pure rate limiting (assuming we can't rely on changing the sysctl settings). Even with Puts rate limited to just 16MB/s, after enough time (3 minutes with low_pri/disableWal off, 10 minutes with them on) sync latencies still start to break down. This is almost certainly due to compactions in RocksDB -- I regularly see 150+MB/s of disk writes, with occasional spikes to 220+MB/s.

bdarnell commented 6 years ago

However, how realistic is it for us to expect users to modify these settings? It seems like quite a stretch.

I don't think it's unreasonable to require sysctl or similar tweaks for heavily loaded servers. We already have to advise raising the default file descriptor limit and every database will have a page of settings that may need to be tweaked. Of course it's better if we can find a solution that doesn't require this, but that may not always be possible (especially when we're contending with non-cockroach processes).

Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Yikes! We should definitely stop doing that.

petermattis commented 6 years ago

Spinning up hundreds of rocksdb instances and writing to them all in parallel seems like asking for trouble.

Yikes! We should definitely stop doing that.

Agreed. I wasn't aware we were doing this.

petermattis commented 6 years ago

It looks like investigating different RocksDB settings may be more fruitful than pure rate limiting (assuming we can't rely on changing the sysctl settings). Even with Puts rate limited to just 16MB/s, after enough time (3 minutes with low_pri/disableWal off, 10 minutes with them on) sync latencies still start to break down. This is almost certainly due to compactions in RocksDB -- I regularly see 150+MB/s of disk writes, with occasional spikes to 220+MB/s.

And you see this with bytes_per_sync set to some small-ish value? One thing I've looked at is grep Dirty /proc/meminfo. As long as that value stayed reasonable, small writes went through in a reasonable time frame. If you see the dirty pages increasing during your testing with bytes_per_sync then we need to track down where the dirty bytes are coming from.

a-robinson commented 6 years ago

And you see this with bytes_per_sync set to some small-ish value?

Yeah, I even see it when I use rocksdb's native rate limiting. Dirty bytes (as reported by /proc/meminfo) steadily increase through the hundreds of megabytes before dropping back down and doing it over again. Something's fishy here.

petermattis commented 6 years ago

Something's fishy here.

Perhaps bytes_per_sync isn't being used on some code path within RocksDB. If you provide reproduction instructions, I can help track it down.

petermattis commented 6 years ago

Also, given the badness of sizeof dirty pages on Linux, perhaps we should add a metric for /proc/meminfo:Dirty.

a-robinson commented 6 years ago

Reproduction instructions:

Check out https://github.com/a-robinson/cockroach/tree/diskcontention2
Build: make build BUILDTARGET=./pkg/cmd/diskcontention/
Run the resulting binary without any command line flags from a directory that's on a local SSD

a-robinson commented 6 years ago

To sanity check that it wasn't an issue with my mount flags or VM kernel settings, I ran this simple program to make sure fsync is doing what I expect. It is doing what I expect on all three GCE disk types (pd-standard, pd-ssd, local-ssd) -- before calling fsync, there are a bunch of dirty bytes; after calling fsync, there aren't.

So yeah, rocksdb probably isn't syncing when it's supposed to. I'll narrow this down to a smaller repro case and report it upstream unless you have a hankering to look into it yourself.

a-robinson commented 6 years ago

I've also been confirming along the way that the OPTIONS file rocksdb creates has the desired options in it, so they are being set as I intend.

petermattis commented 6 years ago

From offline discussion: bytes_per_sync affects WritableFileWriter::Flush(). But BlockBasedTableWriter never calls Flush() AFAICT. SstFileWriter::Finish() does call WritableFileWriter::Sync(), but that sync occurs after the entire sst has been written (possibly 128MB).

I think we can add a call to r->file->Flush() to the end of BlockBasedTableWriter::WriteRawBlock().

a-robinson commented 6 years ago

Two pieces of bad news:

Adding calls to r->file->Flush() in BlockBasedTableWriter::WriteRawBlock() does not prevent the growth of dirty bytes into the hundreds of megabytes.
I set sysctl vm.dirty_bytes=$[4 * 1024 * 1024] and sysctl vm.dirty_background_bytes=$[1 * 1024 * 1024] after the above experiment had been running for a while and clearly wasn't working out great. It did create the desired dirty bytes behavior and flattened out the tail latencies of small syncs to disk. However, after a few minutes, the small syncs to disk got blocked for 40 seconds. Throughout this time period, writes to disk were still happening -- the number of dirty bytes was changing every time I checked. This probably means that L0 was full and thus incoming writes were blocked on compactions, which were slow, but that's way longer than I'd ever expect to be blocked.

petermattis commented 6 years ago

Adding calls to r->file->Flush() in BlockBasedTableWriter::WriteRawBlock() does not prevent the growth of dirty bytes into the hundreds of megabytes.

Huh. Either the dirty data must be coming from somewhere else, or WritableFileWriter::Flush isn't doing the right thing. I suppose you can start sprinkling calls to WritableFileWriter::Sync.

petermattis commented 6 years ago

However, after a few minutes, the small syncs to disk got blocked for 40 seconds. Throughout this time period, writes to disk were still happening -- the number of dirty bytes was changing every time I checked. This probably means that L0 was full and thus incoming writes were blocked on compactions, which were slow, but that's way longer than I'd ever expect to be blocked.

There are stats about L0 compaction stalls. This suspicion seems reasonable, but it should be verified.

a-robinson commented 6 years ago

It appears as though the main cause of the large stoppages was our background thread settings, not the dirty bytes. I've changed from calling options.IncreaseParallelism(std::max(db_opts.num_cpu, 2)); to setting options.max_background_jobs = std::max(db_opts.num_cpu, 2); and the stalls appear to have completely gone away, with no periods of 0 syncs and the rocksdb.stall.micros stat value at 0 after more than 25 minutes (my previous attempt without the change started hitting stalls after just 7 minutes).

I got to this point by correlating periods of 0 QPS with very long compactions and very large increases in the rocksdb.stall.micros. There seemed to be a connection between large L5->L6 compactions and the QPS troughs. Compactions are supposed to be done in the background, but they were clearly blocking writes much worse than the periodic Stalling writes because we have 3 immutable memtables (waiting for flush) messages. Also, the rocksdb.stall.micros metric was the only stall metric that was really growing, and it's described as being incremented when the "Writer has to wait for compaction or flush to finish."

It looks like rocksdb has reworked a lot of the background settings given that so many of them are marked as "NOT SUPPORTED ANYMORE", so maybe our configuration worked fine in the past, but it seems pretty bad now. I'll give this a try on a pd-ssd rather than just the local-ssd I've been using, and if that goes well also try it on distributed import.

a-robinson commented 6 years ago

@petermattis This settings change would likely also help us when external processes are using the disk

petermattis commented 6 years ago

Interesting. But if you're not calling IncreaseParallelism, aren't you limiting the number of threads available for background jobs to the default. That in itself might be reducing the background disk activity.

a-robinson commented 6 years ago

Just a quick update: I've been trying to repro the bytes_per_sync issue with a small C++ program, but it appears to work perfectly in all my testing, despite trying to make the settings as similar to what we use as possible. I'll update this thread once I figure out where we're going wrong.

a-robinson commented 6 years ago

Ugh, it's just a build issue. Our build system isn't defining a bunch of rocksdb build flags, with the one that matters in this case being -DROCKSDB_RANGESYNC_PRESENT. If I remove the preprocessor check for ROCKSDB_RANGESYNC_PRESENT from the relevant rocksdb files to just assume the sync_file_range syscall exists, then we get the expected bytes_per_sync behavior. Apparently bytes_per_sync is silently not supported without that build flag.

a-robinson commented 6 years ago

Alright, @benesch has kindly volunteered to fix rocksdb's cmake configuration to include the build flags. Once that's committed and we can pull it in, I'll update our rocksdb::Option settings to what has been shown to work well in my tests.

The improvement is magical -- so far I haven't seen any stalls or latency spikes on local-ssd, and they've been significantly less frequent and severe on pd-ssd.

petermattis commented 6 years ago

Nice found about -DROCKSDB_RANGESYNC_PRESENT.

Once that's committed and we can pull it in, I'll update our rocksdb::Option settings to what has been shown to work well in my tests.

Did you see my earlier question about not calling IncreaseParallelism?

a-robinson commented 6 years ago

I didn't directly notice any problems when I wasn't setting the number of threads, but shortly after your comment I just pulled out the couple relevant lines from IncreaseParallelism to make sure the number of threads wasn't affecting the testing. I haven't done direct comparisons between setting the number of threads and not setting the number of threads -- I've just been planning on leaving them in.

petermattis commented 6 years ago

I haven't done direct comparisons between setting the number of threads and not setting the number of threads -- I've just been planning on leaving them in.

Sounds good to me.

benesch commented 6 years ago

Patch is upstreamed: https://github.com/facebook/rocksdb/pull/3211

Provided it passes upstream CI, I'll open a PR to backport it to our RocksDB fork.

a-robinson commented 6 years ago

To summarize the developments a couple weeks ago, the syncing changes fixed rocksdb stalls on GCE local SSDs, but not on other GCE disk types.

I had hypothesized that the stalls on other disk types were due to throttling behavior in the virtualized disks, but if GCE's monitoring data is to be trusted (screenshot below), the write rate to the disk from the VM is constantly maxed out in my testing (using the same testing approach as in https://github.com/cockroachdb/cockroach/issues/19699#issuecomment-347554908), not going up and down with the number of syncs per second. That would suggest that the problem is in rocksdb or our use of it, potentially in fighting between the temporary bulk-io rocksdb instance and the permanent OLTP instance. Here are rocksdb logs from the two writers that I'll be digging through when I get the chance: pd-ssd-output.txt pd-ssd-log-sync.txt pd-ssd-log-data.txt

Disk I/O monitoring:

tbg commented 6 years ago

@a-robinson could you give an update here? Is there something specific we can make this issue about?

a-robinson commented 6 years ago

When I last touched this issue the status was that we had seriously improved disk contention, but didn't have confidence that it would be enough. The most immediate unfinished business was that the disk contention stress tool from https://github.com/cockroachdb/cockroach/issues/19699#issuecomment-347554908 could still stall for many seconds at a time on non-local SSDs.

Beyond that, though, we can let our experience with 2.0 guide what's to be done. If we've been seeing liveness timeouts in the field, then there are ideas referenced above for how to improve things further. If not, I'd say we can close this after revisiting the disk contention stress tool to see if we can still generate multi-second pauses.

nvanbenschoten commented 5 years ago

Moving this to 2.2. There hasn't been a lot of noise about node liveness lately, so maybe we should just close this instead. Thoughts @a-robinson? Is there more you'd like to explore here?

bdarnell commented 5 years ago

I'd probably close this issue. There are still things that can be done to improve the reliability of the node liveness system (up to and including replacing the whole thing with something like @andy-kimball 's failure detector), but I'm not sure there's any value left in keeping this broad issue open.

benesch commented 5 years ago

I think the bulk IO team (@dt/@mjibson) might feel otherwise. Last I heard it was still fairly easy for node licenses to cause a restore to fail.

I think there’s probably something to be done with prioritizing node liveness batches. Would Andy K’s failure detection not be subject to the same problem where maxing out disk bandwidth causes heartbeats to be missed?

On Fri, Sep 28, 2018 at 9:59 AM Ben Darnell notifications@github.com wrote:

I'd probably close this issue. There are still things that can be done to improve the reliability of the node liveness system (up to and including replacing the whole thing with something like @andy-kimball https://github.com/andy-kimball 's failure detector), but I'm not sure there's any value left in keeping this broad issue open.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cockroachdb/cockroach/issues/19699#issuecomment-425444421, or mute the thread https://github.com/notifications/unsubscribe-auth/AA15IAjxXJWtqxzakA8bjBUvx_bQcQ8Hks5ufis4gaJpZM4QNkeF .

cockroachdb / cockroach

storage: Improve reliability of node liveness #19699