db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp

cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

BSD 3-Clause "New" or "Revised" License

4.9k stars 454 forks source link

db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp #1329

Open sumeerbhola opened 3 years ago

sumeerbhola commented 3 years ago

This issue was originally about the need to reduce tuning knobs in Pebble that govern performance more generally. Over time, some of these have been adjusted, or have had adjustments considered. The one major one that remains is MaxConcurrentCompactions which gets set to min(NumCPU(), 3) which is too few compactions at once than can be handled by many beefy nodes with fast NVMe drives. Pebble, along with Cockroach's admission control, should be able to schedule additional compactions in the presence of greater CPU and disk IO headroom to allow for greater compaction concurrency, without necessitating manual operator intervention to unlock additional performance.

There's a similar case to be made about increasing compaction concurrency even in the presence of heavy foreground write traffic, as it reduces the work necessary to incorporate future foreground write traffic into a well-formed LSM. Even in this case, adaptive increases in compaction concurrency will yield better foreground write performance in the longer run and can be considered even though it might seem instantaneously unintuitive to take away more disk / CPU bandwidth for a background operation.

Original text of the issue follows below the horizontal line.

Pebble has various tuning knobs that affect its CPU and disk IOPS/bandwidth utilization. Examples include L0CompactionConcurrency, CompactionDebtConcurrency, MaxConcurrentCompactions, DeleteRangeFlushDelay, MinDeletionRate. Some of these, like MinDeletionRate and DeleteRangeFlushDelay, affect how fast we reclaim disk space, so are possibly not very interesting in most circumstances -- it is relatively easier to properly provision disk byte capacity.

The compaction ones affect how many concurrent compactions we can run, and being able to run more concurrent compactions would allow Pebble to handle a higher write throughput without having a misshapen LSM. CockroachDB accepts the default value for these knobs except for MaxConcurrentCompactions, which is set to min(numCPU, 3). It is likely that these knob settings are sub-optimal in most settings: (a) the limit of 3 is likely to be too low on large machines, (b) even 3 compactions could be too many on a large machine if there is a spike in read-only user-facing traffic that wants to consume all cpu, or if disk IOPs/bandwidth are saturated due to user-facing traffic.

We have discussed ways to speed compactions in the past by parallelizing them into multiple sub-compactions that partition the key span of the compaction. More recently, there are ideas to parallelize the compression/decompression of ssblocks within a compaction.

These compaction-related ideas have the following need in common: detect the current resource utilization (or overload of a resource), and use that to adjust the amount of concurrent background work, so as to not affect foreground traffic. It is possible that such adjustments would need to be fine-grained and slow down an already started compaction.

Detecting resource utilization or overload for CPU is relatively easier -- the CockroachDB context already has hooks that look at goroutine scheduler state. Disk IOPs/bandwidth is harder: log commit latency is one signal, but it may not always be a good indicator, so we would need to run some overload experiments to understand this better.

Jira issue: PEBBLE-119

bananabrick commented 3 years ago

DeleteRangeFlushDelay and MinDeletionRate have default values of 0, and if we're not resetting those in cockroach(unless I'm missing where these are set), then we're not using the functionality provided by those, currently.

sumeerbhola commented 3 years ago

DeleteRangeFlushDelay and MinDeletionRate have default values of 0, and if we're not resetting those in cockroach(unless I'm missing where these are set), then we're not using the functionality provided by those, currently.

We are setting these here https://github.com/cockroachdb/cockroach/blob/9208567a280d61e0621c154eff6e39ed900f77c0/pkg/storage/pebble.go#L346-L353

sumeerbhola commented 3 years ago

The very high-level idea is to have some kind of centralized scheduler/pacer that is consulted before starting some background work, and at periodic intervals during that work. This could also be invoked for other background work like table stats and table validation. But the hard issues here are in working out the details.

bananabrick commented 3 years ago

I ran some file system level benchmarks to see whether sampling sync latencies could be a way to detect high IO utilization, and if we could rate limit based on that(by throttling writes when utilization is high).

Machine used: aws c5d.4xlarge (the same machine which is used to run our nightly benchmarks).

Methodology I ran each benchmark for 100 seconds, and each benchmark measures the latencies of syncing 1MB to disk while there are go routines running in the background and doing other file system level operations. Each benchmark only differs by the background operation which was being run.

Code used to run the benchmark: https://github.com/cockroachdb/pebble/pull/1344

Results

No background ops.

____optype__elapsed_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync   100.0s       33573            335.7 0.825039 0.819199  0.950271  1.114111  4.456447

A background goroutine which writes 4KB to the file in a loop with no rate limiting.

____optype__elapsed_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync   100.0s         3322             33.2 27.791096 0.884735 48.234495 1207.959551 2147.483647

The latencies increase quite a bit, but this is what they look like when the benchmark is running.

____optype__elapsed__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)__pMax(ms)
write_sync      41s            8.0           30.7 20.971519 2550.136831 2550.136831 2550.136831
write_sync      42s            6.0           30.1 62.914559 260.046847 260.046847 260.046847
write_sync      43s            2.0           29.5 62.914559 1342.177279 1342.177279 1342.177279
write_sync      44s            0.0           28.8 0.000000 0.000000 0.000000 0.000000
write_sync      45s            0.0           28.2 0.000000 0.000000 0.000000 0.000000
write_sync      46s            7.0           27.7 41.943039 2818.572287 2818.572287 2818.572287
write_sync      47s            4.0           27.2 71.303167 318.767103 318.767103 318.767103
write_sync      48s            1.0           26.7 1409.286143 1409.286143 1409.286143 1409.286143
write_sync      49s            0.0           26.1 0.000000 0.000000 0.000000 0.000000
write_sync      50s            1.0           25.6 1275.068415 1275.068415 1275.068415 1275.068415
write_sync      51s            6.0           25.2 54.525951 1409.286143 1409.286143 1409.286143
write_sync      52s            1.0           24.8 201.326591 201.326591 201.326591 201.326591

The latencies vary quite a bit, but they remain high.

Any thoughts on other benchmarks I could run to determine if sync latencies are a good metric to use? I'm thinking I could use other kind of background ops to determine if sync latencies are always increasing.

I'm also going to look into pebble's batch commit latencies to see if that's a good metric to use.

itsbilal commented 3 years ago

That sounds like an interesting experiment, thanks for running it! I would expect batch commit latencies to be an even better metric to measure user-observed write latencies for two reasons:

1) it's directly along the write hot path and gives us a good measure of everything in the commit pipeline, and 2) It measures both disk writes and syncs, and ends up being less FS-dependent

It's trickier to decide what to do when write latencies go up. Do we want compaction concurrency to go down to temporarily allow write-amp to go down, while also having a second mechanism to dial it up again if L0 read-amp goes unreasonably high? Or is admission control going to reduce the need for that second mechanism?

bananabrick commented 3 years ago

Thanks for the response, since it's forcing me to think about what exactly my plan is here.

My worry with the batch commit latency is that it won't reflect actual IO utilization which is my goal. It might, if I normalize the batch commit latency by the batch size, but I haven't looked at the code here yet, so there might be more complexities with using batch commit latencies.

It's trickier to decide what to do when write latencies go up. Do we want compaction concurrency to go down to temporarily allow write-amp to go down, while also having a second mechanism to dial it up again if L0 read-amp goes unreasonably high?

I was thinking this would be a bit more generic, and so I haven't thought about compactions specifically. And my plan was to have confidence that any metric we use actually reflects our IO utilization, and not look at metrics like read amp or write amps.

Here's my idea which is high level, and I've yet to explore this.

Let's assume that we have some latency metric L which accurately reflects IO utilization. So, let's assume that when L increases we are doing more IO, and when L is stable or small we have more available IO.

Now we will sample L, and keep a moving average E, a deviation metric S, both more heavily weighted towards more recent values of L.

Let's say the rate limiter has some variable bytesPerSec, which determines the write throughput, and that bytesPerSec has min/max values.

We'll also have some max threshold for L called T which will be dependent on both E and S. For example, it could be T = E + 10 * S(average + 10 * deviation). Now, we can have an algorithm which increases bytesPerSec, if L is within T, and decreases bytesPerSec, if L is greater than T.

For compactions specifically, in this approach, we won't limit compactions at all, if we have bytesPerSec available. One problem I can think of is a scenario where there's so many writes happening that L becomes high, and we therefore throttle compactions/flushes, which causes the memtables to fill up quickly. I don't know if that's a problem or not. Do we want to throttle compactions/flushes if there are many writes happening, but there's no IO available?

itsbilal commented 3 years ago

We both ask a similar question at the end about what to actually throttle when the problematic metric worsens, and I think the story there is going to be incomplete without admission control. We can do (or at least explore) two different types of throttles:

1) Throttle down memtable writes when write IO latency is high beyond a high-watermark amount, to let compactions catch up. 2) Throttle up concurrent compactions when write IO latency is low below a low-watermark amount (suggesting additional available resources), and retain existing checks to increase it when L0 read-amp is high.

You could change the code so the first compaction with no other compactions running is always scheduled, bypassing the pacer(s) completely. We could pace subsequent compactions - and that pacing can be mathematically determined by 1) how high the IO latency is, and 2) how low the L0 read-amp is. So high write-latency, low L0 read-amp gets throttled the most, low write-latency and high L0 read-amp gets the most concurrent compactions, and somewhere in between gets a moderate amount of pacing.

(Side note: the current pacer(s) only look at compaction debt, not L0 read-amp, so this might be a quick change to try out if you're experimenting with pacers).

The general idea is we're trying to compact a bit more aggressively when we have the resources to do so and can easily spend that cost without affecting user writes. And we want ability to let read-amp fall behind to favour user writes when those are affected.

As for the choice of the metric, my concern was that syncs on files had the possibility of only measuring part of what we actually want to reduce, which is "are writes having to wait due to starved out IO". If a theoretical batch commit metric goes up because the average batch has to wait longer to apply due to batches ahead of it, that is an even earlier indicator of us wanting to become more write-friendly than actually letting IO starve out and then affect syncs or something.

All of the above is rough because I do expect it to be tuned by / guided by experimentation, so the experimentation you're doing is very valuable in helping make that decision. If fsync latency ends up being a solid indicator, I'm happy to go with it too.

itsbilal commented 3 years ago

On further thought: we probably never want to throttle writes to memtables (as that’ll just get more KV operations to queue up in memory and increases chances of OOM), and we could probably have flushes remain unpaced; there can only be one flush at a time anyway.

The better place for control of incoming write traffic is going to be admission control. Other than compaction concurrency and compaction pacing I’m not sure if we have control over that many throughput valves in Pebble.

The other thing that I’m trying to stay away from, is having variables (whether automatic or manually set) around provisioned IOPS or throughput in MB/s. We might see noise from other stores, other processes, other “noisy neighbours” which we see from time to time when benchmarking on AWS, and we need to be able to react to that instead of trying to use up a theoretical target throughput.

bananabrick commented 3 years ago

Thanks for the feedback! My idea was only at a high level, so it will be difficult to give concrete answers without further experimentation.

We both ask a similar question at the end about what to actually throttle when the problematic metric worsens, and I think the story there is going to be incomplete without admission control.

I want to throttle or increase bytesPerSecond according to L. Whatever mechanism we want to throttle will make calls into the rate limiter. It could be either flushes, or compactions, or both, or even other pathways which issue writes to the disk. We can have some notion of priority to make sure that compactions/flushes get higher priority, if we want.

I'm not trying to auto-tune the compaction concurrency, although, maybe that's something we should consider.

Throttle down memtable writes when write IO latency is high beyond a high-watermark amount, to let compactions catch up.

I don't think we should be throttling foreground traffic using this rate limiter.

2.Throttle up concurrent compactions when write IO latency is low below a low-watermark amount (suggesting additional available resources), and retain existing checks to increase it when L0 read-amp is high.

I was thinking of making this rate limiter centre around the bytesPerSecond. I'm not sure if concurrent compactions are relevant. If we write both a cpu/io based rate limiter, then we can potentially get rid of the concurrent compactions variable.

We could pace subsequent compactions - and that pacing can be mathematically determined by 1) how high the IO latency is, and 2) how low the L0 read-amp is.

My understanding is that low L0 read-amp shouldn't limit compaction throughput. It should only be limited by not having enough IO resources available.

The general idea is we're trying to compact a bit more aggressively when we have the resources to do so and can easily spend that cost without affecting user writes. And we want ability to let read-amp fall behind to favour user writes when those are affected.

This should happen naturally using the mechanism I described. If we don't have IO resources, then background write throughput will be limited. I'm guessing eventually, we'll stall user writes as the memtables fill up, which will free up some IO resources.

As for the choice of the metric, my concern was that syncs on files had the possibility of only measuring part of what we actually want to reduce, which is "are writes having to wait due to starved out IO". If a theoretical batch commit metric goes up because the average batch has to wait longer to apply due to batches ahead of it, that is an even earlier indicator of us wanting to become more write-friendly than actually letting IO starve out and then affect syncs or something.

I agree that file sync latency isn't guaranteed to work, but I'll be doing more benchmarking for that. I haven't looked at the batch commit code/that entire pipeline, but to me, batch commit latency depends on pebble level bottlenecks too, and it seems like we aren't directly measuring the filesystem/ssd level bottlenecks, which is what we want.

I'll have to read more about our batch commit mechanisms. I'm not against the idea at all. It'll probably be simpler to use that if it works anyway.

The other thing that I’m trying to stay away from, is having variables (whether automatic or manually set) around provisioned IOPS or throughput in MB/s. We might see noise from other stores, other processes, other “noisy neighbours” which we see from time to time when benchmarking on AWS, and we need to be able to react to that instead of trying to use up a theoretical target throughput.

If you're referring to the bytesPerSecond variable, that's auto tuned according to the latencies, so I wouldn't say it's a theoretical limit. It'll be as high as it can be, without causing increased latencies. I think rocks db also has a similar variable in their rate limiter.

Other stores isn't an issue, because the rate limiter will be shared amongst all the stores, just like the block/table caches. Even if it wasn't, I think bytesPerSecond, should still just be what the maximum throughput is, without causing increased latencies.

itsbilal commented 3 years ago

The principle that Sumeer mentioned in the standup and something that I'm also thinking along the lines of, is twofold:

1) Leave existing compaction processes largely untouched and unthrottled, like they already are, with the exception of concurrent compactions beyond the first one (more on this later). Leaving flushes and memtables/WAL writes unthrottled is also a good idea, as that's closest to the write-hot path observed by the user. 2) Schedule additional compactions in a measured amount, when necessary (high L0 read-amp, as we already do by increasing compaction concurrency), OR when feasible (writes far from saturation). Any additional compactions beyond the first one could have the pacer set.

The definition of what "writes far from saturation" is, is something that's going to come from your experimentation, and what you're describing around bytesPerSecond and L will likely work once we validate it with experiments.

The reason why I keep bringing up compaction concurrency is because it's the most direct (and maybe the only way?) in actually increasing compaction throughput; do more of them at the same time! The change that Peter made to reduce compaction concurrency to below MaxConcurrentCompactions showed a significant improvement in write-heavy workloads, because concurrent compactions can steal away write IOPS from WAL writes and flushes (which we don't want to slow down for aforementioned reasons). So we probably want to reduce compaction concurrency when we see user writes slowing down (using what metric and thresholds you find most helpful), and increase it again if compactions slowed down enough (this mechanism already exists, it looks at L0 sublevel count).

I know it seems unintuitive at first that we'd actually slow compactions when we observe more user writes, but we're just lowering write-amp by not duplicating too much effort.

My understanding is that low L0 read-amp shouldn't limit compaction throughput. It should only be limited by not having enough IO resources available.

I see what you mean, but I worry it might be too reactive. If we don't look at L0 read-amp and just schedule the max number of concurrent compactions always, and then we observe writes slow down, our only option is to pace down existing compactions, which (if there are a lot of them) are going to hold up new compactions as you can't conflict with running compactions. The advantage of reacting to high user writes as early as possible is that we can sidestep this conflict. Letting one compaction run at full speed and not start new ones is cleaner, more CPU/IO efficient, and blocks future compactions less.

Other stores isn't an issue, because the rate limiter will be shared amongst all the stores, just like the block/table caches.

Sharing the pacer(s) is probably not a good idea. The stores could all be on separate disks, or they could not. Each store should figure out its own pacing. We can share config variables, but we don't want a hot store to starve out writes from a cold store.

An additional thread of thought that we haven't brought up yet: measuring just how much we actually need to worry about reads. Can we look into iterator counts / stats to see how aggressively we need to react to high read-amp? This might be a whole new pandora's box so we can split that into a separate conversation / issue.

bananabrick commented 3 years ago

I did some more benchmarking of ycsb/F/1024 while observing the IO latencies and throughputs.

1. ycsb/F/1024 with 10 concurrent workers.

For some reason the benchmark starts off doing lots of writes with high w-amps, so we see the throughput of the ssd being saturated, and the latencies spiking up to 100s of ms, but later the throughput decreases to about 50% of total capacity, and latencies become sub millisecond.

For the commit pipeline commit latency I was sampling, it was initially between 1-30 ms, but later becomes sub 5 ms.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

2. ycsb/F/1024 with 25 concurrent workers.

The benchmark starts doing lots of writes and with high w-amps, so we see the ssd being saturated, and latencies spiking up again, but later the throughput decreases to ~80% of max throughput, and latencies shown through the gce console become lower.

After the initial increase in commit latencies, they consistently stay in the 0-5ms range.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

3. ycsb/F/1024 with 100 concurrent workers.

With a 100 writers, we see the same initial increase in latency, and we see the throughput hitting the max throughput of the ssd, but the IO latencies are in the 1-4 ms range.

Since the ssd throughput is getting saturated, the commit pipeline commit latencies, fluctuate between 0-30 ms, pretty randomly.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

4. ycsb/F/1024 with 300 concurrent workers.

With 300 writers, we see the same initial increase in latency, and we see the throughput hitting the max throughput of the ssd, but the IO latencies are still in the 1-4 ms range.

Since the ssd throughput is getting saturated, the commit pipeline commit latencies still fluctuate between 0-30+ ms, pretty randomly.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

Note that the initial spike in latencies seem to correspond with a higher KB/IO_op in the graph. I'll also be plotting the commit latencies with time/concurrent workers, to see if there's any other patterns there.

bananabrick commented 3 years ago

Turns out that commit latencies just spike up when there's a lot of writes happening in the system, but they don't accurately reflect any ssd level bottlenecks. Maybe there's some variables there we can eliminate?

ycsb/F/1024 with 300 concurrent workers, but with an ssd which can handle the load.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

There's no IO level bottlenecks here, as seen from the throughput numbers, and the fsync latencies, which are in the 0-2ms range. But the commit latencies are massive. Which makes me think that there are other bottlenecks to batch commit latencies.

The fluctuation in the batch commit latencies can be explained by the constant write stalls, I think.

itsbilal commented 3 years ago

Your last question can likely be answered by time spent in the pending queue, if a lot of writes are happening but the ssd can handle them, I can see the synchronization in the commit pipeline slowing things down more than actual disk writes / syncs. If you want to leave the mechanism of the commit pipeline behind and just measure the write performance, you could try measuring latencies of the actual WAL write (not sync) that happens in prepare - that might yield some interesting results. You could also normalize it by bytes being written on that Write call, which should be easy given the buffer is right there.

The other question is whether a "too many writes are happening" signal is useful on its own too, and I'm not sure if it is - we're more worried about resource starvation. But maybe someone else has other thoughts.

bananabrick commented 3 years ago

@itsbilal thanks, I'll look into those aspects.

I ran ycsb/F/1024 with 300 concurrent workers again, after removing the pebble level write stall bottleneck(by increasing compaction concurrency), to see what the latencies look like.

pebble ops/sec throughput

pebble fsync latencies fsync

pebble batch commit latencies commit

I don't think commit pipeline latencies are usable, but the specific write/sync latencies in the pipeline should be usable. \

bananabrick commented 3 years ago

Summarizing the direction this is headed, and also summarizing the discussion in this doc: https://docs.google.com/document/d/1wYqLFBqfqX18FzC1f4rEkK-VDtB1ggNQBdeQwrCLrI4/edit.

The high level idea is that we want to use low IO latencies along the pebble write paths to determine if it's okay to increase the compaction concurrency. So, we'll be auto tuning compaction concurrency based on write path IO latencies.

The graphs/benchmarks in this issue were used to observe IO latencies under various ssd loads.

Some of the concerns are:

sync/write latencies depend on the size of the data being written.
- not sure what the solution is here, but in the worse case, we just won't increase compactions, if sync latencies are too high.
If we hardcode any latency thresholds, then they won't work perfectly across different ssds with different latencies.
- They'll still work across faster ssds, and we just won't auto tune for slower devices.
We want to make sure that increasing compactions/doing more IO won't slow down read performance.
- Need to run experiments for this.
What write/read latencies are "good enough" for pebble, where "good enough" latencies won't bottleneck the rest of pebble.

bananabrick commented 3 years ago

Posting some results with 64 byte values.

ycsb/F/64 with 10 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

ycsb/F/64 with 25 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

ycsb/F/64 with 50 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

ycsb/F/64 with 100 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

ycsb/F/64 with 300 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

ycsb/F/64 with 500 concurrent workers, on a single local ssd.

pebble ops/sec throughput

pebble fsync latencies fsync

bananabrick commented 3 years ago

I thought a bit about the ways to get past the issue of sync latencies depending on the size of the data being synced, and different ssds/file systems having different performance characteristics. This problem needs to be solved, as it prevents auto tuning from working on a majority of the machines.

Proposal: Assuming that there is only one cockroach process running on a single machine, cockroach will call into a pebble function preOpen, before calling Open even once. The preOpen function will issue some commands to the file system and sample latencies, and return a struct S which contains the latency information. The struct S, will then be passed in as a pebble option, when Open is called, to any Pebble engine which is created from cockroach.

The struct S will contain latency information for the hardware under low(Is this guaranteed?) resource utilization, because preOpen samples latencies before any Pebble engine is created.

preOpen will sample sync latencies for different data sizes. For example, we can sample latencies for sizes of 10KB, 100KB, 250KB, 500KB, 1MB. This will give us a baseline of how fast the machine is, and will also give us a baseline for sync latencies at different data sizes.

The tradeoff here is that, preOpen function will take some time t to run, which slows down the startup time of cockroach by t. We could also call preOpen in parallel, with the rest of the cockroach startup. What values of t are acceptable? I think t=10 seconds could work, but I haven't tested this. Maybe even t = 1 second of sampling will work, but I haven't tested it, yet.

sumeerbhola commented 3 years ago

Regarding auto tuning, I agree this is a hard problem. And I don't necessarily think we need auto tuning for the first version. But if we do decide to do something here, here is a rough summary of my current thinking on various signal options (including latency) and how suitable they are for auto tuning.

WALMinSyncInterval set to 0, but could be increased by a CockroachDB cluster setting. Want a signal that is robust to different settings of this, under the assumption that the frequency of altering this setting in a running system is extremely low.
Assume the workload in terms of the size of batches and the frequency of batches requesting sync changes slowly over time, and we don't need to worry about it when considering signal values within the last 1min.
Signal options:
- Measure latency of write+sync in LogWriter.flushPending. We need to accumulate these latency values across many samples. Difficulties:
- Samples may vary in size. If queue is building up due to slowness the byte size of these samples will also increase, so they are not simply a function of the user workload.
- Some samples sync and some don't, which affects the latency.
- Write throughput measured over time interval T in LogWriter.flushLoop by only timing the parts where work is being done. Ignore whether work is syncing or not. Difficulties:
- Write throughput may not immediately decline due to sync latency increase since the LogWriter can increase the batch and amortize the sync over larger number of bytes.
What is the baseline "low load" value for the signal? Difficulties:
- Noisy neighbor: don't know when it is present, so don't know the baseline.
- Don't know if reducing compaction load will actually improve the signal. Maybe it is already as good as it can be because something has changed in the hardware etc.
Can we expect users to provide a threshold that represents the boundary of what is acceptable? Seems hard. If their workload is typically running with write+sync of B bytes taking 100ms, and they configure the threshold to 500ms, then we can decrease their commit throughput by 5x without realizing we have done something wrong. Usually users are not running close to the cliff in terms of no idle time in the LogWriter, but users typically don't know if they are, until something goes bad.
Another alternative would be measure work time/total time in the LogWriter.flushLoop. This would represent "utilization". If utilization were say < 80% we could increment compaction concurrency. We could measure utilization over 5s intervals and do exponential smoothing, so we don't react too quickly to a short write burst. This has the benefit that we are (a) directly measuring whether user writes are suffering, and (b) don't need a baseline low load value. Deficiencies:
- Doesn't tell us whether reducing compactions will actually make these user writes faster. It is possible that the answer is no and the hardware provisioning is good enough that we shoud run with more compactions to absorb these writes.
- Utilization can be equally high if we are writing and syncing 10 bytes in every loop, versus writing and syncing 100 bytes in every loop. That is, the fixed overhead is not well accounted for.
We should consider measuring this utilization metric with some experiments with moderate to high load and see whether it is worth pursuing. I think the deficiencies could be addressed by using latency or throughput as a secondary signal (and can be addressed in future improvements).

sumeerbhola commented 3 years ago

measure work time/total time in the LogWriter.flushLoop. This would represent "utilization".

We can scale this time utilization, computed over an interval T, by the mean value of bytes written by flushPending/(blockSize * 16) which is the buffer utilization. This overall utilization number should not suffer from the second deficiency I mentioned above.

sumeerbhola commented 3 years ago

The remaining weakness with this WAL utilization signal is:

Doesn't tell us whether reducing compactions will actually make these user writes faster. It is possible that the answer is no and the hardware provisioning is good enough that we should run with more compactions to absorb these writes.

I don't think this will actually be a problem -- here is a cleaner version of a claim from a side conversation with @bananabrick (and look at this with a critical eye, since there is a likelihood of error in reasoning):

Consider a scenario where the WAL utilization is high. The concern is that this could be high even if there is no resource bottleneck, because we have single-threaded WAL writing, and that we will needlessly not increase compaction concurrency (from our baseline of 3) just when we need that extra compaction concurrency.
For the WAL utilization to be high, the buffer utilization needs to be high. So we will be writing a lower bound of blockSize*16 for each sync. So at least 512KB per sync.
Consider a write amplification of W. For every byte written to the WAL we need to compact W to keep up.
Throughput of a single threaded compaction vs single threaded WAL write (LogWriter.flushLoop):
- CPU: The cpu per byte written will be lower for the WAL write.
- IO: The main difference is the syncs. Compactions also do sync_file_range every 512KB (Options.BytesPerSync), so the sync costs may not differ much.
Let R be the ratio of WAL byte throughput to a single compaction write throughput, when there is no resource bottleneck (and so syncs are not slower than usual).
For compactions to keep up with the WAL at high WAL utilization, we need R*W concurrent compactions. Being pessimistic about WAL throughput and optimistic about write amplification, say W=20 and R=0.5. That is 10 concurrent compactions. If high WAL utilization prevents the auto tuned compaction concurrency to not exceed 10, even though we are not resource bottlenecked, I would consider that a big win over our current cap of 3.

nicktrav commented 3 years ago

I did a re-read through of the conversation so far, and adding my thoughts on the implementation and the operational implications of our work here.

Regarding auto tuning, I agree this is a hard problem.

Agreed. I even wonder whether this is worthwhile pursuing at the Pebble layer at all.

My own experience elsewhere with systems that auto-tune is that the operator will likely end up in a situation where they are "fighting" against the "intelligent system" that is trying to shift performance in one direction, but the operator wants it to do something different. There's inevitably a bug in the control loop, and debugging such a thing in the midst of an outage is painful.

Add in the fact that we have admission control at the Cockroach layer. Adding a layer of auto-tuning below this feels like we might be multiplying the complexity of the system. There's some irony here in that this issue is about reducing complexity by decreasing the number of "knobs". Instead we're keeping the knobs and letting some code fiddle with them in real-time.

A thought / suggestion: can we expose an API that allows the caller to a) query Pebble's observed view of its own performance (various IO-related metrics, and eventually CPU-bound metrics for compute heavy tasks), and b) alter the various knobs. For example, alter compaction concurrency, WAL sync interval, etc. My understanding is that many of these params can be set, but it's not clear to me whether altering them at runtime has any effect, or would be safe to do.

Presumably we'd then bake in the "sentience" into admission control in Cockroach to call into this Pebble API and twiddle the knobs from there. An operator could also just completely opt-out of admission control, and opt to tune the knobs directly. Though presumably, this would be for very advanced and knowledgeable operators.

Separately, I was also thinking about whether it would be possible to directly query the operating system for the performance of the filesystem (and other things). On Linux, there's eBPF which gives you a means of hooking into various structures in kernel-land. This is like a looking glass, through which we'd be able to see at a very low level what is going on.

In exploring this idea, I think I convinced myself that this would be unsavory (and probably dangerous), for a few reasons - it's not portable (even if not using eBPF, we'd need multiple system-specific ways of querying for these metrics); it's a leaky abstraction (the metrics are system-wide and not specific to Pebble - e.g. should Pebble slow down if another process is eating up IO bandwidth?); it's wildly complex.

With that said, it seems like we're already agreed in the approach of using Pebble-specific metrics to expose and gauge how it is performing. I'll leave the above as just food for thought.

tldr: I'm not sold on auto-tuning in Pebble. Can we instead expose an API for dynamic control? Using metrics specific to Pebble seems like a decent approach of inferring Pebble's own performance / progress, rather than using raw FS / OS metrics.

sumeerbhola commented 2 years ago

My own experience elsewhere with systems that auto-tune is that the operator will likely end up in a situation where they are "fighting" against the "intelligent system" that is trying to shift performance in one direction, but the operator wants it to do something different.

Agreed that this happens, though rarely. Adding a temporary override is an easy way to fix it. But without auto-tuning in my experience one is worse of.

And exposing an "API for dynamic control" externalizes the complexity to a piece of code who may not have the requisite signals to make the right decision, and will need to understand the various control points in Pebble. For instance, the "WAL utilization" signal (which I am very hopeful about) mentioned in https://github.com/cockroachdb/pebble/issues/1329#issuecomment-964463877 has to be gathered inside Pebble.

btw, I don't think this will interfere with admission control. The background work tuning will not try to come anywhere close to 100% utilization, while admission control is trying to do queueing and relative prioritization of work when utilization is close to 100%.

itsbilal commented 2 years ago

Ah, this conversation has progressed quite a bit since I last engaged in it.

One additional thing on the "too much automation can fight against the operator" worry, while I agree with it as a general concern, we've narrowly scoped this project to just be the divvying up of additional IO resources when we know it'd likely be a good idea and a safe idea to use those. In particular, our focus seems to be on tuning compaction concurrency. We will still fall back to the same baseline as before under any sort of IO-constrained situation, so that still gets us a net win of finishing this project.

All the additional knobs we're talking about just control how aggressive or conservative we are with assuming we have excess resources available, but our baseline should still be "good enough" to get us out of any hairy resource-constrained situation.

It's also generally true that we have very few user-documented or user-visible knobs in Pebble to start, so whether we add more or remove more doesn't make too big a difference. If anything, I interpreted the higher level goal of this project to be "have Pebble make better decisions on its own, and tune Pebble's existing knobs better by default so they don't leave performance on the table".

sumeerbhola commented 2 years ago

(more speculative: came up in https://cockroachlabs.slack.com/archives/C01SRKWGHG8/p1664807892649489) Read-driven compactions can also be more aggressive if there is spare CPU and disk bandwidth. In that discussion, there was not much spare CPU, since CPU was being consumed in reads that could be made more efficient. Doing more aggressive read-driven compactions since it will reduce cpu utilization is a trickier proposition, since it requires measuring resource (CPU, disk) usage in reading various parts of the keyspace. But since this resource usage is inside Pebble, it is something we can do, given the golang runtime instrumentation we now have to measure per-goroutine running time.

jbowens commented 2 years ago

Doing more aggressive read-driven compactions since it will reduce cpu utilization is a trickier proposition, since it requires measuring resource (CPU, disk) usage in reading various parts of the keyspace. But since this resource usage is inside Pebble, it is something we can do, given the golang runtime instrumentation we now have to measure per-goroutine running time.

This is also a more generalized version of #918. In the escalation cockroachlabs/support#1830, it looks like we might need better heuristics around compacting point tombstones that slow reads.

jbowens commented 1 year ago

These compaction-related ideas have the following need in common: detect the current resource utilization (or overload of a resource), and use that to adjust the amount of concurrent background work, so as to not affect foreground traffic.

In the taobench import (internal link), we saw one node incurring higher read-amp and as a result, higher CPU utilization for the same foreground traffic. Do we need to adapt compaction concurrency even at the expense of foreground traffic, for the benefit of future foreground traffic? It seems like a really hard problem to get right, but it feels like compaction concurrency should not only be increased when there are spare resources, but also regardless when there's sufficiently large compaction debt.