facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org

Feature request: rate limit compaction triggered by periodic compaction seconds/ ttl only #12536

Closed: zaidoon1 closed this issue 2 weeks ago

zaidoon1 commented 5 months ago

My DB is small, but I have a significant number of deletes, so I set a DB TTL / periodic compaction seconds to make sure tombstones are cleared every few hours. However, this caused large CPU usage spikes, as reported in https://github.com/facebook/rocksdb/issues/12220 . I then rate limited compactions, which solved that issue. HOWEVER, I had thought write stalls could only be caused by slow flushes, which is the main reason I wanted to rate limit compaction but not flushes. It turns out my understanding was incorrect: we do in fact stall if compaction is slow:

Screenshot 2024-04-15 at 12 34 02 AM

Given this information, what I would like to do instead is rate limit only the compactions triggered by DB TTL / periodic compaction seconds, since those are mainly cleanup operations that don't need to happen immediately, while making sure the compactions RocksDB needs to "work fast" are not rate limited, to avoid stalls.

Note that I'm using RocksDB from Rust, so I'm relying on the C APIs to control RocksDB behaviour.

The alternative right now is to tune the rate limit so that I don't impact RocksDB write operations while also making sure CPU doesn't spike significantly when the periodic compaction seconds / DB TTL compactions run, which is trickier to balance.
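For reference, the rate limiting I'm doing today is roughly this (a sketch using the binding's set_ratelimiter, with args bytes/sec, refill period in microseconds, fairness; the numbers are illustrative, not my production values):

 use rocksdb::Options;

 let mut opts = Options::default();
 // Limit the write rate of background flushes/compactions:
 // ~16 MiB/s budget, refilled every 100ms, default fairness.
 opts.set_ratelimiter(16 * 1024 * 1024, 100_000, 10);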

zaidoon1 commented 4 months ago

@ajkr what do you think about this? Is there a quick fix that I can implement or will this be more involved?

ajkr commented 4 months ago

What are your settings for compaction style, TTL/periodic seconds, and how is data deleted? I am thinking there might be other ways to help with the compaction spikes, particularly if you're using leveled compaction style and RocksDB's deletion APIs (vs. other mechanisms like a compaction filter to delete data).

zaidoon1 commented 4 months ago

When deleting, I use rocksdb_writebatch_delete_cf; I don't have any compaction filters, etc. TTL is set to 1800 seconds, and compaction style is whatever the default is. Here is my options file (I don't use the default CF, so you can ignore any options related to it):

OPTIONS.txt
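For completeness, the delete path on the Rust side boils down to roughly this (a sketch; the CF name is made up for illustration):

 use rocksdb::{WriteBatch, DB};

 // Deletes go through a write batch, i.e. rocksdb_writebatch_delete_cf in the C API.
 fn delete_keys(db: &DB, keys: &[Vec<u8>]) -> Result<(), rocksdb::Error> {
     let cf = db.cf_handle("my_cf").expect("cf exists"); // hypothetical CF name
     let mut batch = WriteBatch::default();
     for key in keys {
         batch.delete_cf(cf, key);
     }
     db.write(batch)
 }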

ajkr commented 4 months ago

Thanks for the info. I was wondering if you'd be interested in trying compaction_pri = kRoundRobin? The round-robin compaction priority simply picks files within a level by cycling through them in order, whereas the default priority (kMinOverlappingRatio) picks files according to a heuristic that can form hotspots (key ranges from which files are repeatedly picked) and coldspots (key ranges from which files are rarely or never picked).

I suspect kRoundRobin should work better with aggressive TTL settings. That's because round-robin picks the oldest data in the level to compact, saving work for TTL compaction later. In the best case (write rate is high enough that a full cycle of round-robin compaction completes in each level before any file's data age reaches the TTL), there would be no files compacted for TTL reasons at all.
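If the Rust binding doesn't expose this setter directly, it can also be set through the OPTIONS file you shared, along the lines of this sketch (the CF section name depends on your column family):

 [CFOptions "my_cf"]
   compaction_pri=kRoundRobin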

zaidoon1 commented 4 months ago

got it, that's definitely good to know. I'll try out kRoundRobin and report back.

zaidoon1 commented 2 months ago

Here are the results after switching to kRoundRobin:

Screenshot 2024-07-04 at 11 40 55 AM

I don't see much of a difference. Here are some interesting stats from the time of one spike:

Screenshot 2024-07-04 at 11 56 56 AM

I don't have a reason to think it's something other than compaction, given that rate limiting compaction solved the problem. I am using Ribbon filters instead of Bloom filters, which are supposed to use more CPU, but I don't think that's related, since the rate limiting wouldn't have affected them.

I'm thinking of a potential workaround. Right now I have 4 CFs with the same TTL triggering compaction; what if I stagger the TTLs slightly so they don't all trigger at the same time? In theory this should reduce the number of compactions running at any given time (at least the TTL-triggered ones, not the write-rate-triggered ones), and that should reduce CPU usage. Thoughts?
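Something like this, roughly (a sketch; it assumes the binding exposes a set_periodic_compaction_seconds setter, and the offsets are just for illustration):

 // Stagger the TTL-driven compactions so the 4 CFs don't all
 // kick off cleanup compactions at the same time.
 // (assumes a set_periodic_compaction_seconds setter is available)
 cf_opts_a.set_periodic_compaction_seconds(1800);
 cf_opts_b.set_periodic_compaction_seconds(1800 + 300);
 cf_opts_c.set_periodic_compaction_seconds(1800 + 600);
 cf_opts_d.set_periodic_compaction_seconds(1800 + 900);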

zaidoon1 commented 2 months ago

I'm thinking of a potential workaround. Right now I have 4 CFs with the same TTL triggering compaction; what if I stagger the TTLs slightly so they don't all trigger at the same time? In theory this should reduce the number of compactions running at any given time (at least the TTL-triggered ones, not the write-rate-triggered ones), and that should reduce CPU usage. Thoughts?

I tried this as well, and unfortunately that didn't help either.

zaidoon1 commented 2 months ago

I'm starting to think this is NOT related to compaction. Let's take a look here:

Screenshot 2024-07-06 at 3 38 04 PM

We can see the CPU spikes correlate with the block cache misses. @ajkr any idea what I should look at next?

zaidoon1 commented 2 months ago

Is it possible to get block cache stats (block cache misses) per CF so I can narrow down which CF is causing the issue?

zaidoon1 commented 2 months ago

At this point, my guess is the compression settings I'm using are to blame:

 opts.set_compression_type(DBCompressionType::Lz4);
 opts.set_bottommost_compression_type(DBCompressionType::Zstd);
 opts.set_bottommost_zstd_max_train_bytes(0, true);

I assume Zstd for the bottommost level is not great for CPU usage. Per https://rocksdb.org/blog/2021/05/31/dictionary-compression.html, I'm thinking of trying:

(zstd only) EXTRA_CXXFLAGS=-DZSTD_STATIC_LINKING_ONLY: Hold digested dictionaries in block cache to save repetitive deserialization overhead. This saves a lot of CPU for read-heavy workloads. This compiler flag is necessary because one of the digested dictionary APIs we use is marked as experimental. We still use it in production, however.

Is there a gotcha with enabling this option? I will not be enabling cache_index_and_filter_blocks with this to avoid a repeat of https://github.com/facebook/rocksdb/issues/12579.

And if that doesn't help then I'm thinking of just removing:

 opts.set_bottommost_compression_type(DBCompressionType::Zstd);
 opts.set_bottommost_zstd_max_train_bytes(0, true);

which would just use Lz4. Thoughts?

zaidoon1 commented 2 months ago

Also, if -DZSTD_STATIC_LINKING_ONLY is set correctly, should I expect https://github.com/facebook/rocksdb/blob/v9.3.1/monitoring/statistics.cc#L42-L49 to be populated?

zaidoon1 commented 2 months ago

After doing some testing, I can confirm the CPU spikes are related to compaction + Zstd compression in the bottommost level. Removing:

 opts.set_bottommost_compression_type(DBCompressionType::Zstd);
 opts.set_bottommost_zstd_max_train_bytes(0, true);

results in more stable CPU usage. However, it increases disk space usage by about 25%, which is not ideal. I'm thinking of setting the bottommost compression type to Lz4 at this point to gain some compression benefit while keeping CPU usage reasonable. @ajkr what do you think? Is there a way for me to keep Zstd while reducing the CPU usage impact?

ajkr commented 2 months ago

Is there a gotcha with enabling this option?

Sort of. A different option (max_dict_bytes) needs to also be enabled; otherwise there is no dictionary to build a digested dictionary from. Depending on your data patterns, you could find major space savings by enabling dictionary compression, especially considering your small data block size (4K).
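In the Rust binding that would look roughly like the following sketch (assuming it exposes set_bottommost_compression_options; the dictionary and training sizes are only examples):

 // Enable dictionary compression for the bottommost level.
 // Args (per the C API): window_bits, level, strategy, max_dict_bytes, enabled.
 opts.set_bottommost_compression_type(DBCompressionType::Zstd);
 opts.set_bottommost_compression_options(-14, 3, 0, 16 * 1024, true);
 // Sample more training data than the final dictionary size (illustrative value).
 opts.set_bottommost_zstd_max_train_bytes(256 * 1024, true);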

@ajkr what do you think? Is there a way for me to keep Zstd while reducing the CPU usage impact?

There are ways to shift it around like rate limiting or deprioritizing compaction CPU (LowerThreadPoolCPUPriority()).

Another consideration is that maybe the CPU impact is more about compression than decompression. Although the block cache misses point to decompression, the two are interleaved during compaction, so it could be either. If compression is a major source of CPU, you can try CompressionOptions::level < 3 (3 is the ZSTD default) in the bottommost compression settings.
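A rough sketch of both knobs from the Rust side (assuming the binding exposes Env::new, lower_thread_pool_cpu_priority, and set_bottommost_compression_options):

 use rocksdb::{Env, Options};

 let mut opts = Options::default();
 // Lower the CPU priority of the background (compaction) thread pool.
 let env = Env::new().expect("default env"); // assumed wrapper around the default Env
 env.lower_thread_pool_cpu_priority();
 opts.set_env(&env);
 // Cheaper zstd level for bottommost compression (zstd's default is 3).
 opts.set_bottommost_compression_options(-14, 1, 0, 0, true);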

zaidoon1 commented 2 months ago

Something was still not adding up in my graphs, so I disabled Zstd altogether for bottommost compression again and let it run for half a day or so. Here are all the stats I have collected.

Starting with the CPU spikes, we can see 3 spikes that I'm interested in finding the cause of:

Screenshot 2024-07-10 at 11 20 41 PM

Memory usage also went up; I assume that's related to disabling Zstd as well.

Here are all the other RocksDB stats I have; they don't show anything interesting happening at the time of each CPU spike:

[Screenshots (10): additional RocksDB stats dashboards, captured 2024-07-10]

At this point, I'm confused about what the real cause is; I don't see compaction-related writes/reads that match every single CPU spike.

Do you know what other metrics I should look at or have any theories?

ajkr commented 2 months ago

Maybe memtable flush? From the "Memory Usage of All Mem Tables" chart we can infer flush is happening to reduce the memory, even though "Number of Running Flushes" is always zero. Maybe the query for running flush count did not capture points when it was nonzero.

"Number of SST Files" also changes outside the time compaction is said to be running, so there could be hidden compactions happening too.

zaidoon1 commented 2 months ago

Maybe the query for running flush count did not capture points when it was nonzero.

This makes sense. I collect RocksDB metrics every minute, which is not great for capturing operations that run infrequently or finish very quickly. I'll update my collection to run every 5 seconds or so to get a better picture.

Also, now that you mention memtable flush: I'm pretty much running default RocksDB settings right now, but I do have the following configured:

opts.set_write_buffer_size(256 * 1024 * 1024);
opts.set_max_write_buffer_number(4);

This may be causing us to build up a large amount of data in memory, so when flushing, we flush a big amount of data all at once?
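(If I'm reading those options right, that's up to 256 MiB per memtable and up to 4 memtables per CF, so as much as 256 MiB × 4 = 1 GiB of unflushed data per CF, and each flush could be writing out a memtable of up to ~256 MiB in one go.)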

Also, while we're on this topic, something that didn't make sense to me is this metric:

Screenshot 2024-07-12 at 11 01 39 PM

Shouldn't we be building up more data in memory before flushing the memtables, given my configuration above?

Note: the graph on the left uses the CF property estimate-table-readers-mem as the metric source, and the one on the right uses size-all-mem-tables.
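(For context, I poll those per-CF properties roughly like this sketch, using property_int_value_cf; the CF name is made up:)

 // Sample the same properties the dashboards are built from.
 let cf = db.cf_handle("my_cf").expect("cf exists"); // hypothetical CF name
 let readers_mem = db.property_int_value_cf(cf, "rocksdb.estimate-table-readers-mem").unwrap();
 let memtable_bytes = db.property_int_value_cf(cf, "rocksdb.size-all-mem-tables").unwrap();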

Here is the options file as well for reference:

OPTIONS-000007.txt

zaidoon1 commented 2 months ago

Oh, I did think of something just now: every few hours, I create a checkpoint to allow a background service to crawl and clean up the DB (delete orphaned indexes).
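(The checkpoint itself is created roughly like this, a sketch using rust-rocksdb's checkpoint API with an illustrative path:)

 use rocksdb::checkpoint::Checkpoint;

 // Hard-links the live SST files into a new directory that the
 // background cleanup service can crawl.
 let checkpoint = Checkpoint::new(&db).expect("create checkpoint object");
 checkpoint.create_checkpoint("/some/checkpoint/dir").expect("create checkpoint");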

@ajkr would the metrics:

/// Bytes read/written while creating backups
BackupReadBytes("rocksdb.backup.read.bytes"),
BackupWriteBytes("rocksdb.backup.write.bytes"),

track checkpoints?

zaidoon1 commented 2 weeks ago

Maybe memtable flush? From the "Memory Usage of All Mem Tables" chart we can infer flush is happening to reduce the memory, even though "Number of Running Flushes" is always zero. Maybe the query for running flush count did not capture points when it was nonzero.

I can confirm it was memtable flush that was actually causing the CPU spikes. This is because I set max_total_wal_size to 15MB, since I was having issues with RocksDB taking a long time to start up after a restart due to WAL replay. I've solved that by flushing CFs before shutdown, so that when RocksDB starts up again it doesn't have to replay a big WAL.
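For anyone hitting the same thing, the relevant pieces look roughly like this sketch (CF names are made up):

 // The small WAL cap is what was forcing the frequent memtable flushes.
 opts.set_max_total_wal_size(15 * 1024 * 1024);

 // On graceful shutdown, flush each CF so the next startup
 // doesn't have to replay a large WAL.
 for name in ["cf_a", "cf_b", "cf_c", "cf_d"] {
     if let Some(cf) = db.cf_handle(name) {
         db.flush_cf(cf).expect("flush before shutdown");
     }
 }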

zaidoon1 commented 2 weeks ago

I'm going to close this issue since it's not related to compaction, but I will most likely create a separate issue about how to make memtable flush less "noticeable" CPU-usage-wise. Thank you for the help!