
Rocksdb using up all memory due to index/filter blocks #12942

Open zaidoon1 opened 3 weeks ago

zaidoon1 commented 3 weeks ago

A lot of historical context can be found here: https://github.com/facebook/rocksdb/issues/12579

Background: when I first started using rocksdb, I had the following options set:

  cache_index_and_filter_blocks: 1
  cache_index_and_filter_blocks_with_high_priority: 1
  pin_top_level_index_and_filter: 1
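
For reference, these keys map one-to-one onto fields of the C++ BlockBasedTableOptions struct; a minimal sketch, assuming the stock block-based table factory:

      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      // Sketch only: the same three settings, expressed via the C++ API.
      rocksdb::Options MakeOptions() {
        rocksdb::BlockBasedTableOptions t;
        t.cache_index_and_filter_blocks = true;
        t.cache_index_and_filter_blocks_with_high_priority = true;
        t.pin_top_level_index_and_filter = true;

        rocksdb::Options options;
        options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
        return options;
      }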

Then I saw that rocksdb was maxing out memory/cpu usage, and after opening the issue above and investigating with the help of @ajkr, we realized it was caused by index/filter blocks thrashing. The workarounds proposed at the time (https://github.com/facebook/rocksdb/issues/12579#issuecomment-2094349798) were:

 cache_index_and_filter_blocks=true with unpartitioned_pinning = PinningTier::kAll: index block memory usage counts towards block cache capacity. Pinning prevents potential thrashing.

or

cache_index_and_filter_blocks=false: Index block memory usage counts toward table reader memory, which is not counted towards block cache capacity by default. Potential thrashing is still prevented because they are preloaded and non-evictable in table reader memory.
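
For what it's worth, here's how I understand the two alternatives in C++ terms (my own sketch, not verbatim from that comment; the pinning knob lives under metadata_cache_options):

      rocksdb::BlockBasedTableOptions t;

      // Alternative 1: index/filter blocks live in (and are charged to) the
      // block cache, but are pinned at all levels so they can't thrash.
      t.cache_index_and_filter_blocks = true;
      t.metadata_cache_options.unpartitioned_pinning = rocksdb::PinningTier::kAll;

      // Alternative 2: keep index/filter blocks in table reader memory:
      // preloaded, non-evictable, not charged to the block cache.
      // t.cache_index_and_filter_blocks = false;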

and I went with just setting cache_index_and_filter_blocks=false. This was fine for many months, until recently rocksdb started using up all memory again:

[Three screenshots from 2024-08-18: memory-usage graphs; the last one breaks down index/filter block memory per column family]

As we can see from the last screenshot, the "url" cf's index/filter blocks started using a significant amount of memory; presumably the read patterns changed, since we only see this on a few machines.

First, it's odd that the url cf is using this much to begin with, even if we were loading all the index/filter blocks (due to not caching index and filter blocks in block cache). There are about 150 SST files for the url cf; I'm using a prefix extractor + ribbon filter with 10 bits and 302607723 kvs. The url cf is 8.34 GiB on disk, and yet index and filter blocks are somehow using 4.68 GB in memory and still growing. Something doesn't add up here.
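
Back-of-envelope on the filter side (my own rough math, treating 10 bits/key as an upper bound; with the prefix extractor, filters only hold distinct prefixes, so the real number should be much lower):

      filters: 302,607,723 keys × 10 bits / 8 ≈ 378 MB (upper bound)
      indexes: 4.68 GB observed − ≤ 0.38 GB filters ≈ 4.3+ GB

So the bulk of the 4.68 GB has to be index blocks.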

Second, let's assume this amount of usage is normal; what is the workaround here? Apparently I can't pin all index and filter blocks in memory because I don't have enough memory, but I also don't want to just enable cache_index_and_filter_blocks because I don't want to run into the thrashing issues from the previous issue linked above. Are there specific rocksdb settings that solve both problems?

I was reading https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks and it states:

For each data block we store three pieces of information in the index: a key, an offset, and a size. Therefore, there are two ways you can reduce the size of the index. If you increase the block size, the number of blocks will decrease, so the index size will also shrink linearly. By default our block size is 4KB, although we usually run with 16-32KB in production. The second way to reduce the index size is to reduce the key size, although that might not be an option for some use-cases.

Should I change the block size from the default 4KB to 16KB?
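
If I do, it looks like a one-line change on the table options (a sketch against the C++ API; it only affects newly written SST files, so existing files keep the old block size until compaction rewrites them):

      rocksdb::BlockBasedTableOptions t;
      t.block_size = 16 * 1024;  // default is 4 * 1024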

OPTIONS.txt

zaidoon1 commented 3 weeks ago

I ran the sst_dump tool to analyze the SST files. I did it on a server that is seeing the same memory pattern but has a much smaller db size, to make dumping the data easier/faster:

[Screenshot from 2024-08-18: memory usage on the smaller server]

sst_dump results: dump.txt
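
For reference, the properties were dumped with something along these lines (exact flags from memory, not copied from my shell history):

      sst_dump --file=/path/to/db --show_properties --command=none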

example:

  # data blocks: 39433
  # entries: 904072
  # deletions: 0
  # merge operands: 0
  # range deletions: 0
  raw key size: 260872449
  raw average key size: 288.552736
  raw value size: 0
  raw average value size: 0.000000
  data block size: 67110378
  index block size (user-key? 1, delta-value? 1): 44931865
  filter block size: 98293
  # entries for filter: 127655
  (estimated) table size: 112140536
  filter policy name: ribbonfilter
  prefix extractor name: uid_extractor
  column family ID: 1
  column family name: url
  comparator name: leveldb.BytewiseComparator
  user defined timestamps persisted: true
  merge operator name: nullptr
  property collectors names: []
  SST file compression algo: ZSTD
  SST file compression options: window_bits=-14; level=32767; strategy=0; max_dict_bytes=0; zstd_max_train_bytes=0; enabled=1; max_dict_buffer_bytes=0; use_zstd_dict_trainer=1; 
  creation time: 1712869978
  time stamp of earliest key: 0
  file creation time: 1724035417
  slow compression estimated data size: 0
  fast compression estimated data size: 0

Looking at the SST files for the url cf, the index block sizes of all the url SST files add up to the ~300MB of memory used by the url cf, so that tracks/makes sense.

Based on this, what options do I have to deal with large index blocks? I guess changing the block size from 4KB to 16KB should cut the index block size by roughly a factor of 4, if I'm understanding this correctly (see the sanity check below)? But more generally, as the db size grows, how do we make sure the index block size doesn't cause an OOM?
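
Sanity-checking that factor of 4 against the example file above, assuming the per-entry index cost (one key + offset + size) stays roughly constant:

      at  4 KiB blocks:  39,433 data blocks → 44,931,865 B index ≈ 1.1 KB per entry
      at 16 KiB blocks: ~ 9,858 data blocks → ≈ 44.9 MB / 4 ≈ 11.2 MB index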

zaidoon1 commented 3 days ago

Setting the block size to 16KB "solved" the problem for me, but only because my database is small; it's a workaround more than a solution.