facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

High read latencies even with sufficiently large block cache size #11959

Closed: areyohrahul closed this issue 11 months ago

areyohrahul commented 11 months ago

Context:

Hi, my RocksDB workload is entirely read-based.

The data is generated on a single server, and the DB is then replicated to all the replicas, which bootstrap from it.

There are roughly 20M keys and the data size is around 12G.

Each value is a serialized POJO, keyed by its ID.

Problem:

My read latencies are way too high, so I have tried a number of things to optimize them:

I changed the filter block bits, pinned L0 blocks in the cache, and increased the block cache size to more than the DB size.

Even after all this, my data block cache misses are almost equal to the block cache hits. Strangely, my LRU cache usage is still 12.36G out of 14G. What should I do?

One more observation: increasing the block size has no impact on the read latencies, even though the cache is not fully utilized. The reads, which are all random, always drive up disk utilization. However, if the DB is compacted before reading, no disk utilization is observed and the read latencies stay within limits.

Why would reads go to disk if the entire DB is in the cache? I'm sure no compaction or any other disk-using process was running at the time.

I'm attaching my LOG file here as well - https://gist.github.com/areyohrahul/ff8594c7413d79cc5fab2c4eacbe17dd
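For reference, the tuning steps described above can be sketched with the RocksJava API. This is only a hedged sketch: the 14G capacity and the 10 bits/key Bloom filter are illustrative guesses, not values taken from the actual config, and method names can vary slightly between RocksJava versions.

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.BloomFilter;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;

public class CacheTuningSketch {
    public static Options buildOptions() {
        // Block cache larger than the ~12G data set described in the issue
        final LRUCache cache = new LRUCache(14L * 1024 * 1024 * 1024);

        final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockCache(cache)
                // Bloom filter; 10 bits/key is a common default, the issue
                // says the bits were changed but not to what value
                .setFilterPolicy(new BloomFilter(10))
                // keep index/filter blocks in the block cache, and pin the
                // L0 ones so they cannot be evicted
                .setCacheIndexAndFilterBlocks(true)
                .setPinL0FilterAndIndexBlocksInCache(true);

        return new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(tableConfig);
    }
}
```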

areyohrahul commented 11 months ago

I was looking at the LRUCache docs here https://github.com/facebook/rocksdb/wiki/Block-Cache#lru-cache

It says that a lock is acquired even during a read from a shard. I also remember reading somewhere that if fetching data from the cache takes longer than a certain threshold (because of locking, etc.), that counts as a cache miss and the data is fetched from disk.

Is this observation correct?

If yes, how do I fix it? Should I increase the num_shard_bits?
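For what it's worth, the shard count can be set explicitly when the cache is constructed. A hedged sketch using the RocksJava `LRUCache` constructor; the values here are illustrative, not recommendations:

```java
import org.rocksdb.LRUCache;

public class ShardedCacheSketch {
    // 14 GiB cache split into 2^6 = 64 shards; each shard has its own
    // mutex, so more shards means less lock contention on concurrent reads
    static final LRUCache CACHE = new LRUCache(
            14L * 1024 * 1024 * 1024, // capacity in bytes
            6);                        // num_shard_bits
}
```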

areyohrahul commented 11 months ago

I'm now fairly convinced that this problem is caused by an inappropriate block cache config, because the results are very inconsistent: sometimes I can fetch 5M keys/min and sometimes only 500K/min, with identical configuration in each run.

But how do I find the right config for my cache? Which factors should be considered?

@jowlyzhang can you help, please?

wolfkdy commented 11 months ago

After reading your attached log file, I found

 block_cache_options:
    capacity : 8388608

It seems you should configure your block cache size explicitly, and make sure your machine's memory is larger than the configured block cache size, which in turn should be larger than your working set, i.e. PhysicalMem > BlockCacheSize > WorkingSetSize.
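For context, 8388608 bytes works out to 8 MiB, which is RocksDB's default LRU block cache capacity and far smaller than the ~12G data set described in the issue. A quick arithmetic check in plain Java, using the capacity from the log and the data size from the issue text:

```java
public class CacheSizeCheck {
    public static void main(String[] args) {
        long capacity = 8_388_608L;           // value from the attached LOG
        long mib = capacity / (1L << 20);     // bytes -> MiB
        System.out.println(mib);              // 8: the 8 MiB default

        long workingSet = 12L << 30;          // ~12G data set from the issue
        System.out.println(capacity < workingSet); // true: cache far too small
    }
}
```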

areyohrahul commented 11 months ago

Hey @wolfkdy, thanks for replying. The CF to look at here is f::s::n::l; its block cache size is roughly 14G.

areyohrahul commented 11 months ago

Also, I don't see any eviction logs for the cache. So even if cache misses are happening, the cache should eventually hold everything and serve reads from there, right?

My main concern is why cache misses keep happening even though my cache is quite large and no evictions are seen.
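One way to confirm where the misses come from is to enable RocksDB statistics and read the block cache tickers directly instead of relying on the LOG. A hedged RocksJava sketch; the exact method and ticker names are version-dependent, so treat them as assumptions:

```java
import org.rocksdb.Options;
import org.rocksdb.Statistics;
import org.rocksdb.TickerType;

public class CacheStatsSketch {
    public static void report() {
        final Statistics stats = new Statistics();
        final Options options = new Options().setStatistics(stats);

        // ... open the DB with `options` and run the read workload ...

        long hits   = stats.getTickerCount(TickerType.BLOCK_CACHE_HIT);
        long misses = stats.getTickerCount(TickerType.BLOCK_CACHE_MISS);
        System.out.printf("hit ratio: %.2f%n",
                (double) hits / (hits + misses));
    }
}
```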