facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0
28.09k stars 6.25k forks

GetAggregatedIntProperty reports confusing results for properties such as "block-cache-capacity" and "block-cache-usage" #12687

Closed jxz27 closed 2 months ago

jxz27 commented 2 months ago

I created a 10MB block cache object shared across, say, 10 column families (in addition to the default one). I then called GetAggregatedIntProperty with "block-cache-capacity". The LOG file confirmed that all column families pointed to the same block cache object.

Expected behavior

Since the block cache object is shared by all column families, I expect it to return 10MB as the capacity.

Actual behavior

However, it returns 10MB * 11 = 110MB as the capacity.
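The arithmetic can be reproduced with a small standalone model. This is only a sketch of the suspected over-counting, assuming the aggregate is a plain sum of each column family's value; `ModelCache`, `ModelColumnFamily`, `NaiveAggregatedCapacity`, and `MakeSharedCacheFamilies` are hypothetical names for illustration, not RocksDB types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins for illustration only; not RocksDB types.
struct ModelCache {
  uint64_t capacity_bytes;
};

struct ModelColumnFamily {
  std::shared_ptr<ModelCache> block_cache;
};

// Assumed behavior: add up every column family's reported capacity
// without checking whether the cache object behind it is shared.
uint64_t NaiveAggregatedCapacity(const std::vector<ModelColumnFamily>& cfs) {
  uint64_t total = 0;
  for (const auto& cf : cfs) {
    assert(cf.block_cache != nullptr);
    total += cf.block_cache->capacity_bytes;
  }
  return total;
}

// Builds `count` column families that all point at one cache object.
std::vector<ModelColumnFamily> MakeSharedCacheFamilies(size_t count,
                                                       uint64_t capacity_bytes) {
  auto cache = std::make_shared<ModelCache>(ModelCache{capacity_bytes});
  return std::vector<ModelColumnFamily>(count, ModelColumnFamily{cache});
}
```

With 11 column families (10 plus the default) sharing one 10MB cache, `NaiveAggregatedCapacity` yields 11 times the real capacity, which matches the 110MB reading.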

Steps to reproduce the behavior

hx235 commented 2 months ago

Any interest in fixing this?

jonahgao commented 2 months ago

I'd like to try this.

ngc4579 commented 2 months ago

Hm, we might have run into this as well. We've set a shared bounded cache in our Kafka Streams applications, restricting RocksDB off-heap memory to about 3Gi (as suggested here: https://docs.confluent.io/platform/current/streams/developer-guide/memory-mgmt.html#rocksdb). Metrics such as block_cache_usage or block_cache_pinned_usage now report absurd readings. We get, for example, several hundred mebibytes per task, which, if true, would instantly OOMKill our apps and nodes... could this be related?

(In contrast, e.g. size_all_memtables shows more reasonable values and might not be affected by the issue...?)

jonahgao commented 2 months ago

This issue will only affect the statistical information displayed to the user and will not impact the functioning of the cache. Therefore, I think it should be unrelated.

ngc4579 commented 2 months ago

@jonahgao Yes, that's exactly what I was referring to - statistical information, i.e. metric readings. Caches seem to be functioning normally, at least as far as I can tell. Only the reported metrics seem unreasonably high.

jonahgao commented 2 months ago

That might be related if the metrics are collected through GetAggregatedIntProperty and multiple column families share one block cache.
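To make that condition concrete, here is a minimal sketch comparing a per-column-family sum with a sum that counts each distinct cache object once. It is a model under stated assumptions, not RocksDB code and not necessarily how the eventual fix works; `SketchCache`, `SketchColumnFamily`, `SumPerColumnFamily`, and `SumPerDistinctCache` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for illustration only; not RocksDB types.
struct SketchCache {
  uint64_t usage_bytes;
};

struct SketchColumnFamily {
  std::shared_ptr<SketchCache> block_cache;
};

// Adds each column family's value, so a shared cache is counted repeatedly.
uint64_t SumPerColumnFamily(const std::vector<SketchColumnFamily>& cfs) {
  uint64_t total = 0;
  for (const auto& cf : cfs) {
    assert(cf.block_cache != nullptr);
    total += cf.block_cache->usage_bytes;
  }
  return total;
}

// Counts each distinct cache object once, keyed by its address.
uint64_t SumPerDistinctCache(const std::vector<SketchColumnFamily>& cfs) {
  std::unordered_set<const SketchCache*> seen;
  uint64_t total = 0;
  for (const auto& cf : cfs) {
    assert(cf.block_cache != nullptr);
    if (seen.insert(cf.block_cache.get()).second) {
      total += cf.block_cache->usage_bytes;
    }
  }
  return total;
}
```

In this model the two sums differ only when at least two column families point at the same cache object. With one column family, or with a separate cache per column family, they agree.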

ajkr commented 2 months ago

Fixed by #12755