Closed: christian-esken closed this issue 6 years ago.
Additional info: I tried deactivating the block cache via table_options.setNoBlockCache(true) and found that this does not change anything. The process still eats up memory.
Have you tried https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/memory_util.h ? I am not sure if it gives a complete picture but may be worth trying.
I saw that, but could not find it on the Java layer. Did I miss it? I also already tried to get memory usage/estimation via calls to db.getProperty("rocksdb.estimate-table-readers-mem") and similar, but I only get back 0 or low values like 230. I can only guess that the latter is a bug or omission in the RocksJava layer. Here is an output; I always get the same values.
"rocksdb.estimate-table-readers-mem" => 0
"rocksdb.size-all-mem-tables" => 696
"rocksdb.cur-size-all-mem-tables" => 696
I am pretty sure my code uses getProperty() correctly, as other values vary. For example, "rocksdb.num-running-compactions" is typically 0 or 1.
I might try to work around that. Can I open the same DB in read-only mode with a C++ program while the Java program is running the DB r/w? Can I then get the memory statistics, or won't they be supplied in read-only mode? (Update: I now presume that the statistics are per client and thus won't tell me much.)
I found how to properly get the memory statistics on the Java layer by digging into the JNI code. I will gather statistics with it and report back. Edit: if someone stumbles across this and also has a problem with memory statistics: the solution was to use the overloaded getProperty() that takes the column family.
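For anyone hitting the same thing, a minimal RocksJava sketch of the per-column-family lookup (the db and cf handles are placeholders for your own instances):

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Query memory-related properties for one specific column family instead of the default one.
static void printMemoryStats(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
    System.out.println("table readers  = " + db.getProperty(cf, "rocksdb.estimate-table-readers-mem"));
    System.out.println("all memtables  = " + db.getProperty(cf, "rocksdb.size-all-mem-tables"));
    System.out.println("cur memtables  = " + db.getProperty(cf, "rocksdb.cur-size-all-mem-tables"));
}
```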
"rocksdb.size-all-mem-tables" => 1234436376 (1,2GB) "rocksdb.cur-size-all-mem-tables" => 89890656 (0.1GB) Pinned memtables is (IMO) the difference: => 1234436376 - 89890656 = 1.144.545.720 (1,1GB)
The values above are no surprise. Everything fine with memtables. :-)
"rocksdb.estimate-table-readers-mem" => 35045385959 (35GB) "rocksdb.estimate-num-keys" => 9659039471 (9659 Million = 9,7 Billion)
These estimates are surprising to me.
I am wondering whether the key estimations could be very wrong - both mine and the one from RocksDB. Mine is based on DB size and average UNCOMPRESSED value size, so the actual number of keys could be massively higher.
Here is a graph of the memory usage of this RocksDB-based service:
[Timeline graph]
I will give the service more headroom by adding memory to the machine next week. After that I will post information on whether memory consumption stops at some point or whether it is a real leak.
Since the table readers are using a large amount of memory, have you tried setting cache_index_and_filter_blocks=true, which bounds the memory used for index and filter blocks by the block cache size?
The rocksdb.estimate-num-keys property can be very wrong. It uses a simple formula (total keys - 2 * total deletes) and doesn't take duplicated keys into consideration.
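As a worked example of that formula: a key written once and then deleted contributes 2 entries and 1 delete, so it counts as 2 - 2*1 = 0; a key overwritten three times still counts as 3 even though only one live key remains. And roughly, the suggested setting looks like this in RocksJava (a sketch; the exact setters depend on the RocksJava version, and the 2 GiB capacity is only illustrative):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

Options options = new Options();
BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
    .setCacheIndexAndFilterBlocks(true)            // index/filter blocks live in the block cache
    .setPinL0FilterAndIndexBlocksInCache(true)     // keep L0 index/filter blocks from being evicted
    .setBlockCacheSize(2L * 1024 * 1024 * 1024);   // cache capacity now bounds their memory (2 GiB here)
options.setTableFormatConfig(tableConfig);
```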
cache_index_and_filter_blocks is currently false. For details read the rest of this comment.
I used cache_index_and_filter_blocks=true until migrating to RocksDB 5.7.2. After upgrading to 5.7.2, our read latency increased from 30ms to 5 minutes (yes: MINUTES!). I never found the root cause of the read problems and presumed there was a regression for our use case in 5.7.2. The issue was fully reproducible on the same DB by switching back and forth between 5.4.5 and 5.7.2 (5.4.5 => OK, 5.7.2 => issue).
This could be fixed by setting cache_index_and_filter_blocks=false. I haven't yet dared to enable it again.
About the memory usage: I have shut down all other processes on the machine. RocksDB began eating more memory and then stopped at 57GB RSS (DB-Files on disk are 784GB). The SWAP is not completely filled, and that could be a good sign - possibly the allocations are not endless.
# free -m
total used free shared buff/cache available
Mem: 64306 48694 323 2 15288 14954
Swap: 7710 751 6959
I also did a breakdown of the memory consumption of the process: I ran pmap -X and grouped the mappings. The resident (RSS) figures are listed below.
Summarized, RocksDB should account for roughly 16GB + 15,23GB + 16,78GB = 48GB:
ANON : size 47,61GB, rss 30,39GB, rssReferenced 30,00GB, anonymous 30,39GB, swap 5,37MB, locked 0B
FILE : size 2,15MB, rss 32,77kB, rssReferenced 32,77kB, anonymous 0B, swap 0B, locked 0B
FILE_ROCKS_SST : size 800,58GB, rss 16,78GB, rssReferenced 16,42GB, anonymous 0B, swap 0B, locked 0B
HEAP : size 15,25GB, rss 15,23GB, rssReferenced 15,10GB, anonymous 15,23GB, swap 0B, locked 0B
JAVA : size 8,19kB, rss 4,10kB, rssReferenced 4,10kB, anonymous 4,10kB, swap 0B, locked 0B
LIBRARY_JAR : size 4,54MB, rss 0B, rssReferenced 0B, anonymous 0B, swap 0B, locked 0B
LIBRARY_SO : size 68,74MB, rss 9,90MB, rssReferenced 9,76MB, anonymous 851,97kB, swap 327,68kB, locked 0B
STACK : size 139,26kB, rss 12,29kB, rssReferenced 12,29kB, anonymous 12,29kB, swap 24,58kB, locked 0B
VMISC : size 20,48kB, rss 4,10kB, rssReferenced 4,10kB, anonymous 0B, swap 0B, locked 0B
How large are your index and bloom filter blocks and how large is your block cache?
From the symptoms you experience, it seems like you have huge index or bloom filter blocks in your DB. With cache_index_and_filter_blocks=false, the index and filter blocks are loaded on the heap, eating memory; with cache_index_and_filter_blocks=true, the huge blocks can be evicted from the block cache, which contributes to large read latency when they have to be loaded again.
You can find the index and filter sizes by running the following against sst files in your DB:
sst_dump --file=<sst_file> --command=noop --show_properties
This sst_dump is from an old *.sst file which is still present in the DB and uses large values (200 bytes).
du -sh 023043.sst
268M 023043.sst
./sst_dump --file=023043.sst --command=noop --show_properties
from [] to []
Process 023043.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 271525
# entries: 4845731
raw key size: 155063392
raw average key size: 32.000000
raw value size: 973971030
raw average value size: 200.995687
data block size: 268436030
index block size: 10532611
filter block size: 7950387
(estimated) table size: 286919028
filter policy name: rocksdb.BuiltinBloomFilter
column family ID: 1
column family name: cache
comparator name: leveldb.BytewiseComparator
merge operator name: nullptr
property collectors names: []
SST file compression algo: Snappy
creation time: 1509103180
time stamp of earliest key: 0
# deleted keys: 0
# merge operands: 0
Raw user collected properties
------------------------------
# rocksdb.block.based.table.index.type: 0x00000000
# rocksdb.block.based.table.prefix.filtering: 0x31
# rocksdb.block.based.table.whole.key.filtering: 0x31
# rocksdb.deleted.keys: 0x00
# rocksdb.merge.operands: 0x00
This sst_dump is from a new *.sst file which is still present in the DB (Level L0) and uses small values (26 bytes).
du -sh 690256.sst
122M 690256.sst
./sst_dump --file=690256.sst --command=noop --show_properties
from [] to []
Process 690256.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 15092
# entries: 2678161
raw key size: 85701152
raw average key size: 32.000000
raw value size: 69396830
raw average value size: 25.912120
data block size: 123085798
index block size: 577665
filter block size: 4017349
(estimated) table size: 127680812
filter policy name: rocksdb.BuiltinBloomFilter
column family ID: 1
column family name: cache
comparator name: leveldb.BytewiseComparator
merge operator name: nullptr
property collectors names: []
SST file compression algo: NoCompression
creation time: 1512734971
time stamp of earliest key: 1512734901
# deleted keys: 0
# merge operands: 0
Raw user collected properties
------------------------------
# rocksdb.block.based.table.index.type: 0x00000000
# rocksdb.block.based.table.prefix.filtering: 0x31
# rocksdb.block.based.table.whole.key.filtering: 0x31
# rocksdb.deleted.keys: 0x00
# rocksdb.merge.operands: 0x00
I see that filter + index use a "sensible" amount of space in the sst file, as shown in the table below. My new data format fares better (smaller values, 26 vs 200 bytes, and 8KB vs 4KB blocks). This is file usage though; how big would the data be in RAM? It cannot be translated literally. IIRC the filter block is compressed, while the index block is not. According to my earlier calculation and the sst_dump output, I guess that 10% of the whole SST table size should be more than enough for filters and index.
| | 023043.sst (old values) | Percentage of the sum | 690256.sst (new values) | Percentage of the sum |
|---|---|---|---|---|
| data_block_size | 268436030 | 93,56 | 123085798 | 96,40 |
| index_block_size | 10532611 | 3,67 | 577665 | 0,45 |
| filter_block_size | 7950387 | 2,77 | 4017349 | 3,15 |
| Sum of filter + index | 18482998 | 6,44 | 4595014 | 3,60 |
| Sum of the 3 blocks | 286919028 | 100 | 127680812 | 100 |
The machine with the service in question was upgraded from 64GB to 256GB today, and now after an hour the process takes 178GB RSS.
I do not see over-large indexes or filters. For example, for 690256.sst the size is slightly higher than calculated, but only by 20%:
Also, from the statistics I see not much pinned:
MemtablesSize = 157458080; // 157MB
MemtableUnflushedSize = 34534368; // 34MB
MemtablePinnedSize = 122923712; // 123 MB
EstimatedKeyCount = 15872519822; // 15,8 Billion
BloomAndFilterSizeEstimation = 45643281266; // 45GB
@siying
For comparison reasons I started to run two differently configured instances. I will check after the weekend how they are performing.
One with no block cache: table_options.setNoBlockCache(true);
2017/12/08-11:11:48.747656 7f3cf31f6700 table_factory options: flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f3ced495490)
cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 0
pin_l0_filter_and_index_blocks_in_cache: 0
index_type: 0
hash_index_allow_collision: 1
checksum: 1
no_block_cache: 1
block_cache: (nil)
block_cache_compressed: (nil)
persistent_cache: (nil)
block_size: 8192
block_size_deviation: 10
block_restart_interval: 16
index_block_restart_interval: 1
filter_policy: rocksdb.BuiltinBloomFilter
whole_key_filtering: 1
format_version: 0
One with block cache:
table_options
    .setPinL0FilterAndIndexBlocksInCache(true)
    .setCacheNumShardBits(globals.getInt("rocks.table.cacheNumShardBits", 6))
    .setNoBlockCache(false);
table_options.setCacheIndexAndFilterBlocks(true);
table_options.setBlockCacheSize(blockCacheSize); // 2.5 GiB
2017/12/08-14:53:36.977174 7f7bb5115700 table_factory options: flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f7bad36dd60)
cache_index_and_filter_blocks: 1
cache_index_and_filter_blocks_with_high_priority: 0
pin_l0_filter_and_index_blocks_in_cache: 1
index_type: 0
hash_index_allow_collision: 1
checksum: 1
no_block_cache: 0
block_cache: 0x7f7bad36d750
block_cache_name: LRUCache
block_cache_options:
capacity : 1610612736
num_shard_bits : 6
strict_capacity_limit : 0
high_pri_pool_ratio: 0.000
block_cache_compressed: (nil)
persistent_cache: (nil)
block_size: 8192
block_size_deviation: 10
block_restart_interval: 16
index_block_restart_interval: 1
filter_policy: rocksdb.BuiltinBloomFilter
whole_key_filtering: 1
format_version: 0
The over-the-weekend test is done. The differences are stark: the amount of "OS buffered/cached" memory is nearly identical, but the used and free values differ heavily.
With block cache:
$ free -m
total used free shared buff/cache available
Mem: 225586 12277 14240 17 199068 211739
Swap: 7710 74 7636
Without block cache:
$ free -m
total used free shared buff/cache available
Mem: 257842 51232 904 55 205705 204747
Swap: 7710 272 7438
My pmap analysis shows:
With block cache:
| | size | rss | rssReferenced | anonymous | swap | locked |
|---|---|---|---|---|---|---|
| ANON | 55,63GB | 10,76GB | 10,60GB | 10,76GB | 0B | 0B |
| FILE | 2,15MB | 397,31kB | 397,31kB | 0B | 0B | 0B |
| FILE_ROCKS_SST | 809,85GB | 18,72GB | 18,72GB | 0B | 0B | 0B |
| HEAP | 143,84MB | 136,54MB | 136,54MB | 136,54MB | 0B | 0B |
| JAVA | 8,19kB | 8,19kB | 8,19kB | 4,10kB | 0B | 0B |
| LIBRARY_JAR | 4,55MB | 798,72kB | 790,53kB | 0B | 0B | 0B |
| LIBRARY_SO | 68,74MB | 16,78MB | 16,45MB | 1,18MB | 0B | 0B |
| STACK | 139,26kB | 36,86kB | 12,29kB | 36,86kB | 0B | 0B |
| VMISC | 20,48kB | 4,10kB | 4,10kB | 0B | 0B | 0B |
Without block cache:
| | size | rss | rssReferenced | anonymous | swap | locked |
|---|---|---|---|---|---|---|
| ANON | 79,57GB | 49,18GB | 47,67GB | 49,18GB | 0B | 0B |
| FILE | 2,15MB | 434,18kB | 434,18kB | 0B | 0B | 0B |
| FILE_ROCKS_SST | 1260,80GB | 112,74GB | 112,74GB | 0B | 0B | 0B |
| HEAP | 2,16GB | 2,16GB | 2,14GB | 2,16GB | 0B | 0B |
| JAVA | 8,19kB | 8,19kB | 8,19kB | 4,10kB | 0B | 0B |
| LIBRARY_JAR | 4,54MB | 2,29MB | 2,29MB | 0B | 0B | 0B |
| LIBRARY_SO | 68,74MB | 22,18MB | 21,82MB | 1,18MB | 0B | 0B |
| STACK | 139,26kB | 40,96kB | 12,29kB | 40,96kB | 0B | 0B |
| VMISC | 20,48kB | 4,10kB | 4,10kB | 0B | 0B | 0B |
ANON allocations look fine; they seem to be in line with the RocksDB memory-usage wiki page. FILE_ROCKS_SST are the SST files mapped into memory, and I see an issue here: mmap-ing files is not a problem in itself, but a very large amount of them is resident in RAM (see the rss column). I assume that any file block that has been read at least once stays in RAM indefinitely, because it is held in the OS cache.
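If those mappings come from mmap-based reads, the relevant knob looks roughly like this in RocksJava (a sketch under the assumption that allow_mmap_reads is enabled in this setup; note that resident pages of file-backed mappings are reclaimable page cache, so the kernel can drop them under memory pressure):

```java
import org.rocksdb.Options;

// Hypothetical knob: with allow_mmap_reads disabled, SST data is read via pread()
// and cached only in the (reclaimable) OS page cache instead of being mapped
// into the process address space, where it shows up as file-backed RSS.
Options options = new Options().setAllowMmapReads(false);
```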
I observed the behavior for a week, dug deep into Linux memory management, and gained insight into the issue. Answering my own questions now:
Swap usage: The mystery of swap being used even though there is free memory is solved. Debian 9 puts services in a cgroup, and that cgroup does not follow /proc/sys/vm/swappiness. If others have similar issues, here is how to find your service/cgroup-specific settings:
# cat /proc/1052/cgroup | grep memory
4:memory:/system.slice/my-svc.service
# cat /sys/fs/cgroup/memory/system.slice/my-svc.service/memory.swappiness
60
# cat /proc/sys/vm/swappiness
0
We see that the cgroup's swappiness (60) diverges from the system-wide /proc/sys/vm/swappiness setting (0).
Memory leakage: There may still be a memory leak, as I see that about 10GB of memory is "lost" over the course of 6 days. I need to follow this over a longer period and see where, and whether, memory usage settles.
Summarizing a long ticket
I am now closing this ticket, as the first aspect ("swap usage") is not a RocksDB issue. If I have more insight about memory consumption or leaks I will open a new ticket. Thanks for the help @ajkr and @yiwu-arbug
Hi, nice to meet you. Did you resolve this problem?
I still see high memory usage, but I am not sure whether it is a true leak. I am still in the process of finding the root cause and proving it. The most likely cause of high memory usage in my case is iterators. In my opinion, iterators hold far more memory than documented in the section "Blocks pinned by iterators" on https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB.
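To illustrate the point about iterator lifetime, a minimal RocksJava sketch (db, cf and prefix are placeholders) that scopes an iterator with try-with-resources, so whatever it pins is released as soon as the scan ends:

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

// A short-lived scan: closing the iterator (and its ReadOptions) releases the blocks
// and memtables it was pinning, which a pooled or long-lived iterator would keep alive.
static void scanPrefix(RocksDB db, ColumnFamilyHandle cf, byte[] prefix) {
    try (ReadOptions ro = new ReadOptions();
         RocksIterator it = db.newIterator(cf, ro)) {
        for (it.seek(prefix); it.isValid(); it.next()) {
            byte[] key = it.key();
            byte[] value = it.value();
            // ... process key/value ...
            // (a real prefix scan would also stop once keys no longer start with the prefix)
        }
    }
}
```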
If you want a possible answer, this is what can contribute to the problem of high memory usage.
Effectively Iterators likely hold:
The latter could make your memory usage explode. If you use an Iterator pool, there are 2 issues:
Hey @christian-esken, thanks for your very detailed debugging, it is very interesting. I face a similar issue when I use my DB with many CFs and 1+ trillion keys. Several times my DB memory usage has increased by 20-23 GB (all the free memory I have) within a few hours, seemingly at random. My issue is like what you describe: memory increasing very slowly, day by day. I dumped the memory of the process and found several 256MB regions filled with zeroes. pmap shows these lines for them:
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked Mapping
7f2130681000 rw-s 00000000 00:14 275866516 4 4 4 4 0 0 0 0 0 0 0 KaE2af (deleted)
7f2130682000 rw-s 00000000 00:14 275866515 4 0 0 0 0 0 0 0 0 0 0 uoVzy9 (deleted)
7f2130683000 rw-s 00000000 00:14 275866514 4 0 0 0 0 0 0 0 0 0 0 Li57V3 (deleted)
7f2130684000 rw-s 00000000 00:14 275866513 4 0 0 0 0 0 0 0 0 0 0 Nw6GjY (deleted)
7f2130685000 rw-s 00000000 00:14 275866512 4 4 4 4 0 0 0 0 0 0 0 2k4gHS (deleted)
7f2130686000 rw-s 00000000 00:14 275866511 4 4 4 4 0 0 0 0 0 0 0 Fc7R4M (deleted)
7f2130687000 rw-s 00000000 00:14 275866510 4 4 4 4 0 0 0 0 0 0 0 dZdusH (deleted)
7f2130688000 rw-s 00000000 00:14 275866509 4 4 4 4 0 0 0 0 0 0 0 KZn7PB (deleted)
7f2130689000 rw-s 00000000 00:14 275866508 4 4 4 4 0 0 0 0 0 0 0 wSvLdw (deleted)
7f213068a000 rw-s 00000000 00:14 275866507 4 4 4 4 0 0 0 0 0 0 0 p0GqBq (deleted)
7f213068b000 rw-s 00000000 103:08 3407899 4 4 0 4 0 0 0 0 0 0 0 pym-6301-pj6joa5c (deleted)
7f213068c000 rw-p 00000000 00:00 0 1816 1816 1409 1748 1816 0 0 0 0 0 0
7f2130852000 rwxp 00000000 00:00 0 4 4 0 0 4 0 0 0 0 0 0
7f2130853000 rw-s 00000000 00:14 237344868 4 4 0 4 0 0 0 0 0 0 0 g3ghBt (deleted)
7f2130854000 rw-s 00000000 00:14 237344867 4 4 4 4 0 0 0 0 0 0 0 LpPrSA (deleted)
7f2130855000 rw-s 00000000 00:14 237344866 4 4 4 4 0 0 0 0 0 0 0 M7ZC9H (deleted)
7f2130856000 rw-s 00000000 00:14 237344865 4 4 4 4 0 0 0 0 0 0 0 VXOOqP (deleted)
7f2130857000 rw-s 00000000 00:14 237344864 4 4 0 4 0 0 0 0 0 0 0 75b1HW (deleted)
7f2130858000 rw-s 00000000 00:14 237344863 4 4 4 4 0 0 0 0 0 0 0 7QReZ3 (deleted)
I think that memory was used by the compiler, but I don't know why it is not freed... Most of my CFs have the option no_block_cache=true.
I use the Python layer for RocksDB (version 6.8.1). Regarding your suggestions: I iterate very rarely, and iterators are closed after use.
FYI, RAM usage seems to be directly correlated with the number of *.sst files. RAM usage can be clamped by limiting the number of open *.sst files by setting options.max_open_files to a suitable number. I am seeing approx 1GB of RAM for every 15 sst files. RAM usage appears to be about 3x disk usage.
Details:
- With options.max_open_files = -1 (the rocks default), the number of sst files shoots up to just under the ulimit -n setting (1024 is the bash default), after which I get silent data corruption. Closing the app and re-opening it again seems to give a database with the pre-corruption data in it.
- ulimit -n 4096 allows more open sst files, and vast amounts of RAM and disk to be consumed.
- options.max_open_files = 312 seems to limit the number of sst files to about 170, limits disk usage to about 6.6GB, and RAM usage to about 20GB.

The root cause of the large explosion in the number of sst files appears to be the app performing a rapid number of edits to some large values. The values are in the 50KB range, and they are being updated hundreds of times a second. This appears to cause rocks to write out tens of MBytes/sec of log records, which turns into the GB-per-minute or so increase in file storage space. I don't understand why compaction isn't mashing these back down to a manageable size.
The fact that RAM is 3x the disk usage suggests that each sst file is memory-mapped three times!?
My app also has a postgres backend, so I can compare. In synthetic benchmarks, rocks is 2x or 3x faster than postgres. Yayy! That's a lot! For the production data runs, it's about 10x slower. Booo! I don't understand the root cause of this, either.
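For anyone wanting to apply the same cap programmatically, a sketch in RocksJava terms (the environment of the original poster; in C++ the equivalent is the max_open_files field on Options). The path and the value 312 are only illustrative:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class OpenWithFileCap {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                 .setCreateIfMissing(true)
                 .setMaxOpenFiles(312);          // cap open table files and the memory their readers hold
             RocksDB db = RocksDB.open(options, "/tmp/example-db")) {
            // ... use db ...
        }
    }
}
```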
What is the symptom of the "silent corruption"? When you close and re-open the DB, is the original data there or is it still "corrupted"?
I am not that familiar with RocksDB 5.17 so I have no idea if anything has been fixed in this area since that time.
What is the symptom of the "silent corruption"?
Data that I expect to be present goes missing, as if the entire record were absent. To catch it in the act, I also wrote a log of everything written, as an ASCII file. I can see that a write occurs for that key, with the correct data, whereas a later read returns nothing. At this point, the app is check-stopped, and I can manually verify that the expected key-value pair really is missing.
When you close and re-open the DB, is the original data there or is it still "corrupted"?
When I exit the app and restart, I can see that the original write had gone through, as it is now holding the expected data. I cannot close and reopen without exiting the app because, despite closing, there are 980 file descriptors that remain open to the 980 sst files, and ulimit prevents the open from succeeding with a "too many open files" error. My guess is that rocksdb hits the ulimit max-open-files limit, doesn't notice, and stops returning data to Get() and iterators. Then, upon close, it forgets that it has all these other files open, and leaves them open.
A more elegant response would have been to run a compaction cycle and reduce the number of open files, but perhaps it is trying to avoid error amplification and just refuses to do anything more.
I hadn't noticed the RAM usage until I increased ulimit -n 4096, after which I ran out of RAM (on a 256 GB box ... while read-modify-writing a 500MB db... Wow!)
The leaking file descriptor problem persists into rocksdb version 6.19 (github master branch as of yesterday). After closing rocksdb, lsof -p pid | grep sst | wc shows that there are 71 file descriptors still open. Reopening and closing again causes compaction to run, which removes many of these files; lsof -p pid | grep sst | grep deleted | wc shows that there are 67 open file descriptors to sst files that have been deleted. Leaving the application idle overnight (12 hours) does not change things; those descriptors appear to have been permanently lost.
Any chance you have something holding old files from being deleted, e.g. very old iterators not being destroyed, or some scheduled compactions never being run? Some DB properties to check: "rocksdb.num-live-versions" (old objects that might hold old files), "rocksdb.total-sst-files-size" and "rocksdb.live-sst-files-size". The latter counts the live SST files in the latest LSM-tree, while the former also includes stale files still held by something.
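A quick way to read those properties from the Java side (a sketch; db and cf are placeholders, and the property names are the ones listed above):

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// If total-sst-files-size stays well above live-sst-files-size, something
// (e.g. an unreleased iterator or snapshot) is still holding on to stale files.
static void checkStaleFiles(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
    long liveVersions = db.getLongProperty(cf, "rocksdb.num-live-versions");
    long totalSst     = db.getLongProperty(cf, "rocksdb.total-sst-files-size");
    long liveSst      = db.getLongProperty(cf, "rocksdb.live-sst-files-size");
    System.out.printf("live versions=%d, total sst=%d bytes, live sst=%d bytes%n",
                      liveVersions, totalSst, liveSst);
}
```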
@siying Wow! Thank you! I'm coding in C++ and made the apparently faulty and foolish assumption that iterators were smart pointers that self-delete when they go out of block scope. Looking at <rocksdb/db.h> I see that is not the case. Yikes! Thank you, yes, you are right: all my RAM/disk problems are almost surely due to the failure to delete iterators. I will remeasure and report findings shortly... done. Wow! All RAM and file usage problems appear to be completely resolved!
It's been a decade since I last coded in a C++ project that required an explicit delete; it never even occurred to me. Based on my 100% newbie error, I humbly suggest that future versions of the C++ bindings to rocks might convert to using smart pointers for iterators. In C++ it's not hard: a few lines of code in the header files. Memory usage of smart pointers is about 3 long words, and CPU overhead is low: under the covers they increment a counter when created and decrement it when the pointer goes out of scope. The only issue is how to make this change in a backwards-compatible way: all the old users will be calling delete, which will confuse the shared-pointer control block when it also goes to delete. Hmm.
@linas How do I set options.max_open_files = 300? I am facing this error: put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files
I would like to try your method to see if it solves my problem. Thanks.
Try ulimit -a at the bash prompt. Also see man getrlimit.
I tried to increase the limit with ulimit -n 16384, but benchmark.sh then keeps running until the disk is full (501GB used) and errors out with: put error: IO error: No space left on deviceWhile appending to file: /data/pmem0/db/001049.sst: No space left on device
@linas Could you guide me how to set options.max_open_files = 300?
The default benchmark.sh appears to work on a data set of 1.5TB. With some compaction styles it can go as high as 3TB. If you are OK with running a smaller size, you can override the number of keys with the environment variable NUM_KEYS. Even our official benchmark now runs with a much smaller number of keys: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks
@siying Okay, I will try to reduce NUM_KEYS. Do you know how to calculate how many keys the disk can take, given its space?
I want to know whether rocksdb itself closes its iterators or not. I found that memory grows without limit and never decreases. I don't use any iterators in my code; I only use the Get() interface. So I guess rocksdb uses lots of iterators internally for Get() and does not close them.
I have a rocksdb of 400GB size. It uses 40GB RSS and keeps growing, eating up all the Swap. Finally it eats up all memory. The rocksdb version is 5.8.6.
Expected behavior
According to https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB the RocksDB instance should only use about 8GB - I have done calculations with a worst-case scenario.
Index size: 4,61 GB
Bloom Filters: 3.9 GB
Memtable: 10*128MB = 1,28GB
Block Cache: 0.5 GB
Blocks pinned by iterators: 14MB
Sum: Approximately 10,3GB
I also added memory not mentioned on the Wiki, namely:
Memtables pinned by iterators: 1,28GB (estimation: presuming they pin some older memtables)
Compaction (I understand that it could double Index and Bloom): 4.61 + 3.9 GB
Sum: Approximately 9,8GB
All together, taking the worst case scenario, this is 20 GB.
Actual behavior
RocksDB takes way more memory than expected and finally eats up all memory. The memory consumption is far above expectations: Even shortly after start it requires 33GB RSS.
Detail: Swap is filled before buffers/cache. As you can see below, the swap is full (7GB) but there is still lots of data cached (14GB), so I guess RocksDB is pinning data in memory. The memory loss happens while reading from the DB via a prefix iterator (seek, next); when just writing (on average 100MB/s to SSD) we do not lose any memory.
dbdir # free -m
total used free shared buff/cache available
Mem: 64306 49196 293 12 14816 14440
Swap: 7710 7710 0
RocksDB is being used embedded in Java via rocksdbjni. The Java part requires less than 1GB of heap. Memory usage of the process, as taken from top: VIRT 0.450t, RES 0.048t, SHR 0.012t.
I have run pmap on the process, and summed up RSS for the .sst files: The sum for .sst is 34GB.
Steps to reproduce the behavior
We can reproduce the behavior in our application quickly. It never happens if we just write. As soon as we allow clients to read, the memory loss happens. The only operations on this DB are put, prefix-iterator seek() + next(), and deleteRange. One important observation: we use iterators for a short time, then close them to avoid resource locking by the iterator, and create a new iterator. If we choose a shorter time until discarding an iterator, we create more iterators and the memory loss is faster.