Closed: christian-esken closed this issue 6 years ago.
Additional info: I tried deactivating the block cache via table_options.setNoBlockCache(true) and found that this does not change anything. The process still eats up memory.
Have you tried https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/memory_util.h ? I am not sure if it gives a complete picture but may be worth trying.
I saw that, but could not find it on the Java layer. Did I miss it? I also already tried to get memory usage/estimation via calls to db.getProperty("rocksdb.estimate-table-readers-mem") and similar, but I only get back 0 or low values like 230. I can only guess that the latter is a bug or omission in the RocksJava layer. Here is an output; I always get the same values.
"rocksdb.estimate-table-readers-mem" => 0
"rocksdb.size-all-mem-tables" => 696
"rocksdb.cur-size-all-mem-tables" => 696
I am pretty sure my code uses getProperty() correctly, as other values vary. For example, "rocksdb.num-running-compactions" is typically 0 or 1.
I might try to work around that. Can I open the same DB in read-only mode with a C++ program while the Java program is running the DB r/w? Can I then get the memory statistics, or won't they be supplied in read-only mode? (Update: I now presume that the statistics are per client and thus won't tell me much.)
I found how to properly get the memory statistics on the Java layer by digging into the JNI code. I will gather statistics with it and report back. Edit: if someone stumbles across this and also has a problem with memory statistics: the solution was to use the overloaded getProperty() that takes the column family.
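For anyone hitting the same thing, a minimal RocksJava sketch of the per-column-family lookup (the db and cf handles are placeholders for your own instances):

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// Query memory-related properties for one specific column family instead of the default one.
static void printMemoryStats(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
    System.out.println("table readers  = " + db.getProperty(cf, "rocksdb.estimate-table-readers-mem"));
    System.out.println("all memtables  = " + db.getProperty(cf, "rocksdb.size-all-mem-tables"));
    System.out.println("cur memtables  = " + db.getProperty(cf, "rocksdb.cur-size-all-mem-tables"));
}
```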
"rocksdb.size-all-mem-tables" => 1234436376 (1,2GB) "rocksdb.cur-size-all-mem-tables" => 89890656 (0.1GB) Pinned memtables is (IMO) the difference: => 1234436376 - 89890656 = 1.144.545.720 (1,1GB)
The values above are no surprise. Everything fine with memtables. :-)
"rocksdb.estimate-table-readers-mem" => 35045385959 (35GB) "rocksdb.estimate-num-keys" => 9659039471 (9659 Million = 9,7 Billion)
These estimates are surprising to me.
I am wondering whether the key estimations could be very wrong - both mine and the one from RocksDB. Mine is based on DB size and average UNCOMPRESSED value size, so the actual number of keys could be massively higher.
Here is a graph of the memory usage of this RocksDB-based service:
[Timeline graph]
I will give the service more headroom by adding memory to the machine next week. After that I will post information on whether memory consumption stops at some point or whether it is a real leak.
Since the table readers are using a large amount of memory, have you tried setting cache_index_and_filter_blocks=true, which bounds the memory used for index and filter blocks by the block cache size?
The rocksdb.estimate-num-keys property can be very wrong. It uses a simple formula (total keys - 2 * total deletes) and doesn't take duplicated keys into consideration.
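As a worked example of that formula: a key written once and then deleted contributes 2 entries and 1 delete, so it counts as 2 - 2*1 = 0; a key overwritten three times still counts as 3 even though only one live key remains. And roughly, the suggested setting looks like this in RocksJava (a sketch; the exact setters depend on the RocksJava version, and the 2 GiB capacity is only illustrative):

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

Options options = new Options();
BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
    .setCacheIndexAndFilterBlocks(true)            // index/filter blocks live in the block cache
    .setPinL0FilterAndIndexBlocksInCache(true)     // keep L0 index/filter blocks from being evicted
    .setBlockCacheSize(2L * 1024 * 1024 * 1024);   // cache capacity now bounds their memory (2 GiB here)
options.setTableFormatConfig(tableConfig);
```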
cache_index_and_filter_blocks is currently false. For details read the rest of this comment.
I used cache_index_and_filter_blocks=true until migrating to RocksDB 5.7.2. After upgrading to 5.7.2, our read latency increased from 30ms to 5 minutes (yes: MINUTES!). I never found the root cause of the read problems and presumed there was a regression for our use case in 5.7.2. The issue was fully reproducible on the same DB by switching back and forth between 5.4.5 and 5.7.2 (5.4.5 => OK, 5.7.2 => issue).
This could be fixed by setting cache_index_and_filter_blocks=false. I haven't yet dared to enable it again.
About the memory usage: I have shut down all other processes on the machine. RocksDB began eating more memory and then stopped at 57GB RSS (DB-Files on disk are 784GB). The SWAP is not completely filled, and that could be a good sign - possibly the allocations are not endless.
# free -m
total used free shared buff/cache available
Mem: 64306 48694 323 2 15288 14954
Swap: 7710 751 6959
I also did a breakdown of the memory consumption of the process: I ran pmap -X and grouped the mappings. The resident (RSS) figures are listed below.
Summarized, RocksDB should account for roughly 16GB + 15,23GB + 16,78GB = 48GB:
ANON : size 47,61GB, rss 30,39GB, rssReferenced 30,00GB, anonymous 30,39GB, swap 5,37MB, locked 0B
FILE : size 2,15MB, rss 32,77kB, rssReferenced 32,77kB, anonymous 0B, swap 0B, locked 0B
FILE_ROCKS_SST : size 800,58GB, rss 16,78GB, rssReferenced 16,42GB, anonymous 0B, swap 0B, locked 0B
HEAP : size 15,25GB, rss 15,23GB, rssReferenced 15,10GB, anonymous 15,23GB, swap 0B, locked 0B
JAVA : size 8,19kB, rss 4,10kB, rssReferenced 4,10kB, anonymous 4,10kB, swap 0B, locked 0B
LIBRARY_JAR : size 4,54MB, rss 0B, rssReferenced 0B, anonymous 0B, swap 0B, locked 0B
LIBRARY_SO : size 68,74MB, rss 9,90MB, rssReferenced 9,76MB, anonymous 851,97kB, swap 327,68kB, locked 0B
STACK : size 139,26kB, rss 12,29kB, rssReferenced 12,29kB, anonymous 12,29kB, swap 24,58kB, locked 0B
VMISC : size 20,48kB, rss 4,10kB, rssReferenced 4,10kB, anonymous 0B, swap 0B, locked 0B
How large are your index and bloom filter blocks and how large is your block cache?
From the symptoms you experience, it seems like you have huge index or bloom filter blocks in your DB. With cache_index_and_filter_blocks=false, the index and filter blocks are loaded on the heap, eating memory; with cache_index_and_filter_blocks=true, the huge blocks can be evicted from the block cache, which contributes to large read latency when they have to be loaded again.
You can find the index and filter sizes by running the following against sst files in your DB:
sst_dump --file=<sst_file> --command=noop --show_properties
This sst_dump is from an old *.sst file which is still present in the DB and uses large values (200 bytes).
du -sh 023043.sst
268M 023043.sst
./sst_dump --file=023043.sst --command=noop --show_properties
from [] to []
Process 023043.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 271525
# entries: 4845731
raw key size: 155063392
raw average key size: 32.000000
raw value size: 973971030
raw average value size: 200.995687
data block size: 268436030
index block size: 10532611
filter block size: 7950387
(estimated) table size: 286919028
filter policy name: rocksdb.BuiltinBloomFilter
column family ID: 1
column family name: cache
comparator name: leveldb.BytewiseComparator
merge operator name: nullptr
property collectors names: []
SST file compression algo: Snappy
creation time: 1509103180
time stamp of earliest key: 0
# deleted keys: 0
# merge operands: 0
Raw user collected properties
------------------------------
# rocksdb.block.based.table.index.type: 0x00000000
# rocksdb.block.based.table.prefix.filtering: 0x31
# rocksdb.block.based.table.whole.key.filtering: 0x31
# rocksdb.deleted.keys: 0x00
# rocksdb.merge.operands: 0x00
This sst_dump is from a new *.sst file which is still present in the DB (Level L0) and uses small values (26 bytes).
du -sh 690256.sst
122M 690256.sst
./sst_dump --file=690256.sst --command=noop --show_properties
from [] to []
Process 690256.sst
Sst file format: block-based
Table Properties:
------------------------------
# data blocks: 15092
# entries: 2678161
raw key size: 85701152
raw average key size: 32.000000
raw value size: 69396830
raw average value size: 25.912120
data block size: 123085798
index block size: 577665
filter block size: 4017349
(estimated) table size: 127680812
filter policy name: rocksdb.BuiltinBloomFilter
column family ID: 1
column family name: cache
comparator name: leveldb.BytewiseComparator
merge operator name: nullptr
property collectors names: []
SST file compression algo: NoCompression
creation time: 1512734971
time stamp of earliest key: 1512734901
# deleted keys: 0
# merge operands: 0
Raw user collected properties
------------------------------
# rocksdb.block.based.table.index.type: 0x00000000
# rocksdb.block.based.table.prefix.filtering: 0x31
# rocksdb.block.based.table.whole.key.filtering: 0x31
# rocksdb.deleted.keys: 0x00
# rocksdb.merge.operands: 0x00
I see that filter + index use a "sensible" amount of space in the sst file, as shown in the table below. My new data format fares better (smaller values, 26 vs 200 bytes, and 8KB vs 4KB blocks). This is file usage though; how big would the data be in RAM? It cannot be translated literally. IIRC the filter block is compressed, while the index block is not. According to my earlier calculation and the sst_dump output, I guess that 10% of the whole SST table size should be more than enough for filters and index.
| | 023043.sst (old values) | Percentage of the sum | 690256.sst (new values) | Percentage of the sum |
|---|---|---|---|---|
| data_block_size | 268436030 | 93,56 | 123085798 | 96,40 |
| index_block_size | 10532611 | 3,67 | 577665 | 0,45 |
| filter_block_size | 7950387 | 2,77 | 4017349 | 3,15 |
| Sum of filter + index | 18482998 | 6,44 | 4595014 | 3,60 |
| Sum of the 3 blocks | 286919028 | 100 | 127680812 | 100 |
The machine with the service in question was upgraded from 64GB to 256GB today, and now after an hour the process takes 178GB RSS.
I do not see over-large indexes or filters. For example, for 690256.sst the size is slightly higher than calculated, but only by 20%:
Also, from the statistics I see not much pinned:
MemtablesSize = 157458080; // 157MB
MemtableUnflushedSize = 34534368; // 34MB
MemtablePinnedSize = 122923712; // 123 MB
EstimatedKeyCount = 15872519822; // 15,8 Billion
BloomAndFilterSizeEstimation = 45643281266; // 45GB
@siying
For comparison reasons I started to run two differently configured instances. I will check after the weekend how they are performing.
One with no block cache: table_options.setNoBlockCache(true);
2017/12/08-11:11:48.747656 7f3cf31f6700 table_factory options: flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f3ced495490)
cache_index_and_filter_blocks: 0
cache_index_and_filter_blocks_with_high_priority: 0
pin_l0_filter_and_index_blocks_in_cache: 0
index_type: 0
hash_index_allow_collision: 1
checksum: 1
no_block_cache: 1
block_cache: (nil)
block_cache_compressed: (nil)
persistent_cache: (nil)
block_size: 8192
block_size_deviation: 10
block_restart_interval: 16
index_block_restart_interval: 1
filter_policy: rocksdb.BuiltinBloomFilter
whole_key_filtering: 1
format_version: 0
One with block cache:
table_options
    .setPinL0FilterAndIndexBlocksInCache(true)
    .setCacheNumShardBits(globals.getInt("rocks.table.cacheNumShardBits", 6))
    .setNoBlockCache(false);
table_options.setCacheIndexAndFilterBlocks(true);
table_options.setBlockCacheSize(blockCacheSize); // 2.5 GiB
2017/12/08-14:53:36.977174 7f7bb5115700 table_factory options: flush_block_policy_factory: FlushBlockBySizePolicyFactory (0x7f7bad36dd60)
cache_index_and_filter_blocks: 1
cache_index_and_filter_blocks_with_high_priority: 0
pin_l0_filter_and_index_blocks_in_cache: 1
index_type: 0
hash_index_allow_collision: 1
checksum: 1
no_block_cache: 0
block_cache: 0x7f7bad36d750
block_cache_name: LRUCache
block_cache_options:
capacity : 1610612736
num_shard_bits : 6
strict_capacity_limit : 0
high_pri_pool_ratio: 0.000
block_cache_compressed: (nil)
persistent_cache: (nil)
block_size: 8192
block_size_deviation: 10
block_restart_interval: 16
index_block_restart_interval: 1
filter_policy: rocksdb.BuiltinBloomFilter
whole_key_filtering: 1
format_version: 0
The over-the-weekend test is done. The differences are stark: the amount of "OS buffered/cached" memory is nearly identical, but the used and free values differ heavily.
With block cache:
$ free -m
total used free shared buff/cache available
Mem: 225586 12277 14240 17 199068 211739
Swap: 7710 74 7636
Without block cache:
$ free -m
total used free shared buff/cache available
Mem: 257842 51232 904 55 205705 204747
Swap: 7710 272 7438
My pmap analysis shows:
With block cache:
| | size | rss | rssReferenced | anonymous | swap | locked |
|---|---|---|---|---|---|---|
| ANON | 55,63GB | 10,76GB | 10,60GB | 10,76GB | 0B | 0B |
| FILE | 2,15MB | 397,31kB | 397,31kB | 0B | 0B | 0B |
| FILE_ROCKS_SST | 809,85GB | 18,72GB | 18,72GB | 0B | 0B | 0B |
| HEAP | 143,84MB | 136,54MB | 136,54MB | 136,54MB | 0B | 0B |
| JAVA | 8,19kB | 8,19kB | 8,19kB | 4,10kB | 0B | 0B |
| LIBRARY_JAR | 4,55MB | 798,72kB | 790,53kB | 0B | 0B | 0B |
| LIBRARY_SO | 68,74MB | 16,78MB | 16,45MB | 1,18MB | 0B | 0B |
| STACK | 139,26kB | 36,86kB | 12,29kB | 36,86kB | 0B | 0B |
| VMISC | 20,48kB | 4,10kB | 4,10kB | 0B | 0B | 0B |
Without block cache:
| | size | rss | rssReferenced | anonymous | swap | locked |
|---|---|---|---|---|---|---|
| ANON | 79,57GB | 49,18GB | 47,67GB | 49,18GB | 0B | 0B |
| FILE | 2,15MB | 434,18kB | 434,18kB | 0B | 0B | 0B |
| FILE_ROCKS_SST | 1260,80GB | 112,74GB | 112,74GB | 0B | 0B | 0B |
| HEAP | 2,16GB | 2,16GB | 2,14GB | 2,16GB | 0B | 0B |
| JAVA | 8,19kB | 8,19kB | 8,19kB | 4,10kB | 0B | 0B |
| LIBRARY_JAR | 4,54MB | 2,29MB | 2,29MB | 0B | 0B | 0B |
| LIBRARY_SO | 68,74MB | 22,18MB | 21,82MB | 1,18MB | 0B | 0B |
| STACK | 139,26kB | 40,96kB | 12,29kB | 40,96kB | 0B | 0B |
| VMISC | 20,48kB | 4,10kB | 4,10kB | 0B | 0B | 0B |
ANON allocations look fine; they seem to be in line with the RocksDB memory-usage wiki page. FILE_ROCKS_SST are the SST files mapped into memory, and I see an issue here: mmap-ing files is not a problem in itself, but a very large amount of them is resident in RAM (see the rss column). I assume that any file block that has been read at least once stays in RAM indefinitely, because it is held in the OS cache.
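If those mappings come from mmap-based reads, the relevant knob looks roughly like this in RocksJava (a sketch under the assumption that allow_mmap_reads is enabled in this setup; note that resident pages of file-backed mappings are reclaimable page cache, so the kernel can drop them under memory pressure):

```java
import org.rocksdb.Options;

// Hypothetical knob: with allow_mmap_reads disabled, SST data is read via pread()
// and cached only in the (reclaimable) OS page cache instead of being mapped
// into the process address space, where it shows up as file-backed RSS.
Options options = new Options().setAllowMmapReads(false);
```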
I observed the behavior for a week, dug deep into Linux memory management, and gained insight into the issue. Answering my own questions now:
Swap usage: The mystery of swap being used even though there is free memory is solved. Debian 9 puts services in a cgroup, and that cgroup does not follow /proc/sys/vm/swappiness. If others have similar issues, here is how to find your service/cgroup-specific settings:
# cat /proc/1052/cgroup | grep memory
4:memory:/system.slice/my-svc.service
# cat /sys/fs/cgroup/memory/system.slice/my-svc.service/memory.swappiness
60
# cat /proc/sys/vm/swappiness
0
We see that the cgroup's swappiness (60) diverges from the system-wide /proc/sys/vm/swappiness setting (0).
Memory leakage: There may still be a memory leak, as I see that about 10GB of memory is "lost" over the course of 6 days. I need to follow this over a longer period and see where, and whether, memory usage settles.
Summarizing a long ticket
I am now closing this ticket, as the first aspect ("swap usage") is not a RocksDB issue. If I have more insight about memory consumption or leaks I will open a new ticket. Thanks for the help @ajkr and @yiwu-arbug
Hi, nice to meet you. Did you resolve this problem?
I still see high memory usage, but I am not sure whether it is a true leak. I am still in the process of finding the root cause and proving it. The most likely cause of high memory usage in my case is iterators. In my opinion, iterators hold far more memory than documented in the section "Blocks pinned by iterators" on https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB.
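To illustrate the point about iterator lifetime, a minimal RocksJava sketch (db, cf and prefix are placeholders) that scopes an iterator with try-with-resources, so whatever it pins is released as soon as the scan ends:

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.ReadOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

// A short-lived scan: closing the iterator (and its ReadOptions) releases the blocks
// and memtables it was pinning, which a pooled or long-lived iterator would keep alive.
static void scanPrefix(RocksDB db, ColumnFamilyHandle cf, byte[] prefix) {
    try (ReadOptions ro = new ReadOptions();
         RocksIterator it = db.newIterator(cf, ro)) {
        for (it.seek(prefix); it.isValid(); it.next()) {
            byte[] key = it.key();
            byte[] value = it.value();
            // ... process key/value ...
            // (a real prefix scan would also stop once keys no longer start with the prefix)
        }
    }
}
```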
If you want a possible answer, this is what can contribute to the problem of high memory usage.
Effectively Iterators likely hold:
The latter could make your memory usage explode. If you use an Iterator pool, there are 2 issues:
Hey @christian-esken, thanks for your very detailed debugging, it is very interesting. I face a similar issue when I use my DB with many CFs and 1+ trillion keys. Several times my DB memory usage has increased by 20-23 GB (all the free memory I have) within a few hours, seemingly at random. My issue is like what you describe: memory increasing very slowly, day by day. I dumped the memory of the process and found several 256MB regions filled with zeroes. pmap shows these lines for them:
Address Perm Offset Device Inode Size Rss Pss Referenced Anonymous ShmemPmdMapped Shared_Hugetlb Private_Hugetlb Swap SwapPss Locked Mapping
7f2130681000 rw-s 00000000 00:14 275866516 4 4 4 4 0 0 0 0 0 0 0 KaE2af (deleted)
7f2130682000 rw-s 00000000 00:14 275866515 4 0 0 0 0 0 0 0 0 0 0 uoVzy9 (deleted)
7f2130683000 rw-s 00000000 00:14 275866514 4 0 0 0 0 0 0 0 0 0 0 Li57V3 (deleted)
7f2130684000 rw-s 00000000 00:14 275866513 4 0 0 0 0 0 0 0 0 0 0 Nw6GjY (deleted)
7f2130685000 rw-s 00000000 00:14 275866512 4 4 4 4 0 0 0 0 0 0 0 2k4gHS (deleted)
7f2130686000 rw-s 00000000 00:14 275866511 4 4 4 4 0 0 0 0 0 0 0 Fc7R4M (deleted)
7f2130687000 rw-s 00000000 00:14 275866510 4 4 4 4 0 0 0 0 0 0 0 dZdusH (deleted)
7f2130688000 rw-s 00000000 00:14 275866509 4 4 4 4 0 0 0 0 0 0 0 KZn7PB (deleted)
7f2130689000 rw-s 00000000 00:14 275866508 4 4 4 4 0 0 0 0 0 0 0 wSvLdw (deleted)
7f213068a000 rw-s 00000000 00:14 275866507 4 4 4 4 0 0 0 0 0 0 0 p0GqBq (deleted)
7f213068b000 rw-s 00000000 103:08 3407899 4 4 0 4 0 0 0 0 0 0 0 pym-6301-pj6joa5c (deleted)
7f213068c000 rw-p 00000000 00:00 0 1816 1816 1409 1748 1816 0 0 0 0 0 0
7f2130852000 rwxp 00000000 00:00 0 4 4 0 0 4 0 0 0 0 0 0
7f2130853000 rw-s 00000000 00:14 237344868 4 4 0 4 0 0 0 0 0 0 0 g3ghBt (deleted)
7f2130854000 rw-s 00000000 00:14 237344867 4 4 4 4 0 0 0 0 0 0 0 LpPrSA (deleted)
7f2130855000 rw-s 00000000 00:14 237344866 4 4 4 4 0 0 0 0 0 0 0 M7ZC9H (deleted)
7f2130856000 rw-s 00000000 00:14 237344865 4 4 4 4 0 0 0 0 0 0 0 VXOOqP (deleted)
7f2130857000 rw-s 00000000 00:14 237344864 4 4 0 4 0 0 0 0 0 0 0 75b1HW (deleted)
7f2130858000 rw-s 00000000 00:14 237344863 4 4 4 4 0 0 0 0 0 0 0 7QReZ3 (deleted)
I think that memory was used by the compiler, but I don't know why it is not freed... Most of my CFs have the option no_block_cache=true.
I use the Python layer for RocksDB (version 6.8.1). Regarding your suggestions: I iterate very rarely, and iterators are closed after use.
FYI, RAM usage seems to be directly correlated with the number of *.sst files. RAM usage can be clamped by limiting the number of open *.sst files by setting options.max_open_files to a suitable number. I am seeing approx 1GB of RAM for every 15 sst files. RAM usage appears to be about 3x disk usage.
Details:
- With options.max_open_files = -1 (the rocks default), the number of sst files shoots up to just under the ulimit -n setting (1024 is the bash default), after which I get silent data corruption. Closing the app and re-opening it again seems to give a database with the pre-corruption data in it.
- ulimit -n 4096 allows more open sst files, and vast amounts of RAM and disk to be consumed.
- options.max_open_files = 312 seems to limit the number of sst files to about 170, limits disk usage to about 6.6GB, and RAM usage to about 20GB.

The root cause of the large explosion in the number of sst files appears to be the app performing a rapid number of edits to some large values. The values are in the 50KB range, and they are being updated hundreds of times a second. This appears to cause rocks to write out tens of MBytes/sec of log records, which turns into the GB-per-minute or so increase in file storage space. I don't understand why compaction isn't mashing these back down to a manageable size.
The fact that RAM is 3x the disk usage suggests that each sst file is memory-mapped three times!?
My app also has a postgres backend, so I can compare. In synthetic benchmarks, rocks is 2x or 3x faster than postgres. Yayy! That's a lot! For the production data runs, it's about 10x slower. Booo! I don't understand the root cause of this, either.
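For anyone wanting to apply the same cap programmatically, a sketch in RocksJava terms (the environment of the original poster; in C++ the equivalent is the max_open_files field on Options). The path and the value 312 are only illustrative:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class OpenWithFileCap {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                 .setCreateIfMissing(true)
                 .setMaxOpenFiles(312);          // cap open table files and the memory their readers hold
             RocksDB db = RocksDB.open(options, "/tmp/example-db")) {
            // ... use db ...
        }
    }
}
```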
What is the symptom of the "silent corruption"? When you close and re-open the DB, is the original data there or is it still "corrupted"?
I am not that familiar with RocksDB 5.17 so I have no idea if anything has been fixed in this area since that time.
What is the symptom of the "silent corruption"?
Data that I expect to be present goes missing, as if the entire record were absent. To catch it in the act, I also wrote a log of everything written, as an ASCII file. I can see that a write occurs for that key, with the correct data, whereas a later read returns nothing. At this point, the app is check-stopped, and I can manually verify that the expected key-value pair really is missing.
When you close and re-open the DB, is the original data there or is it still "corrupted"?
When I exit the app and restart, I can see that the original write had gone through, as it is now holding the expected data. I cannot close and reopen without exiting the app because, despite closing, there are 980 file descriptors that remain open to the 980 sst files, and ulimit prevents the open from succeeding with a "too many open files" error. My guess is that rocksdb hits the ulimit max-open-files limit, doesn't notice, and stops returning data to Get() and iterators. Then, upon close, it forgets that it has all these other files open, and leaves them open.
A more elegant response would have been to run a compaction cycle and reduce the number of open files, but perhaps it is trying to avoid error amplification and just refuses to do anything more.
I hadn't noticed the RAM usage until I increased ulimit -n 4096, after which I ran out of RAM (on a 256 GB box ... while read-modify-writing a 500MB db... Wow!)
The leaking file descriptor problem persists into rocksdb version 6.19 (github master branch as of yesterday). After closing rocksdb, lsof -p pid | grep sst | wc shows that there are 71 file descriptors still open. Reopening and closing again causes compaction to run, which removes many of these files; lsof -p pid | grep sst | grep deleted | wc shows that there are 67 open file descriptors to sst files that have been deleted. Leaving the application idle overnight (12 hours) does not change things; those descriptors appear to have been permanently lost.
Any chance you have something holding old files from being deleted, e.g. very old iterators not being destroyed, or some scheduled compactions never being run? Some DB properties to check: "rocksdb.num-live-versions" (old objects that might hold old files), "rocksdb.total-sst-files-size" and "rocksdb.live-sst-files-size". The latter counts the live SST files in the latest LSM-tree, while the former also includes stale files still held by something.
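A quick way to read those properties from the Java side (a sketch; db and cf are placeholders, and the property names are the ones listed above):

```java
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// If total-sst-files-size stays well above live-sst-files-size, something
// (e.g. an unreleased iterator or snapshot) is still holding on to stale files.
static void checkStaleFiles(RocksDB db, ColumnFamilyHandle cf) throws RocksDBException {
    long liveVersions = db.getLongProperty(cf, "rocksdb.num-live-versions");
    long totalSst     = db.getLongProperty(cf, "rocksdb.total-sst-files-size");
    long liveSst      = db.getLongProperty(cf, "rocksdb.live-sst-files-size");
    System.out.printf("live versions=%d, total sst=%d bytes, live sst=%d bytes%n",
                      liveVersions, totalSst, liveSst);
}
```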
@siying Wow! Thank you! I'm coding in C++ and made the apparently faulty and foolish assumption that iterators were smart pointers that self-delete when they go out of block scope. Looking at <rocksdb/db.h> I see that is not the case. Yikes! Thank you, yes, you are right: all my RAM/disk problems are almost surely due to the failure to delete iterators. I will remeasure and report findings shortly... done. Wow! All RAM and file usage problems appear to be completely resolved!
It's been a decade since I last coded in a C++ project that required an explicit delete; it never even occurred to me. Based on my 100% newbie error, I humbly suggest that future versions of the C++ bindings to rocks might convert to using smart pointers for iterators. In C++ it's not hard: a few lines of code in the header files. Memory usage of smart pointers is about 3 long words, and CPU overhead is low: under the covers they increment a counter when created and decrement it when the pointer goes out of scope. The only issue is how to make this change in a backwards-compatible way: all the old users will be calling delete, which will confuse the shared-pointer control block when it also goes to delete. Hmm.
@linas How do I set options.max_open_files = 300? I am facing this error: put error: IO error: While open a file for appending: /root/pmem1/db/001021.sst: Too many open files
I would like to try your method to see if it solves my problem. Thanks.
Try ulimit -a at the bash prompt. Also see man getrlimit.
I tried to increase the limit with ulimit -n 16384, but benchmark.sh then keeps running until the disk is full (501GB used) and errors out with: put error: IO error: No space left on deviceWhile appending to file: /data/pmem0/db/001049.sst: No space left on device
@linas Could you guide me how to set options.max_open_files = 300?
The default benchmark.sh appears to work on a data set of 1.5TB. With some compaction styles it can go as high as 3TB. If you are OK with running a smaller size, you can override the number of keys with the environment variable NUM_KEYS. Even our official benchmark now runs with a much smaller number of keys: https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks
@siying Okay, I will try to reduce NUM_KEYS. Do you know how to calculate how many keys the disk can take, given its space?
I want to know whether rocksdb itself closes its iterators or not. I found that memory grows without limit and never decreases. I don't use any iterators in my code; I only use the Get() interface. So I guess rocksdb uses lots of iterators internally for Get() and does not close them.
I have a rocksdb of 400GB size. It uses 40GB RSS and keeps growing, eating up all the Swap. Finally it eats up all memory. The rocksdb version is 5.8.6.
Expected behavior
According to https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB the RocksDB instance should only use about 8GB - I have done calculations with a worst-case scenario.
Index size: 4,61 GB
Bloom Filters: 3.9 GB
Memtable: 10*128MB = 1,28GB
Block Cache: 0.5 GB
Blocks pinned by iterators: 14MB
Sum: Approximately 10,3GB
I also added memory not mentioned on the Wiki, namely:
Memtables pinned by iterators: 1,28GB (estimation: presuming they pin some older memtables)
Compaction (I understand that it could double Index and Bloom): 4.61 + 3.9 GB
Sum: Approximately 9,8GB
All together, taking the worst case scenario, this is 20 GB.
Actual behavior
RocksDB takes way more memory than expected and finally eats up all memory. The memory consumption is far above expectations: Even shortly after start it requires 33GB RSS.
Detail: Swap is filled before buffers/cache. As you can see below, the swap is full (7GB) but there is still lots of data cached (14GB), so I guess RocksDB is pinning data in memory. The memory loss happens while reading from the DB via a prefix iterator (seek, next); when just writing (on average 100MB/s to SSD) we do not lose any memory.
dbdir # free -m
total used free shared buff/cache available
Mem: 64306 49196 293 12 14816 14440
Swap: 7710 7710 0
RocksDB is being used embedded in Java via rocksdbjni. The Java part requires less than 1GB of heap. Memory usage of the process, as taken from top: VIRT 0.450t, RES 0.048t, SHR 0.012t.
I have run pmap on the process, and summed up RSS for the .sst files: The sum for .sst is 34GB.
Steps to reproduce the behavior
We can reproduce the behavior in our application quickly. It never happens if we just write. As soon as we allow clients to read, the memory loss happens. The only operations on this DB are put, prefix-iterator seek() + next(), and deleteRange. One important observation: we use iterators for a short time, then close them to avoid resource locking by the iterator, and create a new iterator. If we choose a shorter time until discarding an iterator, we create more iterators and the memory loss is faster.