facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org

Appearance of files much larger than target_file_size_base option #12956

Closed: S3-GAEULKIM closed this issue 2 weeks ago

S3-GAEULKIM commented 3 weeks ago

Hi to all RocksDB users and the dev group!

I'm raising this issue because I have questions about the target_file_size_base option. As far as I know, this option determines the size of each SST file produced by compaction. During compaction, keys are popped from the merging iterator after the merge sort and written to the output SST file in sorted order; when the output reaches target_file_size_base (=64MB), the file is finished and a new SST file is started. Therefore, I understood that no matter what size the flushed files were, the outputs of L0 to L1 compaction would follow this option.
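For reference, this is a minimal sketch of how I set the relevant options through the C++ API (the path is made up for illustration; the values mirror the db_bench flags further below):

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.write_buffer_size = 64 << 20;       // 64MB memtable, like --write_buffer_size
  options.max_write_buffer_number = 4;        // like --max_write_buffer_number
  options.target_file_size_base = 64 << 20;   // 64MB target for compaction output files
  options.compression = rocksdb::kNoCompression;

  rocksdb::DB* db = nullptr;
  // Illustrative path only; error handling kept minimal.
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/target_file_size_demo", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}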

But today, after running the fillrandom workload, I found something strange while looking at the LOG file. Initially, most L0 files were about 54MB. (I think this is because each was built from a 64MB memtable with duplicates removed during flush; if my thinking is wrong, please comment on that as well.) Reading further, I found that during L0 - L1 compaction there was also an L0 file of 120MB (I think this is due to intra-L0 compaction). However, I was confused when I then encountered files in L1 that were neither 64MB nor smaller than 64MB. Please refer to the screenshot and log below.

Is the actual behavior of target_file_size_base different from what I expected? I would really appreciate an explanation of what is going on.

Expected behavior

With target_file_size_base = 64MB, I expected every file at levels other than L0 to be 64MB or smaller.

Actual behavior

image

[default]: Compaction start summary: Base version 30 Base level 0, inputs: [47(54MB) 46(54MB) 44(54MB) 43(218MB)], [35(49MB) 36(49MB) 38(49MB) 40(49MB)]
2024/08/12-00:51:19.814738 886355 EVENT_LOG_v1 {"time_micros": 1723423879814733, "job": 33, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [47, 46, 44, 43], "files_L1": [35, 36, 38, 40], "score": 1.49364, "input_data_size": 606980450, "oldest_snapshot_seqno": -1}

"job": 33, "event": "table_file_creation", "file_number": 49, "file_size": 111384597 "job": 33, "event": "table_file_creation", "file_number": 52, "file_size": 111122587 "job": 33, "event": "table_file_creation", "file_number": 54, "file_size": 111172625 "job": 33, "event": "table_file_creation", "file_number": 58, "file_size": 111583789

All of the problem files in L1 (the files larger than 64MB) appear to have been created by the same compaction job (job 33).

Steps to reproduce the behavior

./db_bench --benchmarks=fillrandom,stats --db=/work/tmp/fillrandom --num=1000000000 --threads=16 --writes=62500000 --max_background_jobs=16 --key_size=48 --value_size=43 --compression_type=none --use_direct_reads=1 --use_direct_io_for_flush_and_compaction=1 --target_file_size_base=67108864 --bloom_bits=10 --disable_wal=1 --max_write_buffer_number=4 --memtablerep=skip_list --write_buffer_size=67108864 --seed=1722070161

I used this command.
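To double-check the per-level file sizes outside of the LOG, something like the following could be used. This is a minimal sketch using DB::GetLiveFilesMetaData(); it assumes the db_bench directory above and omits most error handling.

#include <cstdio>
#include <vector>

#include <rocksdb/db.h>
#include <rocksdb/metadata.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s =
      rocksdb::DB::OpenForReadOnly(options, "/work/tmp/fillrandom", &db);
  if (!s.ok()) {
    std::fprintf(stderr, "open failed: %s\n", s.ToString().c_str());
    return 1;
  }

  // Print level, file name, and size in MB for every live SST file.
  std::vector<rocksdb::LiveFileMetaData> files;
  db->GetLiveFilesMetaData(&files);
  for (const auto& f : files) {
    std::printf("L%d  %s  %.1f MB\n", f.level, f.name.c_str(),
                f.size / (1024.0 * 1024.0));
  }
  delete db;
  return 0;
}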

Thank you for reading.

cbi42 commented 2 weeks ago

There is some flexibility in compaction output file size: https://github.com/facebook/rocksdb/blob/c62de54c7c398d39dd6009185c284f163514daf3/db/compaction/compaction.cc#L355-L359. The motivation, I believe, is to reduce write amplification by aligning compaction output file boundaries with the boundaries of next-level files: https://github.com/facebook/rocksdb/pull/10655.
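Roughly, the idea looks like this. The sketch below is simplified and hypothetical: the function, the struct, and the 2x cap are invented for illustration and are not the actual code or constants at the link above.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration of "flexible" output-file cutting.
struct NextLevelFile {
  std::string smallest_key;  // first key covered by a file on the next level
};

bool ShouldCutOutputFile(uint64_t current_output_bytes,
                         uint64_t target_file_size,           // e.g. 64MB
                         const std::string& next_output_key,  // key about to be added
                         const std::vector<NextLevelFile>& next_level_files) {
  constexpr double kSlack = 2.0;  // invented cap, NOT RocksDB's constant

  // Always cut once the output exceeds the hard cap.
  if (current_output_bytes >= static_cast<uint64_t>(kSlack * target_file_size)) {
    return true;
  }
  // Below the target size, keep filling the current file.
  if (current_output_bytes < target_file_size) {
    return false;
  }
  // Between target and cap: prefer to cut at a next-level file boundary so
  // this output overlaps fewer files in future compactions (less write-amp).
  for (const auto& f : next_level_files) {
    if (next_output_key == f.smallest_key) {
      return true;
    }
  }
  return false;  // keep going a little longer, hoping to hit a boundary
}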

S3-GAEULKIM commented 2 weeks ago

@cbi42 Thank you for your answer. I will take a look at the code you pointed to.