mdcallag opened this issue 2 years ago
To reproduce this I start with this patch:
diff --git a/db/db_impl/db_impl_files.cc b/db/db_impl/db_impl_files.cc
index da87aa5cd..e011dad50 100644
--- a/db/db_impl/db_impl_files.cc
+++ b/db/db_impl/db_impl_files.cc
@@ -526,6 +526,10 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
number >= state.min_pending_output;
if (!keep) {
files_to_del.insert(number);
+ } else {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme live %d c1=%d c2=%d", (int) number,
+ sst_live_set.find(number) != sst_live_set.end(),
+ number >= state.min_pending_output);
}
break;
case kBlobFile:
@@ -602,9 +606,15 @@ void DBImpl::PurgeObsoleteFiles(JobContext& state, bool schedule_only) {
// = -1, then no need to delete or schedule delete these files since they
// will be removed by their owner, e.g. the primary instance.
if (!own_files) {
+ if (type == kTableFile) {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme notown %d", (int) number);
+ }
continue;
}
if (schedule_only) {
+ if (type == kTableFile) {
+ ROCKS_LOG_INFO(immutable_db_options_.info_log, "Keepme scheduleonly %d", (int) number);
+ }
InstrumentedMutexLock guard_lock(&mutex_);
SchedulePendingPurge(fname, dir_to_sync, type, number, state.job_id);
} else {
diff --git a/db/version_set.cc b/db/version_set.cc
index 9d37db84f..af68f8b06 100644
--- a/db/version_set.cc
+++ b/db/version_set.cc
@@ -6109,6 +6109,7 @@ void VersionSet::GetObsoleteFiles(std::vector<ObsoleteFileInfo>* files,
}
}
obsolete_files_.swap(pending_files);
+ ROCKS_LOG_INFO(db_options_->info_log, "Keepme getobsolete %" PRIu64 " min_pending, %d of %d moved\n", min_pending_output, (int) files->size(), (int) (files->size() + obsolete_files_.size()));
std::vector<ObsoleteBlobFileInfo> pending_blob_files;
for (auto& blob_file : obsolete_blob_files_) {
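The condition that the patch above logs can be sketched as follows. This is a simplified, illustrative version (the function name is mine, not RocksDB's): a candidate table file survives PurgeObsoleteFiles if it is still referenced by a live Version, or if its number is at or above min_pending_output, meaning it might be an in-progress output of a running flush or compaction.

```cpp
#include <cstdint>
#include <unordered_set>

// Sketch of the "keep" decision logged by the patch above (illustrative,
// not RocksDB's exact internals). c1 in the log is the live-set check,
// c2 is the min_pending_output check.
bool KeepFile(uint64_t number,
              const std::unordered_set<uint64_t>& sst_live_set,
              uint64_t min_pending_output) {
  return sst_live_set.count(number) != 0 || number >= min_pending_output;
}
```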
And then I run db_bench like this. The problems occur during overwrite.
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --universal_compression_size_percent=-1 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1660684521 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_fillseq.wal_disabled.v400.log.r.csv
numactl --interleave=all ./db_bench --benchmarks=overwrite --use_existing_db=1 --sync=0 --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --undefok=use_blob_cache,use_shared_block_and_blob_cache,blob_cache_size,blob_cache_numshardbits,prepopulate_block_cache --db=/data/m/rx --wal_dir=/data/m/rx --num=4000000000 --key_size=20 --value_size=400 --block_size=8192 --cache_size=225485783040 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --use_direct_reads --use_direct_io_for_flush_and_compaction --prepopulate_block_cache=1 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --report_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=4 --compaction_style=1 --num_levels=40 --universal_compression_size_percent=80 --pin_l0_filter_and_index_blocks_in_cache=1 --universal_min_merge_width=2 --universal_max_merge_width=20 --universal_size_ratio=1 --universal_max_size_amplification_percent=100 --universal_allow_trivial_move=1 --duration=7200 --threads=32 --merge_operator="put" --seed=1660689764 --report_file=bm.uc.nt32.cm1.d1.sc4.tm/keep/benchmark_overwrite.t32.s0.log.r.csv
Then I grep LOG:
grep -E 'table_file_del|compaction_finished|Keepme' /data/m/rx/LOG
In these examples, overwrite starts at 15:42:46. From the previous comment, the database directory reaches ~7TB between 16:26:04 and 16:30:07. There is a table_file_deletion event at 15:43:26:
2022/08/16-15:43:26.280134 1898167 EVENT_LOG_v1 {"time_micros": 1660689806280130, "job": 1104, "event": "table_file_deletion", "file_number": 67046}
But then there are no more until 16:29:50, which explains why the database directory reaches ~7TB:
2022/08/16-16:29:50.371093 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371087, "job": 821, "event": "table_file_deletion", "file_number": 479984}
2022/08/16-16:29:50.371173 1898164 EVENT_LOG_v1 {"time_micros": 1660692590371171, "job": 821, "event": "table_file_deletion", "file_number": 479983}
From the logging I added to GetObsoleteFiles, the number of files in obsolete_files_ grows past 319,000. None are moved to files (meaning none can be deleted) because min_pending_output=67303 from 15:43:18 until 16:29:49:
2022/08/16-15:43:18.560721 1898169 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 55 moved
...
2022/08/16-16:29:49.554543 1898170 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.627389 1898172 [/version_set.cc:6112] Keepme getobsolete 67303 min_pending, 0 of 319178 moved
2022/08/16-16:29:49.802757 1898164 [/version_set.cc:6112] Keepme getobsolete 480013 min_pending, 317006 of 319190 moved
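The behavior those log lines show can be sketched as follows (a simplified model with illustrative names, not RocksDB's exact code): GetObsoleteFiles hands back for deletion only the obsolete files numbered below min_pending_output; everything else stays parked in obsolete_files_. With min_pending_output stuck at 67303, every newer obsolete file accumulates on disk.

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch of the min_pending_output filtering in
// GetObsoleteFiles: files below the cutoff become deletable, the rest
// are swapped back into the pending list (mirroring the
// obsolete_files_.swap(pending_files) seen in the patch above).
std::vector<uint64_t> MoveDeletable(std::vector<uint64_t>& obsolete_files,
                                    uint64_t min_pending_output) {
  std::vector<uint64_t> deletable, pending;
  for (uint64_t number : obsolete_files) {
    if (number < min_pending_output) {
      deletable.push_back(number);
    } else {
      pending.push_back(number);
    }
  }
  obsolete_files.swap(pending);
  return deletable;
}
```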
I assume this is the long-running compaction that blocked things (kept min_pending_output at 67303), because compaction_time_micros ~= 2722 seconds and total_output_size ~= 867GB:
2022/08/16-16:29:50.002195 1898164 (Original Log Time 2022/08/16-16:29:49.265941) EVENT_LOG_v1 {"time_micros": 1660692589249330, "job": 821, "event": "compaction_finished", "compaction_time_micros": 2721624360, "compaction_time_cpu_micros": 9388960476, "output_level": 39, "num_output_files": 55055, "total_output_size": 931328757382, "num_input_records": 4000032144, "num_output_records": 4000000000, "num_subcompactions": 4, "output_compression": "LZ4", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 80, 192, 443, 2616, 15301, 0, 0, 0, 55055]}
Grepping the compaction IO stats output during overwrite, the Sum row shows the LSM size never exceeds 1.23TB. Assuming a worst case of 2X transient space amplification (all files compacted concurrently, so inputs and outputs coexist on disk), the database directory should not exceed ~2.46TB.
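The arithmetic behind that ~2.46TB bound can be written out explicitly (the 1.23TB LSM size and the 2X worst-case transient amplification come from the text; the function name is illustrative):

```cpp
#include <cmath>

// Back-of-envelope bound: live LSM size times the worst-case transient
// space amplification. 1.23TB * 2X = 2.46TB, so the observed ~7TB
// directory is far beyond what transient compaction space can explain.
double WorstCaseDirSizeTB(double lsm_sum_tb, double transient_amp) {
  return lsm_sum_tb * transient_amp;
}
```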
For reference, this is the call stack that deletes an SST file after a compaction ends. However, reaching it depends on the call to FindObsoleteFiles finding files that can be deleted, and FindObsoleteFiles depends on GetObsoleteFiles finding those files, which does not happen as explained above. MinObsoleteSstNumberToKeep is also relevant.
#0 0x00000000006a029d in rocksdb::DeleteScheduler::DeleteFile (this=this@entry=0x7ffff76d8440, file_path=..., dir_to_sync=..., force_bg=force_bg@entry=false) at file/delete_scheduler.cc:63
#1 0x00000000006aba6d in rocksdb::SstFileManagerImpl::ScheduleFileDeletion (this=0x7ffff76d8380, file_path=..., path_to_sync=..., force_bg=<optimized out>) at file/sst_file_manager_impl.cc:423
#2 0x00000000006a2209 in rocksdb::DeleteDBFile (db_options=db_options@entry=0x7ffff76e9330, fname=..., dir_to_sync=..., force_bg=force_bg@entry=false, force_fg=<optimized out>) at file/file_util.cc:123
#3 0x00000000005435ad in rocksdb::DBImpl::DeleteObsoleteFileImpl (this=0x7ffff76e9000, job_id=6, fname=..., path_to_sync=..., type=rocksdb::kTableFile, number=15) at db/db_impl/db_impl_files.cc:357
#4 0x0000000000549385 in rocksdb::DBImpl::PurgeObsoleteFiles (this=this@entry=0x7ffff76e9000, state=..., schedule_only=schedule_only@entry=false) at db/db_impl/db_impl_files.cc:608
#5 0x000000000053797e in rocksdb::DBImpl::BackgroundCallCompaction (this=0x7ffff76e9000, prepicked_compaction=0x0, bg_thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2984
#6 0x000000000053812e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2670
#7 0x00000000007c13a8 in std::function<void ()>::operator()() const (this=0x7ffff69fd010) at /mnt/gvfs/third-party2/libgcc/4959b39cfbe5965a37c861c4c327fa7c5c759b87/9.x/platform009/9202ce7/include/c++/9.3.0/bits/std_function.h:688
#8 rocksdb::ThreadPoolImpl::Impl::BGThread (this=0x7ffff7630200, thread_id=0) at util/threadpool_imp.cc:319
#9 0x00000000007c1743 in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x7ffff7667f70) at util/threadpool_imp.cc:360
#10 0x00007ffff7ea9661 in std::execute_native_thread_routine (__p=0x7ffff76b2880) at ../../../.././libstdc++-v3/src/c++11/thread.cc:80
#11 0x00007ffff7db720c in start_thread (arg=0x7ffff69fe700) at pthread_create.c:479
#12 0x00007ffff7b7616f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
On the bright side, this is explained in the wiki.
To prevent it from happening, we take advantage of the fact that new files are created with names based on an incrementing number. Before each flush or compaction job runs, we remember the number of the latest SST file created at that time. If a full garbage collection runs before the job finishes, all SST files with a number larger than that will be kept. The condition is released after the job finishes. Since multiple flush and compaction jobs can run in parallel, all SST files with a number larger than the number remembered by the earliest ongoing job will be kept. It is possible that we have some false positives, but they will be cleared up eventually.
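A minimal sketch of that mechanism (illustrative class and method names, not RocksDB's actual implementation): each job captures the next file number before it starts, and while any job is in flight, min_pending_output is the smallest captured number. A single long-running job therefore pins min_pending_output for its entire lifetime, which is exactly what the 45-minute compaction above does.

```cpp
#include <cstdint>
#include <limits>
#include <set>

// Illustrative model of pending-output tracking: jobs capture a file
// number on start and release it on finish; the minimum over all
// in-flight jobs is the deletion cutoff. When no job is running, the
// cutoff is effectively infinite and everything obsolete can go.
class PendingOutputs {
 public:
  std::multiset<uint64_t>::iterator Capture(uint64_t next_file_number) {
    return pending_.insert(next_file_number);
  }
  void Release(std::multiset<uint64_t>::iterator it) { pending_.erase(it); }
  uint64_t MinPendingOutput() const {
    return pending_.empty() ? std::numeric_limits<uint64_t>::max()
                            : *pending_.begin();
  }

 private:
  std::multiset<uint64_t> pending_;
};
```

In this model the issue is clear: a job that captured 67303 keeps MinPendingOutput at 67303 no matter how many later jobs start and finish around it.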
These are compaction IO stats from the end of overwrite for tests that use buffered IO and then O_DIRECT for compaction. The results are from RocksDB version 6.20.
First, from a server in the Google cloud with 30 CPUs (hyperthreading disabled), 22 client threads, 64GB of RAM, and 3TB of EBS "persistent SSD" storage, with max_background_jobs=8. I reduced the size of the database by half (loading 2B KV pairs) to avoid the out-of-space error. The results are here.
From the buffered IO result:
From the O_DIRECT result:
From a faster server (256GB RAM, 80 HW threads from 40 CPUs with hyperthreading enabled, and fast direct-attached NVMe SSD), the results are here.
Looking at the stats for L38 with buffered IO:
And with O_DIRECT:
When running tools/benchmark.sh for universal compaction with a database much larger than memory, I get unexpected out-of-disk-space errors. The interesting details are:
The problem is that SST files are not deleted because of the min_pending_output check in GetObsoleteFiles. But I will save that for the next comment on this issue.
Compaction IO stats (the Sum row) show the current database is ~1TB, but the database directory holds up to ~7TB of SST files; the size grows and shrinks.
Partial output from a script that monitors what is in the database directory. The script does this every 3 seconds:
The output below shows the database directory reaches ~7TB from SST files.