facebook / rocksdb

A library that provides an embeddable, persistent key-value store for fast storage.
http://rocksdb.org
GNU General Public License v2.0

iterator `Seek` slowdown after delete many keys #10300

Open Wine93 opened 2 years ago

Wine93 commented 2 years ago


Hi, guys~ I found that `Iterator->Seek()` becomes very slow after deleting many keys. I found a related issue #5265, but I am not sure my problem is the same as that one, so can you give me some advice? And if my problem is indeed the same as issue #5265, how can we solve it completely?

```
Iterator->Seek("/prefix/")  // costs about 20 us

for i = 1, 100000, 1 do     // delete 100000 keys
    DB->Delete("/prefix/" + str(i))
end

Iterator->Seek("/prefix/")  // costs about 13 ms
```


* perf context for `Seek` (cost: 12762 us):

```
user_key_comparison_count = 420038 block_cache_hit_count = 1 block_read_count = 0 block_read_byte = 0 block_read_time = 0 block_cache_index_hit_count = 0 index_block_read_count = 0 block_cache_filter_hit_count = 0 filter_block_read_count = 0 compression_dict_block_read_count = 0 secondary_cache_hit_count = 0 block_checksum_time = 0 block_decompress_time = 0 get_read_bytes = 0 multiget_read_bytes = 0 iter_read_bytes = 88 internal_key_skipped_count = 180000 internal_delete_skipped_count = 30000 internal_recent_skipped_count = 0 internal_merge_count = 0 write_wal_time = 0 get_snapshot_time = 0 get_from_memtable_time = 0 get_from_memtable_count = 0 get_post_process_time = 0 get_from_output_files_time = 0 seek_on_memtable_time = 1716 seek_on_memtable_count = 2 next_on_memtable_count = 210000 prev_on_memtable_count = 0 seek_child_seek_time = 8258 seek_child_seek_count = 3 seek_min_heap_time = 512 seek_internal_seek_time = 9508 find_next_user_entry_time = 12749803 write_pre_and_post_process_time = 0 write_memtable_time = 0 write_thread_wait_nanos = 0 write_scheduling_flushes_compactions_time = 0 db_mutex_lock_nanos = 0 db_condition_wait_nanos = 0 merge_operator_time_nanos = 0 write_delay_time = 0 read_index_block_nanos = 0 read_filter_block_nanos = 0 new_table_block_iter_nanos = 2013 new_table_iterator_nanos = 0 block_seek_nanos = 2651 find_table_nanos = 0 bloom_memtable_hit_count = 1 bloom_memtable_miss_count = 1 bloom_sst_hit_count = 1 bloom_sst_miss_count = 0 key_lock_wait_time = 0 key_lock_wait_count = 0 env_new_sequential_file_nanos = 0 env_new_random_access_file_nanos = 0 env_new_writable_file_nanos = 0 env_reuse_writable_file_nanos = 0 env_new_random_rw_file_nanos = 0 env_new_directory_nanos = 0 env_file_exists_nanos = 0 env_get_children_nanos = 0 env_get_children_file_attributes_nanos = 0 env_delete_file_nanos = 0 env_create_dir_nanos = 0 env_create_dir_if_missing_nanos = 0 env_delete_dir_nanos = 0 env_get_file_size_nanos = 0
env_get_file_modification_time_nanos = 0 env_rename_file_nanos = 0 env_link_file_nanos = 0 env_lock_file_nanos = 0 env_unlock_file_nanos = 0 env_new_logger_nanos = 0 get_cpu_nanos = 0 iter_next_cpu_nanos = 0 iter_prev_cpu_nanos = 0 iter_seek_cpu_nanos = 12756520
```


* iostat context:

```
thread_pool_id = 4 bytes_read = 0 bytes_written = 0 open_nanos = 0 allocate_nanos = 0 write_nanos = 0 read_nanos = 0 range_sync_nanos = 0 fsync_nanos = 0 prepare_write_nanos = 0 logger_nanos = 0 cpu_write_nanos = 0 cpu_read_nanos = 0 file_io_stats_by_temperature.hot_file_bytes_read = 0 file_io_stats_by_temperature.warm_file_bytes_read = 0 file_io_stats_by_temperature.cold_file_bytes_read = 0 file_io_stats_by_temperature.hot_file_read_count = 0 file_io_stats_by_temperature.warm_file_read_count = 0 file_io_stats_by_temperature.cold_file_read_count = 0
```
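For reference, counters like the ones above come from RocksDB's thread-local perf context and IO stats context. A sketch of how they can be captured around a single `Seek()` follows; it needs the RocksDB headers and an open iterator `it` (assumed), so it is shown as an uncompiled fragment:

```cpp
#include "rocksdb/iostats_context.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

// Enable per-operation timing, reset the thread-local counters,
// run the operation, then dump the counters as strings.
rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
rocksdb::get_perf_context()->Reset();
rocksdb::get_iostats_context()->Reset();

it->Seek("/prefix/");  // 'it' is an open rocksdb::Iterator* (assumed)

std::string perf_dump = rocksdb::get_perf_context()->ToString();
std::string io_dump = rocksdb::get_iostats_context()->ToString();
rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
```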



### Expected behavior
`Iterator->Seek()` is fast

### Actual behavior
`Iterator->Seek()` is slow

### Steps to reproduce the behavior
`none`
cbi42 commented 2 years ago

Hi, I think this is sort of expected, as the Seek() call goes through all the tombstones in the memtable to find the first live entry for the iterator. As the perf context shows, most of the time is spent in find_next_user_entry_time. Trying DeleteRange to combine all the deletions, or CompactRange to do a full compaction and remove all the tombstones, might help.
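For concreteness, the two suggestions map onto the following RocksDB calls. This is an illustrative fragment (it needs the RocksDB headers and an open `rocksdb::DB*`, so it is not compiled here); the end key `"/prefix0"` is an assumption chosen as the smallest key that sorts after every `"/prefix/..."` key:

```cpp
#include "rocksdb/db.h"

// 'db' is an open rocksdb::DB* (assumed).
void DropPrefix(rocksdb::DB* db) {
  // Option 1: write one range tombstone instead of 100000 point tombstones.
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                      db->DefaultColumnFamily(),
                                      "/prefix/", "/prefix0");

  // Option 2: manually compact the affected range so that existing
  // point tombstones are physically dropped.
  rocksdb::Slice begin("/prefix/");
  rocksdb::Slice end("/prefix0");
  rocksdb::CompactRangeOptions cro;
  s = db->CompactRange(cro, &begin, &end);
}
```

A single range tombstone is skipped in one step during a seek, whereas 100000 point tombstones are skipped one by one.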

caipengbo commented 2 years ago

Yes, that's a problem. We also encounter such problems in our production environment.

Wine93 commented 2 years ago

> Hi, I think this is sort of expected as the Seek() call goes through all tombstones in the memtable to try to find the first entry for the iterator. As from perf context, most of the time is spent on find_next_user_entry_time. I think try DeleteRange to combine all the deletions or CompactRange to do a full compaction and remove all the tombstones might help.

@cbi42 Thank you for your reply, but we can't use DeleteRange or SingleDelete in our application, so we are trying to tune RocksDB to trigger compaction of deleted entries (point tombstones) as soon as possible, which can reduce the range-scan performance skew in SST files:

```cpp
options.level_compaction_dynamic_level_bytes = true;  // level-0 SSTs are compacted directly into the bottom level
options.level0_file_num_compaction_trigger = 1;       // number of files that triggers a level-0 compaction
options.max_bytes_for_level_base = 1073741824;        // bottom-level total data size (1 GB)
options.table_properties_collector_factories.emplace_back(
    rocksdb::NewCompactOnDeletionCollectorFactory(10000, 1000));
```

The techniques above all target SSTs, but for memtables with many point tombstones that have not yet been flushed to level-0, how can we speed up the prefix scan?

@caipengbo Do you have any tuning experience or suggestions for this? :)

BTW: the effect of the above tuning may be the same as a manual compaction (CompactRange).

caipengbo commented 2 years ago

> but we can't use DeleteRange or SingleDelete in our application

What's the reason why you can't use DeleteRange?

Wine93 commented 2 years ago

> but we can't use DeleteRange or SingleDelete in our application
>
> What's the reason why you can't use DeleteRange?

Actually, our application receives each delete request from a user, so it's not controlled by us. In some scenarios, a user will delete all the keys under one prefix one by one, and the order of deletion is not consecutive, so we can't convert the point deletes into a range delete.

Wine93 commented 2 years ago

> Actually, our application received a delete request every times which invoked by user, it's not controled by us. In some scenarios, user will delete all keys in one prefix one by one, also the order of deletion is not consecutive, so we can't convert the point delete to range delete.

> So why not add a DELETE tag? Converts a user delete to a normal write.

I think the prefix scan would still be slow even if we added a DELETE tag to each deleted key, because during the scan we would have to check every key for the tag and skip it when present, which is the same as what RocksDB does now, so it would still cost a lot of time to find the first live key.

caipengbo commented 2 years ago

> So why not add a DELETE tag? Converts a user delete to a normal write.

My bad, I forgot something.

AdjoiningCat commented 1 year ago

Is there any progress? We also encounter this problem.