Most of the counts arise from skipping entries where the InternalKey.Trailer type is set to DEL. Since the data represents an orderbook where each order has a globally unique key, it seems this is causing the observed behavior.
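To make the access pattern concrete, here is a minimal sketch of what the writes look like (the key layout and payload are illustrative, not our actual schema): each order is Set once under a globally unique key and Deleted once when it completes, so every completed order leaves a point tombstone behind until compaction drops it.

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("orderbook-demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Every order gets a globally unique key: it is Set once when the order
	// is placed and Deleted once when it is filled/cancelled. Each Delete
	// leaves a point tombstone (kind DEL) behind until compaction drops it.
	b := db.NewBatch()
	_ = b.Set([]byte("order/0000000123"), []byte("payload"), nil) // new order
	_ = b.Delete([]byte("order/0000000099"), nil)                 // completed order
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}
}
```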
I suspect that the elision-only compaction may not have been functioning properly.
Could this issue be related to the ReadSamplingMultiplier option? When it was set to -1, the situation gradually resolved over time.
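For reference, this is roughly how we toggled that option; a minimal sketch assuming the read_sampling_multiplier in our OPTIONS dump corresponds to pebble's Options.Experimental.ReadSamplingMultiplier field, where a negative value disables read-triggered compactions:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{}
	// A negative multiplier disables read sampling, and therefore
	// read-triggered compactions (our OPTIONS dump shows 16, the default).
	opts.Experimental.ReadSamplingMultiplier = -1

	db, err := pebble.Open("demo", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```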
Based on the current observations, in my opinion read-triggered compaction does not occur in (*Iterator).sampleRead(), possibly because numOverlappingLevels does not reach 2.
https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/iterator.go#L819-L880
How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.
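For illustration, a minimal sketch of what that could look like; the key and the size hint below are hypothetical:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	b := db.NewBatch()
	// Instead of b.Delete(key, nil), pass the (approximate) size of the
	// value being deleted so compaction picking can weigh the tombstone.
	const deletedValueSize = 40 // hypothetical: a few tens of bytes per order
	if err := b.DeleteSized([]byte("order/0000000099"), deletedValueSize, nil); err != nil {
		log.Fatal(err)
	}
	if err := b.Commit(pebble.Sync); err != nil {
		log.Fatal(err)
	}
}
```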
Unfortunately in currently tagged releases, Pebble's compaction heuristics around point tombstones are only focused on reducing space amplification. If space amplification incurred is minimal and the point tombstones are in different levels than the points they delete, there's no mechanism to prioritize the compaction of the point tombstones.
As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.
Since you mentioned elision-only compactions, I want to clarify that those compactions only re-write a sstable in-place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).
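For clarity, a minimal sketch of the snapshot API referenced above; while a snapshot obtained from DB.NewSnapshot is open, compactions cannot drop data that the snapshot can still observe (the key below is hypothetical):

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// An open snapshot pins the history it can see: keys deleted after the
	// snapshot was taken cannot be fully dropped until the snapshot is closed.
	snap := db.NewSnapshot()
	defer snap.Close()

	value, closer, err := snap.Get([]byte("order/0000000123"))
	if err != nil && err != pebble.ErrNotFound {
		log.Fatal(err)
	}
	if closer != nil {
		_ = value
		closer.Close()
	}
}
```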
Thank you so much for your quick response!
How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.
The size will be very small (probably just a few tens of bytes). Since we already know the size of the value, we'll try using DeleteSized.
As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.
Are there any significant feature differences between version v1.1.2 and the master branch (currently at commit 99abcf76cc2171d1bcf5eb076f897535c105e053)?
They look very promising; we'll review the features and consider adopting them.
Since you mentioned elision-only compactions, I want to clarify that those compactions only re-write a sstable in-place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).
Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.
https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382
Although I raised the issue after reviewing the (*compactionPickerByScore).pickAuto(..) code, could this behavior actually be caused by move compactions?
Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?
Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?
Yeah, I think it's possible that read-triggered compactions exacerbate the problem by compacting from L5 into L6, reducing the compaction-picking score for L5. If L5 is too small relative to L6 (eg, because of these read-triggered compactions), Pebble won't even consider picking a L5->L6 compaction until it's large enough again. In the meantime, a large volume of point tombstones can collect in L5.
Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.
Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest, and do those sstables contain tombstones (either point or range)?
Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest and do those sstables contain tombstones (either point or range)?
No, there isn't anything else. We only perform Set/Delete operations using Batch. When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382?
EDIT: Ah, I see. I missed that. It seems that the issue could indeed be related to the snapshot, as mentioned in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1569-L1572
When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code?
Ah, I wasn't thinking. I was thinking that even the move compaction case required open snapshots, but that's not true. Move compactions can move tombstones into L6, and then we'll schedule an elision-only compaction to clear them out.
It seems I'm experiencing some confusion.
1. In pickElisionOnlyCompaction, fileMetadata.LargestSeqNum can only be greater than or equal to env.earliestSnapshotSeqNum if there is an existing snapshot. -> :question:
2. Could pickElisionOnlyCompaction be triggered as a result of a move compaction?
Could you help with point 2? https://github.com/cockroachdb/pebble/issues/3881#issuecomment-2305105014 I cannot figure out how this could have passed the two if conditions in the code.
Here are the metrics after applying the patch from commit 3d14906a0e0c (using default options). In practice, there was a noticeable increase in the TombstoneDensity compaction count, and the latency appears stable.
Thank you for your response! I believe this issue has been resolved, so I am closing it.
Environment
Version: v1.1.2
Total Data Size: 5GB
Resources:
memory: 10GiB
Options
[Version] pebble_version=0.1
[Options] bytes_per_sync=524288 cache_size=1073741824 cleaner=delete compaction_debt_concurrency=1073741824 comparer=leveldb.BytewiseComparator disable_wal=false flush_delay_delete_range=10s flush_delay_range_key=0s flush_split_bytes=2097152 format_major_version=16 l0_compaction_concurrency=10 l0_compaction_file_threshold=500 l0_compaction_threshold=2 l0_stop_writes_threshold=1000 lbase_max_bytes=67108864 max_concurrent_compactions=3 max_manifest_file_size=134217728 max_open_files=16384 mem_table_size=67108864 mem_table_stop_writes_threshold=4 min_deletion_rate=134217728 merger=pebble.concatenate read_compaction_rate=16000 read_sampling_multiplier=16 strict_wal_tail=true table_cache_shards=12 table_property_collectors=[] validate_on_ingest=false wal_dir= wal_bytes_per_sync=0 max_writer_concurrency=0 force_writer_parallelism=false secondary_cache_size_bytes=0 create_on_shared=0
[Level "0"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=2097152
[Level "1"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=4194304
[Level "2"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=8388608
[Level "3"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=16777216
[Level "4"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=33554432
[Level "5"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=rocksdb.BuiltinBloomFilter filter_type=table index_block_size=262144 target_file_size=67108864
[Level "6"] block_restart_interval=16 block_size=32768 block_size_threshold=90 compression=Snappy filter_policy=none filter_type=table index_block_size=262144 target_file_size=134217728
Write Pattern: A batch of data sized between 600KB and 800KB is generated approximately every 0.6 seconds, written via Batch.{Set/Delete}.
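For reference, a minimal sketch of how a subset of these options could be expressed through pebble.Options in Go; the field names are assumed from pebble v1.x and this is not our exact initialization code:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/bloom"
)

func main() {
	cache := pebble.NewCache(1 << 30) // cache_size=1073741824
	defer cache.Unref()

	opts := &pebble.Options{
		BytesPerSync:                524288,
		Cache:                       cache,
		L0CompactionThreshold:       2,
		L0StopWritesThreshold:       1000,
		LBaseMaxBytes:               64 << 20, // lbase_max_bytes=67108864
		MemTableSize:                64 << 20, // mem_table_size=67108864
		MemTableStopWritesThreshold: 4,
		Levels: []pebble.LevelOptions{
			// L0 shown here; the remaining levels double target_file_size,
			// as in the OPTIONS dump above.
			{TargetFileSize: 2 << 20, FilterPolicy: bloom.FilterPolicy(10)},
		},
	}

	db, err := pebble.Open("demo", opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```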
Currently, the total number of keys in the database is around 36 million. We are experiencing high latency when using an iterator to search for 1,120 specific records, with a total latency of approximately 22.37ms and a seek time of 1.25ms.
We have observed that the PointCount in the Iterator Stats occasionally spikes to very high levels, correlating with the increased latency. Despite the LSM tree metrics not indicating an inverted LSM state, there seem to be frequent occurrences of (*mergingIter).findNextEntry().
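Our read path looks roughly like the following sketch (the key bounds are hypothetical, and recent pebble versions return an error from NewIter):

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	db, err := pebble.Open("demo", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scan a small key range and dump the iterator stats afterwards.
	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("order/"),
		UpperBound: []byte("order0"), // hypothetical bounds
	})
	if err != nil {
		log.Fatal(err)
	}
	defer iter.Close()

	for iter.First(); iter.Valid(); iter.Next() {
		_ = iter.Key()
	}
	// The PointCount reported inside these stats is the counter that
	// spikes for us when many DEL entries have to be skipped.
	fmt.Printf("%+v\n", iter.Stats())
}
```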
pebble metrics:
result:
iter stats:
Could you provide advice on optimizing options to improve overall iterator performance or insights into any potential misconfigurations in our current settings?
Jira issue: PEBBLE-250