cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

[Q] Performance degradation in iterators due to tombstones present in L5/L6 #3881

Closed: ever0de closed this issue 2 weeks ago

ever0de commented 3 weeks ago

Environment

Version: v1.1.2
Total Data Size: 5GB
Resources:

Write Pattern: A batch of data sized between 600KB and 800KB is generated approximately every 0.6 seconds.

Currently, the total number of keys in the database is around 36 million. We are experiencing high latency when using an iterator to search for 1,120 specific records, with a total latency of approximately 22.37ms and a seek time of 1.25ms.

When we rewrite all the data into a new Pebble instance, the total latency improves significantly to a range of 494µs to 2ms.

We have observed that the PointCount in the Iterator Stats occasionally spikes to very high levels, correlating with the increased latency. Although the LSM metrics do not indicate an inverted LSM, there appear to be frequent calls to (*mergingIter).findNextEntry().

pebble metrics:

      |                             |       |       |   ingested   |     moved    |    written   |       |    amp
level | tables  size val-bl vtables | score |   in  | tables  size | tables  size | tables  size |  read |   r   w
------+-----------------------------+-------+-------+--------------+--------------+--------------+-------+---------
    0 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    1 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    2 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    3 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    4 |     9   32MB     0B       0 |  0.50 |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
    5 |    70  155MB     0B       0 |  0.29 |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
    6 |   442  4.5GB     0B       0 |     - |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
total |   521  4.7GB     0B       0 |     - |    0B |     0     0B |     0     0B |     0     0B |    0B |   3  0.0
-------------------------------------------------------------------------------------------------------------------
WAL: 1 files (0B)  in: 0B  written: 0B (0% overhead)
Flushes: 0
Compactions: 0  estimated debt: 0B  in progress: 0 (0B)
             default: 0  delete: 0  elision: 0  move: 0  read: 0  rewrite: 0  multi-level: 0
MemTables: 1 (256KB)  zombie: 0 (0B)
Zombie tables: 0 (0B)
Backing tables: 0 (0B)
Virtual tables: 0 (0B)
Block cache: 0 entries (0B)  hit rate: 0.0%
Table cache: 1 entries (808B)  hit rate: 0.0%
Secondary cache: 0 entries (0B)  hit rate: 0.0%
Snapshots: 0  earliest seq num: 0
Table iters: 0
Filter utility: 0.0%
Ingestions: 0  as flushable: 0 (0B in 0 tables)

result:

iter stats:

seeked 1 times (1 internal); stepped 1.1K times (213K internal); blocks: 0B cached, 8.7MB not cached (read time: 4.435423ms); points: 213K (26MB keys, 302KB values)
                 block bytes 9.1 MB
                 block read duration 4.435423ms
                 key bytes 28 MB
                 value bytes 309 kB
                 point count 213040
                 point covered by range tombstones count 0
                 separated {0 0 0}
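
For reference, a minimal sketch (not our exact code) of how these stats can be collected, assuming the v1.1.x API where NewIter returns (*Iterator, error) and Iterator.Stats() returns a pebble.IteratorStats (older releases return only *Iterator from NewIter); the bounds and helper name are illustrative:

```go
package example

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// dumpIterStats is a hypothetical helper: it scans a key range and prints the
// iterator stats, similar to the output pasted above.
func dumpIterStats(db *pebble.DB) error {
	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("prefix/"), // illustrative bounds, not our real keys
		UpperBound: []byte("prefix0"),
	})
	if err != nil {
		return err
	}
	defer iter.Close()

	for valid := iter.First(); valid; valid = iter.Next() {
		_ = iter.Value() // process the record
	}

	stats := iter.Stats()
	// String() renders a summary like the one pasted above.
	fmt.Println(stats.String())
	// InternalStats.PointCount is the counter that spikes when the iterator
	// has to step over long runs of point tombstones.
	fmt.Println("internal points stepped:", stats.InternalStats.PointCount)
	return iter.Error()
}
```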

[screenshot attached]

Could you provide advice on optimizing options to improve overall iterator performance or insights into any potential misconfigurations in our current settings?

Jira issue: PEBBLE-250

ever0de commented 3 weeks ago

Most of the counts arise from skipping entries whose InternalKey.Trailer kind is DEL. Since the data represents an order book where each order has a globally unique key, this write pattern seems to be causing the observed behavior.
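
To make the write pattern concrete, here is a simplified sketch; the key layout, helper name, and argument types are illustrative, not our actual schema. Each order key is written once and deleted once, so every removal leaves a point tombstone behind:

```go
package example

import (
	"encoding/binary"

	"github.com/cockroachdb/pebble"
)

// applyOrderBookUpdates sketches the workload: new orders are Set under
// globally unique keys, and removed orders are Deleted. Each Delete of a
// never-reused key becomes a DEL entry (a point tombstone) in the LSM.
func applyOrderBookUpdates(db *pebble.DB, added map[uint64][]byte, removed []uint64) error {
	b := db.NewBatch()
	defer b.Close()

	key := func(orderID uint64) []byte {
		k := make([]byte, 8)
		binary.BigEndian.PutUint64(k, orderID)
		return k
	}
	for id, encoded := range added {
		if err := b.Set(key(id), encoded, nil); err != nil {
			return err
		}
	}
	for _, id := range removed {
		if err := b.Delete(key(id), nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.Sync)
}
```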

I suspect that the elision-only compaction may not have been functioning properly.


Could this issue be related to the ReadSamplingMultiplier option? When it was set to -1, the situation gradually resolved over time.
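
For reference, a minimal sketch of how we disable read sampling, assuming the v1.1.x Options layout where this knob sits under Options.Experimental (field placement may differ in other versions); the path is hypothetical:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{}
	// A negative multiplier disables read sampling, and with it
	// read-triggered compactions.
	opts.Experimental.ReadSamplingMultiplier = -1

	db, err := pebble.Open("demo-db", opts) // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```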

[screenshot attached]

Based on the current observations, my interpretation is:

  1. Read-triggered compaction frequently occurs for prefixes where data is not often written, though occasional SET/DELETE operations happen.
  2. Elision-only compaction does not occur as frequently, allowing tombstones to persist down to L6.
  3. As a result, the Iterator encounters many tombstones, especially in L6. However, despite frequent encounters in (*Iterator).sampleRead(), read-triggered compaction does not occur, possibly because the numOverlappingLevels does not reach 2. https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/iterator.go#L819-L880
jbowens commented 3 weeks ago

How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.
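
A rough sketch of what that swap could look like, assuming the Batch.DeleteSized(key, deletedValueSize, writeOptions) signature; the helper is hypothetical, the size is only a hint so an estimate is fine, and if I remember right the store's format major version also needs to be new enough to accept sized deletes:

```go
package example

import (
	"github.com/cockroachdb/pebble"
)

// deleteOrderSized replaces Batch.Delete with Batch.DeleteSized so that
// compaction picking can account for the space the tombstone will reclaim.
func deleteOrderSized(db *pebble.DB, key []byte, deletedValueSize uint32) error {
	b := db.NewBatch()
	defer b.Close()

	if err := b.DeleteSized(key, deletedValueSize, nil); err != nil {
		return err
	}
	return b.Commit(pebble.Sync)
}
```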

Unfortunately, in the currently tagged releases, Pebble's compaction heuristics around point tombstones focus only on reducing space amplification. If the space amplification incurred is minimal and the point tombstones sit in different levels than the points they delete, there is no mechanism to prioritize compacting the point tombstones.

As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.

jbowens commented 3 weeks ago

Since you mentioned elision-only compactions, I want to clarify that those compactions only rewrite an sstable in place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).
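
For clarity, this is the kind of snapshot usage I mean; a minimal sketch, assuming the v1.1.x Snapshot API where Get returns a value, a closer, and an error (the helper is hypothetical):

```go
package example

import (
	"github.com/cockroachdb/pebble"
)

// readAtSnapshot reads a key at a fixed point in time. Roughly speaking,
// while the snapshot is open it pins its sequence number, so key versions
// still visible to it (including ones shadowed by newer tombstones) cannot
// be dropped by compactions until the snapshot is closed.
func readAtSnapshot(db *pebble.DB, key []byte) ([]byte, error) {
	snap := db.NewSnapshot()
	defer snap.Close()

	value, closer, err := snap.Get(key)
	if err != nil {
		return nil, err
	}
	defer closer.Close()

	// Copy the value before the closer invalidates the underlying buffer.
	return append([]byte(nil), value...), nil
}
```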

ever0de commented 3 weeks ago

Thank you so much for your quick response!

How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.

The size will be very small (probably just a few tens of bytes). Since we already know the size of the value, we'll try using DeleteSized.

As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.

Are there any significant feature differences between version v1.1.2 and the master branch (currently at commit 99abcf76cc2171d1bcf5eb076f897535c105e053)?

They look very promising; we'll review the changes and consider adopting them.

Since you mentioned elision-only compactions, I want to clarify that those compactions only re-write a sstable in-place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).

Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.

[screenshot attached]

I raised this issue after reviewing the (*compactionPickerByScore).pickAuto(..) code (https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382). Could this behavior actually be caused by move compactions?

Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?

jbowens commented 3 weeks ago

Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?

Yeah, I think it's possible that read-triggered compactions exacerbate the problem by compacting from L5 into L6, reducing the compaction-picking score for L5. If L5 is too small relative to L6 (e.g., because of these read-triggered compactions), Pebble won't even consider picking an L5->L6 compaction until L5 is large enough again. In the meantime, a large volume of point tombstones can collect in L5.

Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.

Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest and do those sstables contain tombstones (either point or range)?

ever0de commented 3 weeks ago

Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest and do those sstables contain tombstones (either point or range)?

No, there isn't anything else. We only perform Set/Delete operations using Batch. When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382?

EDIT: Ah, I see. I missed that. It seems the issue could indeed be related to snapshots, as shown in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1569-L1572

jbowens commented 3 weeks ago

When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code?

Ah, I wasn't thinking clearly. I had assumed that even the move compaction case required open snapshots, but that's not true. Move compactions can move tombstones into L6, and then we'll schedule an elision-only compaction to clear them out.

ever0de commented 2 weeks ago

It seems I’m experiencing some confusion.

  1. Move compaction can indeed move tombstones to L6, as it only changes the level without altering the data itself. -> :ok:
  2. In pickElisionOnlyCompaction, fileMetadata.LargestSeqNum can only be greater than or equal to env.earliestSnapshotSeqNum if there is an existing snapshot. -> :question:
    • Can pickElisionOnlyCompaction be triggered as a result of a move compaction?

Could you help clarify point 2?

Regarding https://github.com/cockroachdb/pebble/issues/3881#issuecomment-2305105014: I cannot figure out how this could have passed the if condition from point 2 in the code.

ever0de commented 2 weeks ago

Here are the metrics after applying the patch from commit 3d14906a0e0c (using default options). In practice, there was a noticeable increase in the TombstoneDensity compaction count, and the latency appears stable.

[screenshots attached]

Details: [screenshot attached]

ever0de commented 2 weeks ago

Thank you for your response! I believe this issue has been resolved, so I am closing it.