cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go

[Q] Performance degradation in iterators due to tombstones present in L5/L6 #3881

Closed: ever0de closed this issue 2 weeks ago

ever0de commented 3 weeks ago

Environment

Version: v1.1.2
Total Data Size: 5GB
Resources:

Write Pattern: A batch of data sized between 600KB and 800KB is generated approximately every 0.6 seconds.

Currently, the total number of keys in the database is around 36 million. We are experiencing high latency when using an iterator to search for 1,120 specific records, with a total latency of approximately 22.37ms and a seek time of 1.25ms.

When we rewrite all the data into a new Pebble instance, the total latency improves significantly to a range of 494µs to 2ms.

We have observed that the PointCount in the Iterator Stats occasionally spikes to very high levels, correlating with the increased latency. Although the LSM metrics do not indicate an inverted LSM, there appear to be frequent calls to (*mergingIter).findNextEntry().

pebble metrics:

      |                             |       |       |   ingested   |     moved    |    written   |       |    amp
level | tables  size val-bl vtables | score |   in  | tables  size | tables  size | tables  size |  read |   r   w
------+-----------------------------+-------+-------+--------------+--------------+--------------+-------+---------
    0 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    1 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    2 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    3 |     0     0B     0B       0 |  0.00 |    0B |     0     0B |     0     0B |     0     0B |    0B |   0  0.0
    4 |     9   32MB     0B       0 |  0.50 |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
    5 |    70  155MB     0B       0 |  0.29 |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
    6 |   442  4.5GB     0B       0 |     - |    0B |     0     0B |     0     0B |     0     0B |    0B |   1  0.0
total |   521  4.7GB     0B       0 |     - |    0B |     0     0B |     0     0B |     0     0B |    0B |   3  0.0
-------------------------------------------------------------------------------------------------------------------
WAL: 1 files (0B)  in: 0B  written: 0B (0% overhead)
Flushes: 0
Compactions: 0  estimated debt: 0B  in progress: 0 (0B)
             default: 0  delete: 0  elision: 0  move: 0  read: 0  rewrite: 0  multi-level: 0
MemTables: 1 (256KB)  zombie: 0 (0B)
Zombie tables: 0 (0B)
Backing tables: 0 (0B)
Virtual tables: 0 (0B)
Block cache: 0 entries (0B)  hit rate: 0.0%
Table cache: 1 entries (808B)  hit rate: 0.0%
Secondary cache: 0 entries (0B)  hit rate: 0.0%
Snapshots: 0  earliest seq num: 0
Table iters: 0
Filter utility: 0.0%
Ingestions: 0  as flushable: 0 (0B in 0 tables)

result:

iter stats:

seeked 1 times (1 internal); stepped 1.1K times (213K internal); blocks: 0B cached, 8.7MB not cached (read time: 4.435423ms); points: 213K (26MB keys, 302KB values)
                 block bytes 9.1 MB
                 block read duration 4.435423ms
                 key bytes 28 MB
                 value bytes 309 kB
                 point count 213040
                 point covered by range tombstones count 0
                 separated {0 0 0}
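
For reference, a minimal sketch (not our exact code) of how these stats can be collected, assuming the v1.1.x API where NewIter returns (*Iterator, error) and Iterator.Stats() returns a pebble.IteratorStats (older releases return only *Iterator from NewIter); the bounds and helper name are illustrative:

```go
package example

import (
	"fmt"

	"github.com/cockroachdb/pebble"
)

// dumpIterStats is a hypothetical helper: it scans a key range and prints the
// iterator stats, similar to the output pasted above.
func dumpIterStats(db *pebble.DB) error {
	iter, err := db.NewIter(&pebble.IterOptions{
		LowerBound: []byte("prefix/"), // illustrative bounds, not our real keys
		UpperBound: []byte("prefix0"),
	})
	if err != nil {
		return err
	}
	defer iter.Close()

	for valid := iter.First(); valid; valid = iter.Next() {
		_ = iter.Value() // process the record
	}

	stats := iter.Stats()
	// String() renders a summary like the one pasted above.
	fmt.Println(stats.String())
	// InternalStats.PointCount is the counter that spikes when the iterator
	// has to step over long runs of point tombstones.
	fmt.Println("internal points stepped:", stats.InternalStats.PointCount)
	return iter.Error()
}
```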

[screenshot attached]

Could you provide advice on optimizing options to improve overall iterator performance or insights into any potential misconfigurations in our current settings?

Jira issue: PEBBLE-250

ever0de commented 3 weeks ago

Most of the counts arise from skipping entries whose InternalKey.Trailer kind is DEL. Since the data represents an order book where each order has a globally unique key, this write pattern seems to be causing the observed behavior.
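
To make the write pattern concrete, here is a simplified sketch; the key layout, helper name, and argument types are illustrative, not our actual schema. Each order key is written once and deleted once, so every removal leaves a point tombstone behind:

```go
package example

import (
	"encoding/binary"

	"github.com/cockroachdb/pebble"
)

// applyOrderBookUpdates sketches the workload: new orders are Set under
// globally unique keys, and removed orders are Deleted. Each Delete of a
// never-reused key becomes a DEL entry (a point tombstone) in the LSM.
func applyOrderBookUpdates(db *pebble.DB, added map[uint64][]byte, removed []uint64) error {
	b := db.NewBatch()
	defer b.Close()

	key := func(orderID uint64) []byte {
		k := make([]byte, 8)
		binary.BigEndian.PutUint64(k, orderID)
		return k
	}
	for id, encoded := range added {
		if err := b.Set(key(id), encoded, nil); err != nil {
			return err
		}
	}
	for _, id := range removed {
		if err := b.Delete(key(id), nil); err != nil {
			return err
		}
	}
	return b.Commit(pebble.Sync)
}
```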

I suspect that the elision-only compaction may not have been functioning properly.


Could this issue be related to the ReadSamplingMultiplier option? When it was set to -1, the situation gradually resolved over time.
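
For reference, a minimal sketch of how we disable read sampling, assuming the v1.1.x Options layout where this knob sits under Options.Experimental (field placement may differ in other versions); the path is hypothetical:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	opts := &pebble.Options{}
	// A negative multiplier disables read sampling, and with it
	// read-triggered compactions.
	opts.Experimental.ReadSamplingMultiplier = -1

	db, err := pebble.Open("demo-db", opts) // hypothetical path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```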

[screenshot attached]

Based on the current observations, my interpretation is:

  1. Read-triggered compaction frequently occurs for prefixes where data is not often written, though occasional SET/DELETE operations happen.
  2. Elision-only compaction does not occur as frequently, allowing tombstones to persist down to L6.
  3. As a result, the Iterator encounters many tombstones, especially in L6. However, despite frequent encounters in (*Iterator).sampleRead(), read-triggered compaction does not occur, possibly because the numOverlappingLevels does not reach 2. https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/iterator.go#L819-L880
jbowens commented 3 weeks ago

How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.
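
A rough sketch of what that swap could look like, assuming the Batch.DeleteSized(key, deletedValueSize, writeOptions) signature; the helper is hypothetical, the size is only a hint so an estimate is fine, and if I remember right the store's format major version also needs to be new enough to accept sized deletes:

```go
package example

import (
	"github.com/cockroachdb/pebble"
)

// deleteOrderSized replaces Batch.Delete with Batch.DeleteSized so that
// compaction picking can account for the space the tombstone will reclaim.
func deleteOrderSized(db *pebble.DB, key []byte, deletedValueSize uint32) error {
	b := db.NewBatch()
	defer b.Close()

	if err := b.DeleteSized(key, deletedValueSize, nil); err != nil {
		return err
	}
	return b.Commit(pebble.Sync)
}
```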

Unfortunately, in the currently tagged releases, Pebble's compaction heuristics around point tombstones focus only on reducing space amplification. If the space amplification incurred is minimal and the point tombstones sit in different levels than the points they delete, there is no mechanism to prioritize compacting the point tombstones.

As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.

jbowens commented 3 weeks ago

Since you mentioned elision-only compactions, I want to clarify that those compactions only rewrite an sstable in place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).
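
For clarity, this is the kind of snapshot usage I mean; a minimal sketch, assuming the v1.1.x Snapshot API where Get returns a value, a closer, and an error (the helper is hypothetical):

```go
package example

import (
	"github.com/cockroachdb/pebble"
)

// readAtSnapshot reads a key at a fixed point in time. Roughly speaking,
// while the snapshot is open it pins its sequence number, so key versions
// still visible to it (including ones shadowed by newer tombstones) cannot
// be dropped by compactions until the snapshot is closed.
func readAtSnapshot(db *pebble.DB, key []byte) ([]byte, error) {
	snap := db.NewSnapshot()
	defer snap.Close()

	value, closer, err := snap.Get(key)
	if err != nil {
		return nil, err
	}
	defer closer.Close()

	// Copy the value before the closer invalidates the underlying buffer.
	return append([]byte(nil), value...), nil
}
```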

ever0de commented 3 weeks ago

Thank you so much for your quick response!

How large are the KVs that are being deleted? One mitigation that may help the compaction heuristics prioritize compacting these tombstones is to use Batch.DeleteSized if you know the size of the value being deleted. When provided, compaction picking can use this information to better prioritize compaction of point tombstones.

The size will be very small (probably just a few tens of bytes). Since we already know the size of the value, we'll try using DeleteSized.

As of last week, master includes https://github.com/cockroachdb/pebble/commit/28840262ebcf55013b726e584cd9218400dd5eca which introduces a heuristic that seeks to reduce the density of point tombstones to resolve exactly this problem. Unfortunately, there's not a documented upgrade process to get from the current tagged release to master yet.

Are there any significant feature differences between version v1.1.2 and the master branch (currently at commit 99abcf76cc2171d1bcf5eb076f897535c105e053)?

They look very promising; we'll review the changes and consider adopting them.

Since you mentioned elision-only compactions, I want to clarify that those compactions only re-write a sstable in-place to remove obsolete data that's deleted by tombstones within the same file. This can only happen in the presence of snapshots (DB.NewSnapshot).

Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.

[screenshot attached]

I raised this issue after reviewing the (*compactionPickerByScore).pickAuto(..) code (https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382). Could this behavior actually be caused by move compactions?

Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?

jbowens commented 3 weeks ago

Separately, could the frequent occurrence of this issue, where many tombstones are found, be due to read-triggered compaction?

Yeah, I think it's possible that read-triggered compactions exacerbate the problem by compacting from L5 into L6, reducing the compaction-picking score for L5. If L5 is too small relative to L6 (e.g., because of these read-triggered compactions), Pebble won't even consider picking an L5->L6 compaction until L5 is large enough again. In the meantime, a large volume of point tombstones can collect in L5.

Hmm... This is quite perplexing to me. We observed the elision-only compactions metric in places where we do not use the DB.NewSnapshot() function.

Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest and do those sstables contain tombstones (either point or range)?

ever0de commented 3 weeks ago

Hrm, I don't have an explanation for that. Are you ingesting sstables using DB.Ingest and do those sstables contain tombstones (either point or range)?

No, there isn't anything else. We only perform Set/Delete operations using Batch. When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1374-L1382?

EDIT: Ah, I see. I missed that. It seems the issue could indeed be related to snapshots, as shown in the code at https://github.com/cockroachdb/pebble/blob/9f3904a705d60b9832febb6c6494183d92c8f556/compaction_picker.go#L1569-L1572

jbowens commented 3 weeks ago

When you say there is no explanation, do you mean that this could not be caused by a move compaction, as referenced in the code?

Ah, I wasn't thinking clearly. I had assumed that even the move compaction case required open snapshots, but that's not true. Move compactions can move tombstones into L6, and then we'll schedule an elision-only compaction to clear them out.

ever0de commented 2 weeks ago

It seems I’m experiencing some confusion.

  1. Move compaction can indeed move tombstones to L6, as it only changes the level without altering the data itself. -> :ok:
  2. In pickElisionOnlyCompaction, fileMetadata.LargestSeqNum can only be greater than or equal to env.earliestSnapshotSeqNum if there is an existing snapshot. -> :question:
    • Can pickElisionOnlyCompaction be triggered as a result of a move compaction?

Could you help clarify point 2?

Regarding https://github.com/cockroachdb/pebble/issues/3881#issuecomment-2305105014: I cannot figure out how this could have passed the if condition from point 2 in the code.

ever0de commented 2 weeks ago

Here are the metrics after applying the patch from commit 3d14906a0e0c (using default options). In practice, there was a noticeable increase in the TombstoneDensity compaction count, and the latency appears stable.

[screenshots attached]

Details: [screenshot attached]

ever0de commented 2 weeks ago

Thank you for your response! I believe this issue has been resolved, so I am closing it.