cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License
4.91k stars 455 forks source link

db: reduce Ingest I/O with manifest+db locks #2112

Open jbowens opened 2 years ago

jbowens commented 2 years ago

To identify the target level of M ingested files in a LSM with N levels, we may need to seek in O(N × M) sstables to look for data overlap. This is all performed during application while holding DB.mu and the manifest lock, which prevents concurrent flushes and compactions from scheduling or completing while the mutex is held.

Ingest performs these data overlap checks while holding the manifest lock for a consistent view of the LSM. However, for the purposes of detecting data overlap, ingest could be optimistic about its view of the LSM.

When Ingest acquires the manifest lock during the apply step, Ingest can exclude any sstables or memtables with largest sequence numbers already observed during data overlap calculations. Ingest only needs to search in any new sstables for more recent overlap introduced since its optimistic scan.

Related to #25 and this TODO to reduce overlap checks by using L0 sublevels.

Also related to #1683 which will remove the strict requirement of preventing file boundary overlap, making target level a function of data overlap exclusively.

Jira issue: PEBBLE-173

jbowens commented 1 year ago

This may be used to improve the estimate of ApproxIngestedIntoL0Bytes for flushable ingests. The precomputed data overlap can be used to determine which sstables overlapped L0 before entering the commit pipeline.

jbowens commented 1 year ago

Screenshot 2023-05-12 at 3 33 53 PM

I suspect in these graphs we see ingestions holding up the commit pipeline, causing a divergence between the WAL fsync latency and the raft log commit latency.

RaduBerinde commented 6 months ago

This is my current thinking:

When restoring a table, we can expect many external ingestions in the same area so the cache should help across ingestions too. In any case, to minimize I/O while holding mutexes, we will do an optimistic "dry run" without the mutex; then, if the version has changed, we run it again under the mutex, and it will benefit from all the cached regions from the dry run.

jbowens commented 6 months ago

We can just find the overlapping files and check for overlap with those.

If a file's bounds are wholly contained within the ingested sstable bounds, I think we should also assume the file is non-empty and has data overlap without checking. During imports, ingested sstables can be wide in keyspace and we wouldn't want to actually look for data overlap in every overlapping file.