db: reduce Ingest I/O with manifest+db locks

jbowens commented 2 years ago

To identify the target level of M ingested files in a LSM with N levels, we may need to seek in O(N × M) sstables to look for data overlap. This is all performed during application while holding DB.mu and the manifest lock, which prevents concurrent flushes and compactions from scheduling or completing while the mutex is held.

Ingest performs these data overlap checks while holding the manifest lock for a consistent view of the LSM. However, for the purposes of detecting data overlap, ingest could be optimistic about its view of the LSM.

When Ingest acquires the manifest lock during the apply step, Ingest can exclude any sstables or memtables with largest sequence numbers already observed during data overlap calculations. Ingest only needs to search in any new sstables for more recent overlap introduced since its optimistic scan.

Also related to #1683 which will remove the strict requirement of preventing file boundary overlap, making target level a function of data overlap exclusively.

Jira issue: PEBBLE-173

jbowens commented 1 year ago

This may be used to improve the estimate of ApproxIngestedIntoL0Bytes for flushable ingests. The precomputed data overlap can be used to determine which sstables overlapped L0 before entering the commit pipeline.

jbowens commented 1 year ago

Screenshot 2023-05-12 at 3 33 53 PM

I suspect in these graphs we see ingestions holding up the commit pipeline, causing a divergence between the WAL fsync latency and the raft log commit latency.

RaduBerinde commented 6 months ago

This is my current thinking:

refactor the code to stop using levelIter. We can just find the overlapping files and check for overlap with those.
store an "overlap cache" in each FileMetadata. The overlap cache consists of a small number of non-overlapping UserKeyBounds; each one is either a known empty region of the file, or a known "full" region of the file (either just a point, or a range corresponding to a range del or range key).
when we check with overlap with a FileMetadata, we first take a look at the overlap cache and see if it can give us an answer (i.e. if we are completely inside of a "known empty" region, or if we intersect with any "known full" region). If not, we check for overlap and add one or two regions in the cache (one for any empty region we find, and one for any full region we find).

When restoring a table, we can expect many external ingestions in the same area so the cache should help across ingestions too. In any case, to minimize I/O while holding mutexes, we will do an optimistic "dry run" without the mutex; then, if the version has changed, we run it again under the mutex, and it will benefit from all the cached regions from the dry run.

jbowens commented 6 months ago

We can just find the overlapping files and check for overlap with those.

If a file's bounds are wholly contained within the ingested sstable bounds, I think we should also assume the file is non-empty and has data overlap without checking. During imports, ingested sstables can be wide in keyspace and we wouldn't want to actually look for data overlap in every overlapping file.

cockroachdb / pebble

db: reduce Ingest I/O with manifest+db locks #2112