cockroachdb / pebble

RocksDB/LevelDB inspired key-value database in Go
BSD 3-Clause "New" or "Revised" License

scan_internal: support skip-shared iteration of external files #3083

Open itsbilal opened 10 months ago

itsbilal commented 10 months ago

As with shared files, two Pebble instances can reduce the bytes scanned and transferred by exchanging file metadata or location info instead of file contents. The receiver then ingests those files as external files, plus some local files that contain the diff between the external files and the intended state.

As part of this issue, explore whether such functionality makes sense for CockroachDB's use cases, and if it does, update ScanInternal to treat external files as shareable files in "skip-shared" iteration mode. IngestAndExcise will also need to be updated to accept external files in place of shared files.
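To make the proposal concrete, here is a minimal sketch of how a skip-shared scan might partition files once external files count as shareable. The types `fileMeta` and `scanPlan` and the function `planSkipShared` are hypothetical names for illustration only, not Pebble's API, and the sketch ignores level restrictions and key-span overlap.

```go
package main

// fileBacking is a hypothetical classification of where an sstable's bytes live.
type fileBacking int

const (
	backingLocal    fileBacking = iota // stored on the node's local disk
	backingShared                      // stored in shared object storage
	backingExternal                    // an external file referenced by location
)

// fileMeta is a hypothetical, simplified stand-in for sstable metadata.
type fileMeta struct {
	Name    string
	Level   int
	Backing fileBacking
}

// scanPlan separates files that can be described by metadata alone from
// files whose contents still have to be scanned and transferred.
type scanPlan struct {
	SendMetadata []fileMeta
	ScanContents []fileMeta
}

// planSkipShared partitions files for a skip-shared scan. Shared files are
// always skipped (only their metadata is sent). When includeExternal is true,
// external files are treated the same way, which is the change this issue
// proposes to explore.
func planSkipShared(files []fileMeta, includeExternal bool) scanPlan {
	var p scanPlan
	for _, f := range files {
		shareable := f.Backing == backingShared ||
			(includeExternal && f.Backing == backingExternal)
		if shareable {
			p.SendMetadata = append(p.SendMetadata, f)
		} else {
			p.ScanContents = append(p.ScanContents, f)
		}
	}
	return p
}
```

With `includeExternal` set to false, an external file falls into `ScanContents` like any local file. Setting it to true moves the file into `SendMetadata`, which is where the savings in scanned and transferred bytes would come from.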

Jira issue: PEBBLE-77

Epic CRDB-40359

RaduBerinde commented 9 months ago

Some more nuance from @msbutler:

I’m also not sure how frequently “virtualized” snapshots (i.e. sending SST metadata with a URI instead of the actual SST) will be used during the online restore (OR) download job. Currently, in a disaggregated storage cluster, the sender only creates a virtualized snapshot if all SSTs in L5/L6 are shared (i.e. the physical SSTs live in S3); otherwise the sender falls back to old-style snapshots. In a disaggregated storage cluster, nearly all snapshots are virtualized, since nearly all SSTs in L5-L6 are shared. But in a normal cluster, the OR download job and Pebble compactions will download those L5/L6 files ASAP, leaving fewer opportunities for virtualized snapshots. Further, given how wide a key span the SSTs in L5 and L6 cover, as soon as one or two of these files materialize, it seems quite unlikely that we can take advantage of virtualized snapshots, because a replica's key span will likely intersect a downloaded SST.
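The sender-side rule described above can be sketched roughly as follows. The `sst` type and `canSendVirtualized` function are hypothetical and are not taken from CockroachDB or Pebble; the sketch assumes the caller has already collected the SSTs that overlap the replica's key span.

```go
package main

// sst is a hypothetical, simplified description of an sstable that
// overlaps the replica's key span.
type sst struct {
	Level  int
	Shared bool // true if the physical file lives in shared storage
}

// canSendVirtualized reports whether a virtualized snapshot is possible
// under the rule quoted above: every overlapping SST in L5 or L6 must be
// shared. A single local (for example, already downloaded) L5/L6 file
// forces a fallback to an old-style snapshot.
func canSendVirtualized(overlapping []sst) bool {
	for _, f := range overlapping {
		if f.Level >= 5 && !f.Shared {
			return false
		}
	}
	return true
}
```

This illustrates the concern in the comment: because the check is all-or-nothing over the overlapping L5/L6 files, one downloaded file that intersects the replica's key span is enough to rule out a virtualized snapshot.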

jbowens commented 9 months ago

I'm hopeful that this won't be necessary for online restore if we presplit ranges appropriately. Supporting it would considerably increase the overall complexity of the initial online restore preview. It's much easier to reason about if there is a single mechanism for linking these external sstables into the LSM, and that mechanism runs during the restore linking phase.