10,000 ft overview
Persist wants a black box that it can put batch descriptions into and get out (possibly incremental, TBD) descriptions of compaction work to perform.
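To make the shape of that box concrete, here's a minimal sketch; every name in it is hypothetical, not an existing persist API:

```rust
// Hypothetical sketch of the black box: batch descriptions go in,
// descriptions of compaction work come out. None of these names exist
// in persist today.
struct BatchDescription {
    batch_id: u64, // hypothetical identifier for a batch
    len: usize,    // number of updates; bounds, encoded sizes, etc. would live here too
}

/// One unit of compaction work: merge these input batches into one output.
struct CompactionWork {
    inputs: Vec<u64>, // batch ids to merge
}

trait CompactionPlanner {
    /// Tell the planner about a newly appended batch.
    fn push_batch(&mut self, batch: BatchDescription);
    /// Ask for the next (possibly incremental, TBD) piece of compaction
    /// work required to maintain the planner's guarantees, if any.
    fn next_work(&mut self) -> Option<CompactionWork>;
    /// Report that a piece of work finished, producing a new batch.
    fn work_done(&mut self, work: CompactionWork, output: BatchDescription);
}
```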
The current implementation has problems (see below). We want to rethink persist compaction from first principles, then bring the implementation in line with the new design. Anything and everything we have built for compaction is on the table. Throw it out.
However, ideally there would not be major architectural changes to persist itself :)
What's the problem with what we have now?
No spec. We can't offer guarantees about when compaction work will be done, nor about how much read amp, write amp, or consolidation will result when the work is done
It doesn't deal well with a new version of the compaction code picking back up where an old version left off
e.g., this is the source of the shards with huge numbers of batch parts that never get compacted together
There's a mismatch between our durable representation of compaction state and our in-memory one
e.g., this is the source of the "state diff should apply cleanly" bugs we patched
The switch from "renditions" to "multi-shard persist_sink" invalidated some assumptions we made about S3 PUT/GET pricing, and it has since become a problem
We're not making the write amp/read amp trade-off well. I think this one could be saved with more unsatisfying heuristics, but in the spirit of having a spec I'd rather have something principled
Goals
Persist is well positioned to accept higher read amp in exchange for lower write amp
Ideally we'd be able to offer hard guarantees (or at least eventual guarantees?) on the above as well as on consolidation
IMO introducing a new batch should not block on compaction work (e.g., imagine new batches go to an internal WAL), but possibly this is up for debate
Question: Do we want to run compaction work on remote nodes? We might not want persist compaction competing with storage nodes for bandwidth or with compute nodes for CPU
Initial design discussion
DD's Spine backpressures on incremental compaction. Not doing so is a big departure and in the limit makes it impossible to offer a "spec". Maybe we can do something where we don't backpressure in the common case, but if things back up "too much" we apply backpressure to prevent the worst case (losing our spec guarantees).
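As a strawman for that "backpressure only when things back up" idea, a sketch with hypothetical types (this is not DD's actual Spine API; PendingMerge, max_outstanding, and the fuel convention are all assumptions): each insert carries fuel that is spent retiring pending merge work, and an insert is refused only once outstanding work exceeds a threshold.

```rust
// Hypothetical fuel-based backpressure sketch, not DD's real Spine.
struct PendingMerge {
    remaining: usize, // units of merge work left for this merge
}

struct Spine {
    pending: Vec<PendingMerge>,
    max_outstanding: usize, // the "too much" threshold (assumed knob)
}

impl Spine {
    /// Returns true if the insert was accepted; false means the caller must
    /// wait (backpressure) until enough merge work has been retired.
    fn insert(&mut self, mut fuel: usize) -> bool {
        let outstanding: usize = self.pending.iter().map(|m| m.remaining).sum();
        if outstanding > self.max_outstanding {
            return false; // worst case: block new batches until merges catch up
        }
        // Common case: spend this insert's fuel on in-progress merges,
        // dropping any merge that completes.
        self.pending.retain_mut(|m| {
            let spent = fuel.min(m.remaining);
            m.remaining -= spent;
            fuel -= spent;
            m.remaining > 0
        });
        true
    }
}
```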
Frank thinks Spine generalizes to each level having a queue of K items and compacting batches of N of them at a time. All spine-requested compactions would be completed, making K and N our new knobs for tuning write amp vs. read amp. TODO: Frank to write out some more detail on this.
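Pending that write-up, here is one possible reading of the generalization as a sketch; the structure, the merge trigger, and the K/N values are assumptions, not a settled design:

```rust
// Hypothetical per-level queues: once a level's queue grows past K, the
// oldest N batches are merged into one batch that is promoted to the next
// level, and all such requested compactions complete. Small N merges more
// often (more write amp, less read amp); large K leaves more batches to
// consult per read (the reverse).
use std::collections::VecDeque;

const K: usize = 8; // max queue length per level (assumed knob)
const N: usize = 4; // batches merged per compaction (assumed knob)

struct Batch {
    len: usize, // stand-in for real batch contents/metadata
}

struct Levels {
    levels: Vec<VecDeque<Batch>>,
}

impl Levels {
    fn insert(&mut self, batch: Batch) {
        self.push(0, batch);
    }

    fn push(&mut self, level: usize, batch: Batch) {
        if self.levels.len() <= level {
            self.levels.push(VecDeque::new());
        }
        self.levels[level].push_back(batch);
        if self.levels[level].len() > K {
            // Merge the N oldest batches and promote the result.
            let merged_len: usize =
                self.levels[level].drain(..N).map(|b| b.len).sum();
            self.push(level + 1, Batch { len: merged_len });
        }
    }
}
```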
Frank agrees that it should be possible to efficiently represent the internal state of Spine as differential collections (with a timestamp of the persist SeqNo).
One might be something like (batch_id, level, ordering_within_level), where ordering_within_level is maybe a sort of sequence-y, number-y thing; we'd continually promote the oldest entries to the next level, and compactions would reuse the number of one of their inputs.
Another is (batch_id, various_batch_metadata).
TODO: Frank to write out some more detail on this.
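In the meantime, a sketch of what those two collections, and the diffs a compaction would emit, might look like with the persist SeqNo as the timestamp; all of the types and the choice of metadata are hypothetical:

```rust
// Hypothetical differential representation of Spine state: each change
// becomes retractions of old rows and additions of new ones at a SeqNo,
// so the durable and in-memory representations are the same diffs.
type SeqNo = u64;
type BatchId = u64;
type Diff = i64; // +1 addition, -1 retraction

/// Collection one: where each batch sits in the spine.
struct Placement {
    batch_id: BatchId,
    level: u64,
    ordering_within_level: u64, // sequence-like; compactions reuse an input's
}

/// Collection two: per-batch metadata.
struct BatchMeta {
    batch_id: BatchId,
    len: usize, // stand-in for "various_batch_metadata"
}

/// A compaction of two batches, expressed as differential updates: retract
/// both inputs from both collections, add the output at the same level,
/// reusing the ordering number of one of the inputs.
fn compact_updates(
    a: (Placement, BatchMeta),
    b: (Placement, BatchMeta),
    out_id: BatchId,
    seqno: SeqNo,
) -> (Vec<(Placement, SeqNo, Diff)>, Vec<(BatchMeta, SeqNo, Diff)>) {
    let out_placement = Placement {
        batch_id: out_id,
        level: a.0.level,
        ordering_within_level: a.0.ordering_within_level.min(b.0.ordering_within_level),
    };
    let out_meta = BatchMeta { batch_id: out_id, len: a.1.len + b.1.len };
    (
        vec![(a.0, seqno, -1), (b.0, seqno, -1), (out_placement, seqno, 1)],
        vec![(a.1, seqno, -1), (b.1, seqno, -1), (out_meta, seqno, 1)],
    )
}
```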
Notion epic
Link: https://www.notion.so/persist-compaction-2-0-70ff7c89574f471c8a7d99eb51192b0e
Status: Done
Prioritization: Now
Estimated delivery date: 2024-06-30
Related issues
15093
15066
13628