We want to delete unused snapshots.
It would help me to unpack that a bit. Should I read "we" as the customer, i.e. the validator node operator? (What we the developers want is of relatively little importance, yes?) I suppose the validator node operator wants nodes to consume as little space as is feasible. A simple requirement is "at most one snapshot per live vat". Is there sufficient motivation for anything more complex than that?
And what does "unused" mean? The simplest design I see is to schedule destruction of the previous snapshot of a vat whenever we make a new snapshot for that vat. It happens to align nicely with the simple requirement above. But that wouldn't involve (non-trivial) reference counting.
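A minimal sketch of that simple design, with invented names (`snapStore.delete`, `vatWorker.snapshot`, and the kvStore key shape are all assumptions for illustration, not the real API):

```js
// Hypothetical "rotate on save": each vat keeps at most one snapshot.
// We store just the snapshot ID under a per-vat key for simplicity.
async function saveNewSnapshot(snapStore, kvStore, vatID, vatWorker) {
  const oldID = kvStore.get(`${vatID}.lastSnapshot`);
  const newID = await vatWorker.snapshot(snapStore); // write snapshot, get its ID
  kvStore.set(`${vatID}.lastSnapshot`, newID);
  if (oldID && oldID !== newID) {
    snapStore.delete(oldID); // destroy the superseded snapshot
  }
  return newID;
}
```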
When would multiple vats share a snapshot? I suppose the zygote design sketch in #2391 suggests at least one snapshot that isn't the most recent snapshot of any live vat. The code in #2848 does something similar with the xs-worker supervisor.
I suppose it's also possible to have multiple vats that share a history, and hence might have exactly the same snapshot. Separating snapshot storage by vat seems simpler than reference counting, to address that possibility.
To capture the customer requirement irrespective of how it's carried out, perhaps a better title is "snapstore: bounded storage per vat"?
Validators want to minimize space usage, we (the platform developers) want to support validators.
"at most one snapshot per live vat" is an excellent target.
"unused" means the snapshot ID does not appear in the set of committed DB keys that capture the (snapshot ID, transcript starting point) data for each vat. If a vat needs snapshot X, and the snapstore doesn't have it, the kernel cannot ever page-in that vat, which is kernel-fatal.
Multiple vats will definitely share the "just SES and supervisor" snapshot for the first 20 cranks of their lives, or until whatever heuristic we settle upon decides it's worth the time to record a new one. If/when we get vat creation fast enough, and people start using short-lived single-purpose contract vats, they might not even reach 20 cranks before being terminated. The zygote feature will make it possible to share more-developed snapshots (and if we know we're using a vat as a zygote starting point, we'll certainly want a snapshot to start from; that's where all our performance will come from, unless we also use a fork-and-copy-on-write trick to just clone the entire Unix process).
But more importantly, if we're identifying snapshots by their hash (which I still think is a good idea), then we're not identifying them by the vat(s) which use them, which means we need some other mechanism to guarantee the availability of the snapshot and minimize storage needs.
Reference counting isn't the only possible solution, but it's the only one that's occurred to me so far. The motivating example might be: we have 1000 vats, all launched from the same zygote snapshot, all of which have executed 19 cranks and then needed to be paged out for whatever reason. How do we avoid keeping 1000 copies of that snapshot, while still being able to delete it once the installation is replaced, the zygote parent vat is deleted, and all 1000 child vats have either moved on (and have unique snapshots of their own) or been terminated (and no longer need storage of any kind)?
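A sketch of what that refcounting might look like, with the counts kept in the same committed DB as the snapshot records (the key shape and `snapStore.delete` are my inventions, not the real snapstore API):

```js
// Hash-keyed refcounts: each vat that points at a snapshot holds one
// reference, and the file is only deletable when the count reaches zero.
function makeSnapshotRefcounter(kvStore, snapStore) {
  const key = hash => `snapshot.${hash}.refCount`;
  return {
    addRef(hash) {
      const n = Number(kvStore.get(key(hash)) || 0);
      kvStore.set(key(hash), `${n + 1}`);
    },
    removeRef(hash) {
      const n = Number(kvStore.get(key(hash)) || 0);
      if (n <= 0) throw Error(`refcount underflow for snapshot ${hash}`);
      if (n === 1) {
        kvStore.delete(key(hash));
        snapStore.delete(hash); // last user gone: reclaim the file
      } else {
        kvStore.set(key(hash), `${n - 1}`);
      }
    },
  };
}
```

In the 1000-vat example, the zygote snapshot's count would start at 1000, and the file would be deleted exactly when the last child either writes its own snapshot or is terminated.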
As for multiple vats sharing a significant history and converging on a shared snapshot: I think that's unlikely enough that I'm not worried about trying to deduplicate it, but it's absolutely critical that we maintain availability of the snapshot. So e.g. if two vats do happen to overlap for whatever reason (parallel evolution), we must not delete a snapshot out from under them.
@warner measured 12,000 deliveries in 2 hrs through vattp, the busiest vat. Using snapshotInterval = 200 and 0.3MB per (compressed) snapshot:
> 12000 / 200
60
> 12000 / 200 * 0.3
18
So that's 18MB of storage for 2 hrs. We can afford that for a while.
one idea:
FYI, my monitor node for the last testnet grew after 2 weeks to about 26GB worth of snapshots, spread over 3093 files, for 36 vats.
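For scale, taking those figures at face value, that's on the order of 86 retained snapshots per vat, at roughly 8MB per file:
> Math.round(3093 / 36)
86
> Math.round(26000 / 3093)
8
which is a long way from the "at most one snapshot per live vat" target above.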
What is the Problem Being Solved?
We want to delete unused snapshots. The snapstore is somewhat disconnected from the set of vats which use their snapshots (e.g. the snapstore doesn't know which vats are using which snapshots). And multiple vats might be using a single snapshot, at least for a little while (if we don't write a new snapshot until 20 cranks have been delivered, then all young vats will be using the same starting snapshot).
So we need some sort of refcounts, and something to notice when a snapshot is unused, and to delete the file.
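The "notice when a snapshot is unused" part doesn't strictly require refcounts; a mark-and-sweep pass over committed state would also work, assuming the snapstore can enumerate its stored hashes (a hypothetical `snapStore.list()`):

```js
// Mark: compute the in-use set from committed state (as in the
// inUseSnapshots sketch above). Sweep: delete anything outside it.
// `snapStore.list` and `snapStore.delete` are assumed operations.
function sweepUnusedSnapshots(snapStore, inUse) {
  for (const hash of snapStore.list()) {
    if (!inUse.has(hash)) {
      snapStore.delete(hash);
    }
  }
}
```

Either way, the check must run against committed state only; deleting a snapshot that an in-flight crank still references would recreate the kernel-fatal missing-snapshot case described above.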
We should probably get this implemented before e.g. moving vat-tp back to xsnap, since it receives a lot of messages (probably more than any other vat in the system), which will produce a lot of snapshots. And making it stateless (#3305) is stalled on some deep science (reconstructing Remotables without replay). The vat-tp state is small and doesn't grow over time, so each snapshot should be minimal (I think we've seen 200-300kB?), but writing a new one every 20 deliveries would be a lot of space.

Description of the Design
Not sure, maybe swingStore should be aware of which keys are being used to hold snapshot names, and could perform the refcounting internally. Or maybe something on the kernel side should track it.
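A sketch of the swingStore-internal option: wrap kvStore so that writes to snapshot-name keys adjust refcounts (e.g. via the `makeSnapshotRefcounter` sketch above) transparently, keeping the kernel out of the bookkeeping. All names and the `.lastSnapshot` key convention are hypothetical.

```js
// swingStore-side interception: the kernel writes `${vatID}.lastSnapshot`
// keys as usual, and the wrapper maintains the refcounts internally.
function snapshotAwareKVStore(kvStore, refcounter) {
  const isSnapKey = key => key.endsWith('.lastSnapshot');
  return {
    ...kvStore,
    set(key, value) {
      if (isSnapKey(key)) {
        const old = kvStore.get(key);
        refcounter.addRef(JSON.parse(value).snapshotID); // add before remove,
        if (old) refcounter.removeRef(JSON.parse(old).snapshotID); // in case old === new
      }
      kvStore.set(key, value);
    },
    delete(key) {
      if (isSnapKey(key)) {
        const old = kvStore.get(key);
        if (old) refcounter.removeRef(JSON.parse(old).snapshotID);
      }
      kvStore.delete(key);
    },
  };
}
```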
Security Considerations
Test Plan