delete historical transcript spans (unless config switch says to retain)

Transforming issue into an Epic with the following list of issues comprising it. Historical description and comments below.

### 1. Switch to keep only operational data for regular nodes
- [ ] https://github.com/Agoric/agoric-sdk/issues/9388
- [ ] https://github.com/Agoric/agoric-sdk/issues/9387
- [ ] https://github.com/Agoric/agoric-sdk/issues/9386

### 2. Prune old data
- [ ] Write CLI instructions for validators to prune old data
- [ ] https://github.com/Agoric/agoric-sdk/issues/9100

### 3. Store historical items as compressed files
- [ ] https://github.com/Agoric/agoric-sdk/issues/10036
- [ ] (Optional) https://github.com/Agoric/agoric-sdk/issues/9389
- [ ] (Optional) https://github.com/Agoric/agoric-sdk/issues/8448

## What is the Problem Being Solved?

Our mainnet chain state is growing pretty fast, and we'd like to make it smaller.

The state is stored in two places: cosmos databases (~/.agoric/data/ in LevelDBs like application.db and blockstore.db), and the Agoric-specific "Swing-Store" (~/.agoric/data/agoric/swingstore.sqlite). This ticket is focused on the swing-store. The largest component of the swing-store, in bytes, is the set of historical transcript spans, because they contain information about every delivery made to every vat since the beginning of the chain. These records, plus their SQL overhead, are an order of magnitude larger than anything else in the swing-store, and make up about 97% of the total space.

As of today (29-Mar-2024), the fully-VACUUM'ed SQLite DB is 147 GB, growing at about 1.1 GB/day. There are 38M transcript items, whose total size (sum(length(item))) is 116 GB.

For normal chain operations, we only need the "current transcript span" for each vat: enough history to replay the deliveries since the last heap snapshot. That is enough to bring a worker online, in the same state it was in at the last commit point. Our current snapshotInterval = 200 configuration means there will never be more than 200 deliveries in the current span (one span per vat), so they will be fairly small. The total size of all six-thousand-ish current transcript items is a paltry 16 MB. A pruned version of today's swingstore DB would be about 4.6 GB in size.
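As a rough illustration of how the "all items" and "current span only" sizes above could be measured, here is a minimal Python sketch. The table and column names (`transcriptItems(vatID, position, item)` and `transcriptSpans(vatID, startPos, endPos, isCurrent)`) are a simplified, assumed schema, and spans are assumed to cover the half-open range `[startPos, endPos)`; check the real swing-store schema before running anything like this against a node.

```python
import sqlite3


def transcript_sizes(db):
    """Return (total_size, current_span_size) of transcript items.

    Assumed (simplified) schema:
      transcriptItems(vatID, position, item)
      transcriptSpans(vatID, startPos, endPos, isCurrent)
    Note: SQLite's LENGTH() counts characters for TEXT values
    and bytes for BLOB values.
    """
    (total,) = db.execute(
        "SELECT COALESCE(SUM(LENGTH(item)), 0) FROM transcriptItems"
    ).fetchone()
    (current,) = db.execute(
        """
        SELECT COALESCE(SUM(LENGTH(i.item)), 0)
          FROM transcriptItems AS i
          JOIN transcriptSpans AS s
            ON s.vatID = i.vatID
           AND i.position >= s.startPos
           AND i.position < s.endPos
         WHERE s.isCurrent = 1
        """
    ).fetchone()
    return total, current
```

The difference between the two numbers is what pruning historical spans would reclaim, before accounting for SQL overhead and the VACUUM needed to actually shrink the file.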

However, when we first launched the chain, we were concerned that we might find ourselves needing to replay the entire incarnation, from the very beginning, perhaps as a last-ditch way to upgrade XS. We decided to have cosmic-swingset configure the swing-store to retain all transcript spans, not just the current one.

I think it's no longer feasible to retain that much data. We carefully designed the swing-store to keep hashes of all historical spans, even if we delete the span data itself, so we retain the ability to safely re-install the historical data (i.e. with integrity, not relying upon the data provider for correctness, merely availability). So in a pinch, we could find a way to distribute the dataset to all validators and build a tool to restore their databases (perhaps one vat at a time).

## Description of the Design

We need a user-accessible switch to control whether a node retains historical transcripts or not. The swingstore constructor call takes an option to control this, but it isn't plumbed into e.g. app.toml.

Then, we need to configure two or more archive nodes to retain their historical transcripts, so we'll have data to recover from should we ever need it. All existing nodes have that data (and it is currently included in state-sync snapshots), so mainly we need to have at least two reliable existing nodes not change their configuration to prune the old spans.

Then, really, we should build and test some tooling to:

- copy a single vat's historical spans from a populated swingstore into an archive that we can easily distribute
- repopulate the historical spans from that archive
  - maybe define a directory of historical archives that the node can examine at startup, and validate+import any data it doesn't currently have
- at least sketch out a workflow by which validators can restore this data if they need it, and get sufficiently up-to-date that we can perform the replay-since-incarnation-start task
  - note that we'd have to write a lot of new code to actually do that replay, which we should not attempt to do now: we just need sufficient confidence that it could be done, and that we have a safe+usable copy of the data somewhere

Then, either we change the default setting to prune the historical transcripts, or we tell all validators that they can save 90% of their disk space by changing the setting and let them make the decision.

This is closely related to what artifacts we put in state-sync snapshots. Currently we put all historical artifacts in those snapshots, and require them to all be present when restoring from a snapshot. We would need to change both sides: omit the historical spans during export, and stop requiring them during import (each of which is probably a one-line change to the swingstore API options bag). As a side-effect, state-sync snapshots would become a lot smaller, and would take less time to export and to import.

## Security Considerations

The hashes we use on the transcript spans mean there are no integrity considerations. However, this represents a significant availability change.

We don't know that we'll ever need to replay-since-incarnation-start, and we don't know that we could afford to do so anyways:

- until we manage to restart/upgrade vats on a regular basis, the incarnations are long
  - most vats have never been upgraded, so they have a single incarnation, with all activity since mainnet launch last year
- replaying an entire incarnation means recapitulating the entire history of that vat, akin to replaying every transaction on the chain since launch, albeit one vat at a time
- that could take weeks or months of CPU time
- so it would almost certainly need to execute in the background somehow, during an extensive warmup period, before all validators switch to the new image, which will/would require a lot more engineering effort and coordination

We think, but have not implemented or tested, that we can restore this data later, given the (incomplete) plan above. We don't know how hard it will be to implement that plan, or to practically deliver the replacement artifacts. How large will they be? Will we need to deliver all of them, or just for a few vats? How can anyone get sufficiently up-to-date? We might have a situation where the chain halts for upgrade, and all validators must fetch the last few spans from an archive node before they can restart their nodes, introducing extra delays into an upgrade process that's already complicated (if we're resorting to such a big replay).

But we do know that the space this consumes is considerable, and growing fast. I'm really starting to think that we can't afford to have all nodes keep all that data anymore, and that we should instead hope/rely upon the work we've done being sufficient to restore the data in the (unlikely?) event that we ever need it.

cc @mhofman @ivanlei for decisions

## Scaling Considerations

Once deployed, this will remove about 1.0 GB per day from the disk-space growth of a mainnet validator (more, if transaction volume increases, e.g. because new price-feed oracles are deployed). If/when a validator does a "state-sync prune" (where they restore from a state-sync snapshot), they'll get a one-time reduction of 152 GB from their disk usage. The resulting swingstore should be about 4.6 GB, and will grow at about 34 MB per day. (The cosmos DB will be unaffected, and is currently about 182 GB, depending upon pruning and snapshot settings).
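As a quick sanity check on these figures, the arithmetic can be sketched as below. It uses only the growth rates quoted in this issue (1.1 GB/day today, about 34 MB/day after pruning); the decimal conversion `MB_PER_GB = 1000` is an assumption.

```python
# Assumption: decimal units (1 GB = 1000 MB).
MB_PER_GB = 1000


def growth_after_pruning(current_gb_per_day, pruned_mb_per_day):
    """Return (GB/day saved, fraction of today's growth that remains)."""
    pruned_gb_per_day = pruned_mb_per_day / MB_PER_GB
    saved = current_gb_per_day - pruned_gb_per_day
    remaining_fraction = pruned_gb_per_day / current_gb_per_day
    return saved, remaining_fraction


# Growth rates quoted in this issue.
saved, remaining = growth_after_pruning(1.1, 34)
print(f"saved per day: {saved:.2f} GB, remaining growth: {remaining:.1%}")
```

The remaining growth comes out at roughly 3% of today's rate, which is consistent with historical transcript spans making up about 97% of the swing-store.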

## Test Plan

I believe @mhofman 's integration tests will exercise the state-sync export and import parts. I think we should have manual tests that an archive node will retain the data we care about.

## Upgrade Considerations

We must decide whether the prune-historical-spans behavior is the default for all nodes (and have archive nodes configure themselves to retain those spans), or if retain-historical-spans is the default (and make sure validators know how to prune if desired). If we choose the former, then upgrade automatically makes things somewhat better (reduces the growth rate). A state-sync refresh/reload/pruning would still be necessary to shed the bulk of the data.

A few more ideas:

- The switch to control pruning could be stored in the SQLite DB itself, or read from a config file in the same directory, if that made sense
- We could add a swingstore config option that names a directory, and whenever we roll over a span (i.e. create a new historical span), the compressed contents would be written to a new file in that directory, before their items were deleted from transcriptItems. We might do this in commit(), in a new prune() function, after the block's contents have been committed, but before returning to the caller. prune() would write out files for all historical transcripts (there may be multiple, even for a single vat), DELETE their items, then do a second commit().
  - Alternatively, we could write the historical spans out to a second SQLite DB, instead of raw files (which would remove some of the concerns around fflush() on those files).
  - This might be a useful component of a service which makes the historical artifacts, and a feed of their names, available for download
- We could write a tool that takes these artifacts and injects them into a swingstore, independently of swingset. Or maybe configure an import directory, and every once in a while, swingstore looks in it to see if there are any artifacts to be imported, reads their contents, checks the hashes, INSERT INTO transcriptItems, commits, then deletes the files.
  - I'm thinking about how we transition a node from "prune your old spans" to "stop pruning and also accept replacements", and how it gets up-to-date in preparation for a big replay. We need to disable the pruning first, then have something find out what spans are needed, fetch them, and drop them into the place where they'll be imported. If the big replay is going to happen in the background, we need to grab the oldest missing spans first.
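The import-directory idea could look roughly like the following Python sketch. Everything specific here is hypothetical: the file naming (`transcript.<vatID>.<startPos>.<endPos>.gz`), the artifact format (gzip-compressed, newline-separated items), and especially the hash, which is taken to be a plain SHA-256 of the decompressed bytes stored in an assumed `transcriptSpans.hash` column. The real swing-store computes span hashes its own way, so treat this only as an outline of the check-then-insert-then-delete ordering.

```python
import gzip
import hashlib
import sqlite3
from pathlib import Path


def import_artifacts(db, import_dir):
    """Import each valid span artifact in import_dir; return the names imported."""
    imported = []
    for path in sorted(Path(import_dir).glob("transcript.*.gz")):
        # Hypothetical file name: transcript.<vatID>.<startPos>.<endPos>.gz
        _, vat_id, start, end, _ = path.name.split(".")
        start_pos, end_pos = int(start), int(end)
        data = gzip.decompress(path.read_bytes())
        row = db.execute(
            "SELECT hash FROM transcriptSpans"
            " WHERE vatID = ? AND startPos = ? AND endPos = ?",
            (vat_id, start_pos, end_pos),
        ).fetchone()
        # Assumption: the recorded hash is SHA-256 over the raw artifact bytes.
        if row is None or row[0] != hashlib.sha256(data).hexdigest():
            continue  # unknown span or corrupt data: leave the file untouched
        items = data.decode("utf-8").split("\n")
        if len(items) != end_pos - start_pos:
            continue
        with db:  # one transaction per artifact, committed on success
            db.executemany(
                "INSERT OR REPLACE INTO transcriptItems (vatID, position, item)"
                " VALUES (?, ?, ?)",
                [(vat_id, start_pos + i, item) for i, item in enumerate(items)],
            )
        path.unlink()  # delete the file only after the commit succeeded
        imported.append(path.name)
    return imported
```

Because the hash comes from the node's own span records, a bad or malicious artifact is simply skipped; the provider of the files only affects availability, not integrity.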

I only skimmed through for now, but wanted to capture a couple thoughts:

- if we have an option, I think we should plumb it from the cosmos app config (toml file)
- I really like the idea of keeping historical artifacts separate from the main SQLite DB. However in that case I think we should go back to compressed files on disk. We might even be able to simplify some of the DB schema, if we assume any historical snapshot or transcript span is simply a file on disk.

Agreed, although one benefit of storing the old spans in a (separate) SQLite DB is that commit means commit: SQLite ensures the data will be properly flushed to disk, and an ill-timed power failure won't threaten it. If we use discrete files, we ought to do our own fflush() or equivalent, which is a drag. OTOH, that would certainly make it easier to publish, maybe as easy as pointing a plain webserver at the directory and throwing a CDN in front of it.

One note, the options.keepTranscripts we pass into makeSwingStore currently controls whether old transcript items are deleted during rolloverSpan or rolloverIncarnation. It defaults to true, but we probably want to set it to false to achieve the goal of this ticket.

The tricky part is that it uses a single SQL statement, DELETE FROM transcriptItems WHERE vatID = ? AND position < ?, to delete everything older than the start of the current span. The first time that is run on a DB with a lot of history, it is going to delete a lot of items, and that's a problem: a quick test on our largest mainnet vat (v43-walletFactory), with 8.2M items as of last week, took two full seconds on a fast machine to just count the items. It took 22 minutes to delete them all, and the statement created a 27 GiB .wal file (to hold the uncommitted txn) while it ran.

As part of #8928 I'm adding a delete-a-little-at-a-time API to the swing-store, but it's aimed at vat deletion: there's no obvious way to incorporate it into rolloverSpan/rolloverIncarnation. That would leave us in an uncomfortable position: anyone who had done a state-sync prune of their node would be fine, but anyone who still has the original data would experience multiple massive stalls (and a 10-20% disk-usage spike) some random number of blocks after the upgrade which switches to keepTranscripts: false, as the long-history vats hit the end of their snapInterval=200 delivery cycles and trigger a span rollover. Every vat has a pretty long history right now, so this would happen a lot, until all those old items finished being deleted.

One option is to change rolloverSpan to only delete the previous span's items, not all earlier items: we already have the startPos, endPos from that span, so we could change the DELETE to bound position on both sides. That would achieve the goal of flattening out the item growth without also incurring a gigantic deletion event. We'd wind up with a sparse transcript: populated items for spans 0..X, then missing items for spans X+1..CURRENT-1, then populated items for span CURRENT.
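A minimal sketch of that bounded deletion, in Python against SQLite, is shown below. The function name is hypothetical, and it assumes a span covers the half-open range `[startPos, endPos)` of item positions.

```python
import sqlite3


def rollover_span_delete(db, vat_id, start_pos, end_pos):
    """Delete only the items of the span that just became historical.

    Assumption: the span covers positions start_pos <= position < end_pos.
    Bounding both sides keeps each DELETE to at most one span's worth of
    rows (about snapshotInterval items), instead of the vat's whole history.
    """
    cur = db.execute(
        "DELETE FROM transcriptItems"
        " WHERE vatID = ? AND position >= ? AND position < ?",
        (vat_id, start_pos, end_pos),
    )
    return cur.rowcount


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transcriptItems"
        " (vatID TEXT, position INTEGER, item TEXT, PRIMARY KEY (vatID, position))"
    )
    db.executemany(
        "INSERT INTO transcriptItems VALUES ('v1', ?, 'x')",
        [(p,) for p in range(650)],
    )
    # Span [400, 600) just rolled over; older items stay, giving a sparse transcript.
    print(rollover_span_delete(db, "v1", 400, 600))
```

In the demo, items 0..399 (pre-existing history) and 600..649 (the new current span) survive, which is the sparse layout described above.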

Then we'd need to decide what to do about rolloverIncarnation. If we simply did the change above, we'd have the same sparseness/gaps, which isn't the worst situation to be in.

A deeper fix would be to change swingstore to have a cleanup(budget) API, which the host would call at some moderate fixed rate (maybe one cleanup(5) call each block). There's a tricky question of consensus, though. The new #8928 APIs (transcriptStore.deleteVatTranscripts(vatID, budget=5)) affect consensus state because they delete the span records themselves, with hashes, which are shadowed into IAVL via the export-data: we aren't just de-populating the items, we're forgetting about the old spans completely. The kernel calls them for the vats that it knows have been terminated but not yet fully-deleted, so every block makes a small in-consensus change that deletes some DB data.

But the population status of transcript items is not part of consensus, partially to allow different validators to make different space-vs-replay-hassle decisions. To rate-limit the deletion of items for terminated vats, I'm having the kernel delete a budget-limited number of spans each block, and then the swingstore deletes both those span records (which will always already be present) and their transcript items (which may or may not be populated).

Perhaps the way to go is for rolloverSpan to delete all the span records right away (40k for that largest vat, maybe 100ms to execute, although it does mean 40k IAVL deletions too), and then have a non-consensus-changing swing-store cleanup(budget) method which is allowed to delete any transcript item that does not fit into a span record. I'm not sure how to make that efficient. The most general case would allow a patchwork of spans, and we'd delete one item at a time, with a DB query for each one like SELECT COUNT(*) FROM transcriptSpans WHERE vatID=? AND startPos<=? AND endPos>? to see if it's retained or not. And it would have to start by getting a list of vatIDs, so it could iterate through each one's items separately. I don't really want to change the schema for this (i.e. adding a list of ranges of items that are known to not have span records, and which can be deleted), but we'd be within our rights to have the swingstore keep some state in RAM to speed things up, since it doesn't matter which unreferenced items get deleted (different validators, with different reboot histories, are allowed to delete different items). So maybe at swingstore startup, or the first time that cleanup() is called, we scan for all vatIDs and find the ranges of populated items for each (maybe we assume that we get two contiguous ranges: one for the current span, then possibly a second for the not-yet-deleted historical ones). Then in RAM we track those historical ranges, which would provide an easy way to pick off 100 at a time without even doing any additional DB queries.
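The RAM-tracked cleanup(budget) idea could be sketched like this in Python. The class and method names are hypothetical, and the sketch leans on a simplifying assumption: the surviving span records for a vat are contiguous, so every item below the lowest remaining `startPos` is unreferenced. Vats with no span records at all are conservatively skipped.

```python
import sqlite3


class ItemCleaner:
    """Deletes unreferenced transcript items a little at a time.

    The ranges of deletable items live only in RAM, so nothing here touches
    consensus state or the schema; after a restart the scan simply runs again.
    """

    def __init__(self, db):
        self.db = db
        self.pending = None  # list of [vatID, lo, hi), built on first cleanup()

    def _scan(self):
        self.pending = []
        vats = [v for (v,) in self.db.execute(
            "SELECT DISTINCT vatID FROM transcriptItems")]
        for vat_id in vats:
            (first_kept,) = self.db.execute(
                "SELECT MIN(startPos) FROM transcriptSpans WHERE vatID = ?",
                (vat_id,)).fetchone()
            if first_kept is None:
                continue  # no span records: be conservative and skip
            lo, hi = self.db.execute(
                "SELECT MIN(position), MAX(position) FROM transcriptItems"
                " WHERE vatID = ? AND position < ?",
                (vat_id, first_kept)).fetchone()
            if lo is not None:
                self.pending.append([vat_id, lo, hi + 1])

    def cleanup(self, budget):
        """Sweep at most `budget` positions; return the number of rows deleted."""
        if self.pending is None:
            self._scan()
        swept = deleted = 0
        while self.pending and swept < budget:
            rng = self.pending[0]
            vat_id, lo, hi = rng
            take = min(budget - swept, hi - lo)
            cur = self.db.execute(
                "DELETE FROM transcriptItems"
                " WHERE vatID = ? AND position >= ? AND position < ?",
                (vat_id, lo, lo + take))
            deleted += cur.rowcount
            swept += take
            rng[1] = lo + take
            if rng[1] >= hi:
                self.pending.pop(0)
        self.db.commit()
        return deleted
```

A host could call `cleanup(100)` once per block; each call issues a small, bounded DELETE rather than one enormous transaction.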

@mhofman and I decided:

- the cosmos app.toml config will have an option to control swingset/swingstore's keepTranscripts
  - if omitted, the default will depend upon the app.toml pruning settings:
    - if pruning = "nothing", we assume this is an archive node, and we set keepTranscripts: true
    - otherwise we set keepTranscripts: false
  - an explicit app.toml setting will override that default
- we'll change the import/export modes from replay to operational
  - state-sync exports will get small, by omitting old spans
  - state-sync imports will ignore any old spans
  - note that state-sync is still slow because of the huge number of IAVL keys, which is an #8400 / #8401 problem, and is being worked on elsewhere
- we'll change swing-store to treat keepTranscripts: false as meaning "delete only the items from a single span during rolloverSpan" instead of "delete everything that is old", so we don't swamp the SQL DB
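The defaulting rule for keepTranscripts can be written down as a small function. This is only a Python sketch of the decision logic: the function and parameter names are hypothetical, and the name of the new app.toml option is not decided here.

```python
def resolve_keep_transcripts(explicit, pruning):
    """Decide the effective keepTranscripts value.

    explicit: the operator's explicit setting (True/False), or None if omitted.
    pruning:  the app.toml `pruning` value, e.g. "nothing" or "default".
    """
    if explicit is not None:
        return explicit  # an explicit setting always overrides the default
    # pruning = "nothing" is taken to mean "archive node": keep everything.
    return pruning == "nothing"


# An archive node with no explicit setting keeps its transcripts.
assert resolve_keep_transcripts(None, "nothing") is True
# A regular node with no explicit setting prunes them.
assert resolve_keep_transcripts(None, "default") is False
# An explicit setting wins either way.
assert resolve_keep_transcripts(False, "nothing") is False
```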

Nodes which never change their keepTranscripts mode (e.g. archive nodes always have true, nodes launched from state-sync export always have false) will get obvious behavior: keep everything, or never have (and never generate) anything.

Nodes which transition from one mode to another (existing nodes that change their app.toml and restart, or nodes which are started from a non-state-sync "community snapshot" / raw dump but which edit their app.toml to drop old spans) will observe their growth rates go mostly flat (as we stop accumulating old spans), but will not shed any old data. To get rid of the old data, they must either do a state-sync prune, or some manual /usr/bin/sqlite3 CLI hacks.

Note that state-sync prune will get easier in more recent cosmos-sdk versions (maybe 0.47??), which introduce the ability to state-sync export to a local directory, and to import from the same, instead of only using the P2P network protocol (and thus depending upon some other node to publish their snapshot).

We also sketched out the rest of the tools that we can build later to support the creation/consumption of historical spans:

- we can change swing-store and adapt Chip's #8318 / #8693 work to write spans out to compressed files on disk as they become old
  - we should write to a tempfile, fsync, and atomic-rename, so we are never confused by partial files on disk
  - this must tolerate overwriting an existing file, because the kernel could be interrupted in the middle of a write, and the block re-executed later
- that could help archive nodes by moving these now-static span files out of swingstore.sqlite and into plain files in a nearby directory, making SQL faster
  - we could also write them an external tool which remediates the old data, by extracting the items, compressing them into plain files, then DELETE FROM the original rows
  - when complete, a VACUUM would reduce their disk usage: 90% of the original swingstore.sqlite would be transformed into a 10x-smaller (compression) set of external files
- next, we build a mechanism for archive nodes to publish this directory of compressed old spans somehow (perhaps simply uploading them to an S3 bucket)
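The tempfile / fsync / atomic-rename pattern mentioned in the list above might look like this in Python. The function name, the file naming, and the artifact format (gzip-compressed, newline-separated items) are assumptions for illustration only.

```python
import gzip
import os
import tempfile


def write_span_file(directory, name, items):
    """Write a compressed span file so readers never see a partial file.

    Overwriting an existing file of the same name is fine, which matters if
    a block is re-executed after an interruption.
    """
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, name)
    # mtime=0 makes the gzip output deterministic for identical input.
    data = gzip.compress("\n".join(items).encode("utf-8"), mtime=0)
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the contents to stable storage
        os.replace(tmp_path, final_path)  # atomic rename, replaces any old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Also fsync the directory so the rename itself survives a power loss.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
    return final_path
```

The temporary file is created in the same directory as the final file so that the rename stays within one filesystem, which is what makes it atomic.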

The consumers of this data are going to be validators / RPC nodes / followers who have seen a forum post that says we'll be doing a whole-incarnation replay of vatID v43 on some date a few weeks from now. To avoid significant downtime, we need to pre-fetch and pre-execute as much of that replay as possible. So at that point:

- we build a mechanism for validators/rpc-node/followers to download this directory, and keep up with new additions; the moral equivalent of while /bin/true; do rsync -r $URL/ ./local/; sleep 1; done (but without the overhead of re-checking all the old files every time)
  - depending upon the protocol, the files may be unvalidated against the span hashes in the real swingstore's transcriptSpans table, and they'll be validated later, before execution
  - or, we have this downloader tool also read hashes from the real swingstore as it runs, and only write fully-validated data to disk
- then we have an external tool which reads (and maybe validates) transcript items from the files, and executes them (e.g. with a new version of XS)
  - this will result in new transcript items, with different computron counts, but hopefully everything else will be the same
  - also, every snapInterval deliveries we'll get new heap snapshots
  - we'll store both the new transcript items/spans and snapshots in a stripped-down single-vat swingstore instance
  - that protects us against interrupted execution and lost progress
- when the time/block arrives to activate/swap-in the replay:
  - the kernel watches the stripped-down swingstore and waits for it to finish execution (the highest deliverynum in the replay swingstore should match the deliverynum of the real swingstore)
  - the kernel halts and destroys the vat worker (just like it does during normal upgrade)
  - the kernel deletes all the transcript items/spans and snapshots for the latest incarnation
  - the kernel copies all the transcript items/spans and snapshots from the replay DB into the real swingstore
- the external tool needs to know to stop execution, so it doesn't try to replay the new post-replay deliveries
  - it could watch the real swingstore until the next successful delivery is committed, and then delete the replay DB and all the old compressed spans
- and of course we have to figure out what an archive node should do with all of this: does it remember both histories? how do we tell them apart?
