Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

delete historical transcript spans (unless config switch says to retain) #9174

Open warner opened 5 months ago

warner commented 5 months ago

Transforming issue into an Epic with the following list of issues comprising it. Historical description and comments below.

### 1. Switch to keep only operational data for regular nodes
- [ ] https://github.com/Agoric/agoric-sdk/issues/9388
- [ ] https://github.com/Agoric/agoric-sdk/issues/9387
- [ ] https://github.com/Agoric/agoric-sdk/issues/9386
### 2. Prune old data
- [ ] Write CLI instructions for validators to prune old data
- [ ] https://github.com/Agoric/agoric-sdk/issues/9100
### 3. Store historical items as compressed files
- [ ] https://github.com/Agoric/agoric-sdk/issues/10036
- [ ] (Optional) https://github.com/Agoric/agoric-sdk/issues/9389
- [ ] (Optional) https://github.com/Agoric/agoric-sdk/issues/8448

What is the Problem Being Solved?

Our mainnet chain state is growing pretty fast, and we'd like to make it smaller.

The state is stored in two places: cosmos databases (~/.agoric/data/ in LevelDBs like application.db and blockstore.db), and the Agoric-specific "Swing-Store" (~/.agoric/data/agoric/swingstore.sqlite). This ticket is focused on the swing-store. The largest component of the swing-store, in bytes, is the set of historical transcript spans, because they contain information about every delivery made to every vat since the beginning of the chain. These records, plus their SQL overhead, are an order of magnitude larger than anything else in the swing-store, and comprise about 97% of the total space.

As of today (29-Mar-2024), the fully-VACUUM'ed SQLite DB is 147 GB, growing at about 1.1 GB/day. There are 38M transcript items, whose total size (sum(length(item))) is 116 GB.

For normal chain operations, we only need the "current transcript span" for each vat: enough history to replay the deliveries since the last heap snapshot. That is enough to bring a worker online, to the same state it was in at the last commit point. Our current snapshotInterval = 200 configuration means there will never be more than 200 deliveries in the current span (one span per vat), so they will be fairly small. The total size of all six-thousand-ish current transcript items is a paltry 16 MB. A pruned version of today's swingstore DB would be about 4.6 GB in size.
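As a quick sanity check on the figures above, the following sketch recomputes the share of the DB that is historical data and the average item size. It assumes decimal units (1 GB = 10^9 bytes), which the issue does not state explicitly.

```python
# Figures quoted above (29-Mar-2024). Assumption: decimal GB (1 GB = 1e9 bytes).
total_db_gb = 147.0        # fully-VACUUM'ed swingstore.sqlite
pruned_db_gb = 4.6         # estimated size with only current spans retained
item_bytes_gb = 116.0      # sum(length(item)) over all transcript items
item_count = 38_000_000    # number of transcript items

historical_gb = total_db_gb - pruned_db_gb
historical_fraction = historical_gb / total_db_gb
avg_item_bytes = item_bytes_gb * 1e9 / item_count

print(f"historical data: {historical_gb:.1f} GB ({historical_fraction:.0%} of the DB)")
print(f"average transcript item: {avg_item_bytes:.0f} bytes")
```

The computed fraction lines up with the "about 97%" estimate quoted earlier.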

However, when we first launched the chain, we were concerned that we might find ourselves needing to replay the entire incarnation, from the very beginning, perhaps as a last-ditch way to upgrade XS. We decided to have cosmic-swingset configure the swing-store to retain all transcript spans, not just the current one.

I think it's no longer feasible to retain that much data. We carefully designed the swing-store to keep hashes of all historical spans, even if we delete the span data itself, so we retain the ability to safely re-install the historical data (i.e. with integrity, not relying upon the data provider for correctness, merely availability). So in a pinch, we could find a way to distribute the dataset to all validators and build a tool to restore their databases (perhaps one vat at a time).
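A minimal sketch of why retained hashes let a node accept span data from an untrusted provider. The hashing scheme here (SHA-256 over newline-terminated items) and the function names `span_hash` and `restore_span` are assumptions for illustration; the issue does not describe the swing-store's actual hash construction.

```python
import hashlib

def span_hash(items):
    """Hash a span's items in order (assumed scheme: SHA-256 over newline-terminated items)."""
    h = hashlib.sha256()
    for item in items:
        h.update(item.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

def restore_span(retained_hash, supplied_items):
    """Accept items from an untrusted provider only if they match the retained hash."""
    if span_hash(supplied_items) != retained_hash:
        raise ValueError("supplied span does not match retained hash; rejecting")
    return list(supplied_items)

# The node kept only the hash when it pruned the span data.
original = ['{"d":["message","o+1"]}', '{"d":["notify","p-5"]}']
retained = span_hash(original)

restored = restore_span(retained, original)  # accepted: hashes match

try:
    restore_span(retained, original[:1] + ['{"d":["forged"]}'])
    tampered_accepted = True
except ValueError:
    tampered_accepted = False  # a modified span is rejected
```

The provider only has to be available, not honest: any alteration changes the hash and the import fails.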

Description of the Design

We need a user-accessible switch to control whether a node retains historical transcripts or not. The swingstore constructor call takes an option to control this, but it isn't plumbed into e.g. app.toml.

Then, we need to configure two or more archive nodes to retain their historical transcripts, so we'll have data to recover from should we ever need it. All existing nodes have that data (and it is currently included in state-sync snapshots), so mainly we need to have at least two reliable existing nodes not change their configuration to prune the old spans.

Then, really, we should build and test some tooling to:

Then, either we change the default setting to prune the historical transcripts, or we tell all validators that they can save 90% of their disk space by changing the setting and let them make the decision.

This is closely related to what artifacts we put in state-sync snapshots. Currently we put all historical artifacts in those snapshots, and require them to all be present when restoring from a snapshot. We would need to change both sides: omit the historical spans during export, and stop requiring them during import (each of which is probably a one-line change to the swingstore API options bag). As a side-effect, state-sync snapshots would become a lot smaller, and would take less time to export and to import.

Security Considerations

The hashes we use on the transcript spans mean there are no integrity considerations. However, this represents a significant availability change.

We don't know that we'll ever need to replay-since-incarnation-start, and we don't know that we could afford to do so anyways:

We think, but have not implemented or tested, that we can restore this data later, given the (incomplete) plan above. We don't know how hard it will be to implement that plan, or to practically deliver the replacement artifacts. How large will they be? Will we need to deliver all of them, or just for a few vats? How can anyone get sufficiently up-to-date? We might have a situation where the chain halts for upgrade, and all validators must fetch the last few spans from an archive node before they can restart their nodes, introducing extra delays into an upgrade process that's already complicated (if we're resorting to such a big replay).

But we do know that the space this consumes is considerable, and growing fast. I'm really starting to think that we can't afford to have all nodes keep all that data anymore, and that we should instead rely upon the work we've done being sufficient to restore the data in the (unlikely?) event that we ever need it.

cc @mhofman @ivanlei for decisions

Scaling Considerations

Once deployed, this will remove about 1.0 GB per day from the disk-space growth of a mainnet validator (more, if transaction volume increases, e.g. because new price-feed oracles are deployed). If/when a validator does a "state-sync prune" (where they restore from a state-sync snapshot), they'll get a one-time reduction of about 142 GB from their disk usage (the 147 GB DB of today, minus the roughly 4.6 GB that remains). The resulting swingstore should be about 4.6 GB, and will grow at about 34 MB per day. (The cosmos DB will be unaffected, and is currently about 182 GB, depending upon pruning and snapshot settings).
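To make the growth rates concrete, here is a rough one-year projection. It assumes growth stays linear at the rates quoted in this issue (1.1 GB/day with all spans retained, 34 MB/day when pruned), which is a simplification since transaction volume may change.

```python
def projected_size_gb(start_gb, growth_gb_per_day, days):
    """Linear projection; assumes the daily growth rate stays constant."""
    return start_gb + growth_gb_per_day * days

days = 365
retained_gb = projected_size_gb(147.0, 1.1, days)   # keep all historical spans
pruned_gb = projected_size_gb(4.6, 0.034, days)     # keep only current spans
saved_per_day_gb = 1.1 - 0.034                      # growth avoided each day

print(f"after {days} days: {retained_gb:.1f} GB retained vs {pruned_gb:.1f} GB pruned")
print(f"growth avoided: {saved_per_day_gb:.2f} GB/day")
```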

Test Plan

I believe @mhofman's integration tests will exercise the state-sync export and import parts. I think we should have manual tests that an archive node will retain the data we care about.

Upgrade Considerations

We must decide whether the prune-historical-spans behavior is the default for all nodes (and have archive nodes configure themselves to retain those spans), or if retain-historical-spans is the default (and make sure validators know how to prune if desired). If we choose the former, then upgrade automatically makes things somewhat better (reduces the growth rate). A state-sync refresh/reload/pruning would still be necessary to shed the bulk of the data.

warner commented 5 months ago

A few more ideas:

mhofman commented 5 months ago

I only skimmed through for now, but wanted to capture a couple thoughts:

warner commented 5 months ago

Agreed, although one benefit of storing the old spans in a (separate) SQLite DB is that commit means commit: SQLite ensures the data will be properly flushed to disk, and an ill-timed power failure won't threaten it. If we use discrete files, we ought to do our own fflush() or equivalent, which is a drag. OTOH, that would certainly make it easier to publish, maybe as easy as pointing a plain webserver at the directory and throwing a CDN in front of it.
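If discrete files were used, durability would need more than flushing the userspace buffer: the data also has to be synced to stable storage. A minimal sketch of one common pattern (write to a temporary file, fsync, then atomically rename); the function name `write_durably` and the file name are hypothetical, not part of any existing tool.

```python
import os
import tempfile

def write_durably(path, data):
    """Write bytes so that a crash leaves either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push file contents to stable storage
    os.replace(tmp_path, path)  # atomic rename onto the final name
    # Sync the directory so the rename itself survives a power failure (POSIX only).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "v43-span-0.txt")  # hypothetical file name
    write_durably(target, b"item 0\nitem 1\n")
    with open(target, "rb") as f:
        round_trip = f.read()
    leftover_tmp = os.path.exists(target + ".tmp")
```

This is the bookkeeping that SQLite's commit handles for you, which is the trade-off described above.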

warner commented 5 months ago

One note, the options.keepTranscripts we pass into makeSwingStore currently controls whether old transcript items are deleted during rolloverSpan or rolloverIncarnation. It defaults to true, but we probably want to set it to false to achieve the goal of this ticket.

The tricky part is that it uses a single SQL statement, DELETE FROM transcriptItems WHERE vatID = ? AND position < ?, to delete everything older than the start of the current span. The first time that is run on a DB with a lot of history, it is going to delete a lot of items, and that's a problem: a quick test on our largest mainnet vat (v43-walletFactory), with 8.2M items as of last week, took two full seconds on a fast machine to just count the items. It took 22 minutes to delete them all, and the statement created a 27 GiB .wal file (to hold the uncommitted txn) while it ran.
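The statement's behavior can be reproduced on a toy database. The schema below is a simplified assumption (the real transcriptItems table may have more columns), and the row counts are tiny compared with the 8.2M items mentioned above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Simplified, assumed schema for illustration only.
db.execute(
    "CREATE TABLE transcriptItems ("
    " vatID TEXT, position INTEGER, item TEXT,"
    " PRIMARY KEY (vatID, position))"
)
db.executemany(
    "INSERT INTO transcriptItems VALUES (?, ?, ?)",
    [("v43", pos, f"delivery-{pos}") for pos in range(1000)],
)
db.commit()

current_span_start = 800  # first position of the current span (illustrative)

# One statement, one transaction: every historical item is deleted at once.
cur = db.execute(
    "DELETE FROM transcriptItems WHERE vatID = ? AND position < ?",
    ("v43", current_span_start),
)
db.commit()
deleted = cur.rowcount
remaining = db.execute("SELECT COUNT(*) FROM transcriptItems").fetchone()[0]
```

Because the whole deletion is a single transaction, SQLite has to hold all of the uncommitted changes until the commit, which is what produced the large .wal file in the mainnet test.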

As part of #8928 I'm adding a delete-a-little-at-a-time API to the swing-store, but it's aimed at vat deletion: there's not an obvious way to incorporate it into rolloverSpan/rolloverIncarnation. That would leave us in an uncomfortable position: anyone who had done a state-sync prune of their node would be fine, but anyone who still has the original data would experience multiple massive stalls (and a 10-20% disk-usage spike) some random number of blocks after the upgrade which switches to keepTranscripts: false, as the long-history vats hit the end of their snapshotInterval = 200 delivery cycles and trigger a span rollover. Every vat has a pretty long history right now, so this would happen a lot, until all those old items finished being deleted.

One option is to change rolloverSpan to only delete the previous span's items, not all earlier items: we already have the startPos, endPos from that span, so we could change the DELETE to bound position on both sides. That would achieve the goal of flattening out the item growth without also incurring a gigantic deletion event. We'd wind up with a sparse transcript: populated items for spans 0..X, then missing items for spans X+1..CURRENT-1, then populated items for span CURRENT.
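A sketch of the two-sided DELETE and the sparse transcript it leaves behind. The schema is a simplified assumption, `rollover_span` is a hypothetical stand-in for the real function, and endPos is treated as exclusive.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transcriptItems ("
    " vatID TEXT, position INTEGER, item TEXT,"
    " PRIMARY KEY (vatID, position))"
)
# Three spans of 200 items each: span 0 (old history), span 1, span 2 (current).
db.executemany(
    "INSERT INTO transcriptItems VALUES (?, ?, ?)",
    [("v43", pos, f"delivery-{pos}") for pos in range(600)],
)
db.commit()

def rollover_span(db, vat_id, start_pos, end_pos):
    """Delete only the span that just ended: start_pos <= position < end_pos."""
    cur = db.execute(
        "DELETE FROM transcriptItems "
        "WHERE vatID = ? AND position >= ? AND position < ?",
        (vat_id, start_pos, end_pos),
    )
    db.commit()
    return cur.rowcount

# Span 1 (positions 200..399) ends after the switch to keepTranscripts: false.
deleted = rollover_span(db, "v43", 200, 400)
populated = [
    row[0]
    for row in db.execute(
        "SELECT position FROM transcriptItems WHERE vatID = ? ORDER BY position",
        ("v43",),
    )
]
```

After the rollover, positions 0..199 (pre-switch history) and 400..599 (current span) remain populated, with a gap in between. Each rollover deletes at most one span's worth of rows.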

Then we'd need to decide what to do about rolloverIncarnation. If we simply did the change above, we'd have the same sparseness/gaps, which isn't the worst situation to be in.

A deeper fix would be to change swingstore to have a cleanup(budget) API, which the host would call at some moderate fixed rate (maybe one cleanup(5) call each block). There's a tricky question of consensus, though. The new #8928 APIs (transcriptStore.deleteVatTranscripts(vatID, budget=5)) affect consensus state because they delete the span records themselves, with hashes, which are shadowed into IAVL via the export-data: we aren't just de-populating the items, we're forgetting about the old spans completely. The kernel calls them for the vats that it knows have been terminated but not yet fully-deleted, so every block makes a small in-consensus change that deletes some DB data.

But the population status of transcript items is not part of consensus, partially to allow different validators to make different space-vs-replay-hassle decisions. To rate-limit the deletion of items for terminated vats, I'm having the kernel delete a budget-limited number of spans each block, and then the swingstore deletes both those span records (which will always already be present) and their transcript items (which may or may not be populated).
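The budget-limited deletion described above could look roughly like the following. This is a hypothetical reimplementation on a simplified schema, not the actual #8928 API: the real function's return shape and table layout may differ.

```python
import sqlite3

def delete_vat_transcripts(db, vat_id, budget=5):
    """Delete up to `budget` span records, oldest first, plus any items they cover.

    Returns True once no span records remain for the vat.
    """
    spans = db.execute(
        "SELECT startPos, endPos FROM transcriptSpans "
        "WHERE vatID = ? ORDER BY startPos LIMIT ?",
        (vat_id, budget),
    ).fetchall()
    for start_pos, end_pos in spans:
        # Items may already be missing on a pruned node; the DELETE is then a no-op.
        db.execute(
            "DELETE FROM transcriptItems "
            "WHERE vatID = ? AND position >= ? AND position < ?",
            (vat_id, start_pos, end_pos),
        )
        db.execute(
            "DELETE FROM transcriptSpans WHERE vatID = ? AND startPos = ?",
            (vat_id, start_pos),
        )
    db.commit()
    remaining = db.execute(
        "SELECT COUNT(*) FROM transcriptSpans WHERE vatID = ?", (vat_id,)
    ).fetchone()[0]
    return remaining == 0

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transcriptSpans (vatID TEXT, startPos INTEGER, endPos INTEGER, hash TEXT)")
db.execute("CREATE TABLE transcriptItems (vatID TEXT, position INTEGER, item TEXT)")
# A terminated vat with 12 spans of 10 items; only the last 2 spans still have items.
db.executemany(
    "INSERT INTO transcriptSpans VALUES ('v9', ?, ?, 'hash')",
    [(s * 10, s * 10 + 10) for s in range(12)],
)
db.executemany(
    "INSERT INTO transcriptItems VALUES ('v9', ?, 'item')",
    [(p,) for p in range(100, 120)],
)
db.commit()

blocks_needed = 0
done = False
while not done:  # one call per simulated block
    done = delete_vat_transcripts(db, "v9", budget=5)
    blocks_needed += 1
items_left = db.execute("SELECT COUNT(*) FROM transcriptItems").fetchone()[0]
```

The number of span records deleted per call is the same on every node regardless of how many items were populated, which is what keeps the consensus-visible part deterministic.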

Perhaps the way to go is for rolloverSpan to delete all the span records right away (40k for that largest vat, maybe 100ms to execute, although it does mean 40k IAVL deletions too), and then have a non-consensus-changing swing-store cleanup(budget) method which is allowed to delete any transcript item that does not fit into a span record. I'm not sure how to make that efficient. The most general case would allow a patchwork of spans, and we'd delete one item at a time, with a DB query for each one like SELECT COUNT(*) FROM transcriptSpans WHERE vatID=? AND startPos<=? AND endPos>? to see if it's retained or not. And it would have to start by getting a list of vatIDs, so it could iterate through each one's items separately.

I don't really want to change the schema for this (i.e. adding a list of ranges of items that are known to not have span records, and which can be deleted), but we'd be within our rights to have the swingstore keep some state in RAM to speed things up, since it doesn't matter which unreferenced items get deleted (different validators, with different reboot histories, are allowed to delete different items). So maybe at swingstore startup, or the first time that cleanup() is called, we scan for all vatIDs and find the ranges of populated items for each (maybe we assume that we get two contiguous ranges: one for the current span, then possibly a second for the not-yet-deleted historical ones). Then in RAM we track those historical ranges, which would provide an easy way to pick off 100 at a time without even doing any additional DB queries.
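One way the RAM-tracked ranges could work, as a sketch under stated assumptions: the class `ItemCleaner` is hypothetical, the schema is simplified, and it assumes (as suggested above) that each vat's unreferenced items form one contiguous range below its oldest surviving span record.

```python
import sqlite3

class ItemCleaner:
    """Tracks, in RAM only, ranges of items that no span record covers."""

    def __init__(self, db):
        self.db = db
        self.pending = None  # list of [vatID, next_pos, end_pos], built lazily

    def _scan(self):
        pending = []
        vats = [r[0] for r in self.db.execute("SELECT DISTINCT vatID FROM transcriptItems")]
        for vat_id in vats:
            # Assumption: everything below the oldest surviving span is unreferenced.
            boundary = self.db.execute(
                "SELECT MIN(startPos) FROM transcriptSpans WHERE vatID = ?", (vat_id,)
            ).fetchone()[0]
            lowest = self.db.execute(
                "SELECT MIN(position) FROM transcriptItems WHERE vatID = ?", (vat_id,)
            ).fetchone()[0]
            if boundary is not None and lowest is not None and lowest < boundary:
                pending.append([vat_id, lowest, boundary])
        return pending

    def cleanup(self, budget):
        """Delete up to `budget` positions; returns True when nothing is left."""
        if self.pending is None:
            self.pending = self._scan()  # the only scan; later calls use RAM state
        while budget > 0 and self.pending:
            vat_id, lo, hi = self.pending[0]
            upto = min(lo + budget, hi)
            self.db.execute(
                "DELETE FROM transcriptItems "
                "WHERE vatID = ? AND position >= ? AND position < ?",
                (vat_id, lo, upto),
            )
            budget -= upto - lo  # counts positions, even if some were already empty
            if upto == hi:
                self.pending.pop(0)
            else:
                self.pending[0][1] = upto
        self.db.commit()
        return not self.pending

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transcriptSpans (vatID TEXT, startPos INTEGER, endPos INTEGER)")
db.execute("CREATE TABLE transcriptItems (vatID TEXT, position INTEGER, item TEXT)")
db.execute("INSERT INTO transcriptSpans VALUES ('v1', 200, 250)")  # only the current span
db.execute("INSERT INTO transcriptSpans VALUES ('v2', 0, 50)")
db.executemany("INSERT INTO transcriptItems VALUES ('v1', ?, 'x')", [(p,) for p in range(250)])
db.executemany("INSERT INTO transcriptItems VALUES ('v2', ?, 'x')", [(p,) for p in range(50)])
db.commit()

cleaner = ItemCleaner(db)
first_done = cleaner.cleanup(100)   # removes v1 positions 0..99
second_done = cleaner.cleanup(100)  # removes v1 positions 100..199
items_left = db.execute("SELECT COUNT(*) FROM transcriptItems").fetchone()[0]
```

Since the pending list lives only in memory, a restart simply rescans. That is acceptable here because which unreferenced items have been deleted is not part of consensus.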

warner commented 5 months ago

@mhofman and I decided:

Nodes which never change their keepTranscripts mode (e.g. archive nodes always have true, nodes launched from state-sync export always have false) will get obvious behavior: keep everything, or never have (and never generate) anything.

Nodes which transition from one mode to another (existing nodes that change their app.toml and restart, or nodes which are started from a non-state-sync "community snapshot" / raw dump but which edit their app.toml to drop old spans) will observe their growth rates go mostly flat (as we stop accumulating old spans), but will not shed any old data. To get rid of the old data, they must either do a state-sync prune, or some manual /usr/bin/sqlite3 CLI hacks.
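For the manual route, one possible shape is to delete old items in small batches, committing after each, so that no single transaction has to hold the whole deletion. This is an untested sketch on a toy table with an assumed schema. `prune_in_batches` is a hypothetical helper, and anything like it should only be tried on a stopped node with a backup.

```python
import sqlite3

def prune_in_batches(db, vat_id, current_span_start, batch_size=1000):
    """Delete items older than the current span, committing after every batch."""
    total = 0
    while True:
        cur = db.execute(
            "DELETE FROM transcriptItems WHERE rowid IN ("
            " SELECT rowid FROM transcriptItems"
            " WHERE vatID = ? AND position < ? LIMIT ?)",
            (vat_id, current_span_start, batch_size),
        )
        db.commit()  # each batch is its own small transaction
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transcriptItems ("
    " vatID TEXT, position INTEGER, item TEXT,"
    " PRIMARY KEY (vatID, position))"
)
db.executemany(
    "INSERT INTO transcriptItems VALUES (?, ?, ?)",
    [("v43", pos, "x") for pos in range(5000)],
)
db.commit()

pruned = prune_in_batches(db, "v43", current_span_start=4800)
remaining = db.execute("SELECT COUNT(*) FROM transcriptItems").fetchone()[0]
```

Deleting rows does not shrink the database file by itself; a VACUUM afterwards is what actually returns the space to the filesystem.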

Note that state-sync prune will get easier in more recent cosmos-sdk versions (maybe 0.47??), which introduce the ability to state-sync export to a local directory, and to import from the same, instead of only using the P2P network protocol (and thus depending upon some other node to publish their snapshot).

warner commented 5 months ago

We also sketched out the rest of the tools that we can build later to support the creation/consumption of historical spans:

The consumers of this data are going to be validators / RPC nodes / followers who have seen a forum post that says we'll be doing a whole-incarnation replay of vatID v43 on some date a few weeks from now. To avoid significant downtime, we need to pre-fetch and pre-execute as much of that replay as possible. So at that point: