swingstore should compress old transcript spans

warner commented 1 year ago

What is the Problem Being Solved?

I ran some statistics on a recent copy of the mainnet swingstore.sqlite with the following SQL statements:

SELECT COUNT(bundle) FROM bundles;
SELECT SUM(length(bundle)) FROM bundles;

SELECT COUNT(hash) FROM snapshots; -- all rows, even pruned, have a hash
SELECT COUNT(compressedSnapshot) FROM snapshots; -- NULL when pruned
SELECT SUM(length(compressedSnapshot)) FROM snapshots;

SELECT COUNT(key) FROM kvStore;
SELECT SUM(length(key) + length(value)) FROM kvStore;

SELECT COUNT(item) FROM transcriptItems;
SELECT SUM(length(item)) FROM transcriptItems;

As of blockHeight 11415809 (around 29-aug-2023), I get:

store	entries (full)	entries (pruned)	size
bundles	40	(same)	30 MB
snapshots	24_145	55	93 MB
kvStore	4_106_729	(same)	293 MB
transcriptItems	4_869_688	(same)	13.7 GB

The whole file is 17.6 GB in size, so we've also got like 3.5GB of SQL index/b-tree overhead.

That's awfully big. The cosmos state is large too (about 2x that size), but still.

We're keeping current-incarnation-but-historical-span transcript items around just in case we need to do an xsnap upgrade by replaying the whole incarnation, but normal operation doesn't need random-access to them. Back when we moved everything into SQLite, I remember wanting rolloverSpan() to concatenate and compress all the old span's items into a single DB row (either a column in the transcriptSpans table, or a new table called compressedTranscriptSpans), but we decided it wasn't necessary at the time.

I think we should reconsider that. The transcript items are really fluffy, and I bet we'd get 10x compression on them. Plus, we only have 24_201 spans (of which only 56 are current: one per vat), so we'd remove nearly five million rows from the DB, which would shrink the overhead considerably. This wouldn't speed up normal operation. It would slow down rolloverSpan() slightly (concatating and compressing takes time, unless we cooked up a scheme to do it in the background). It would slow down the xsnap replay case by some tiny amount (probably swamped by the actual xsnap execution time).

Description of the Design

The simplest approach would be to compress during rolloverSpan(), and make a separate table for the compressed spans (having done a couple of hasty and perhaps unwise sqlite3 -box swingstore.sqlite 'SELECT * FROM tablename' or sqlite3 swingstore.sqlite .dump tablename commands, while debugging something, I'm starting to value putting the really large blobs/strings in their own table):

add a transcriptCompressedSpans table, with (vatID, startPos, endPos, compressedSpan) columns
when rolloverSpan() causes a [startPos,endPos) range of items to fall out of inUse:
- newline-concatenate all of SELECT * FROM transcriptItems WHERE vatID=? AND startPos >= ? AND endPos < ? ORDER BY position
- zlib/gzip-compress that string
- INSERT the compressed blob INTO transcriptCompressedSpans
- DELETE FROM transcriptItems WHERE vatID=? AND startPos >= ? AND endPos < ?

Then change readSpan to check transcriptCompressedSpans for a compressed record: if present, decompress it, split on newlines, iterate through the result. If not present, look for the (loose) items in transcriptItems. Any given span might be compressed, uncompressed, or pruned, and all APIs must handle all cases.

We'd also need to change assertComplete to accept the presence of a compressed span, rather than insisting on the items being present in transcriptItems.

(We already have code to build the newline-concatenated string, since that what the span artifacts look like.)

We'd need to change the export and import code as well: exports of compressed spans will do a decompression but not any concatenation, and imports of non-current spans will do a compression in addition to a split (the compressed data gets jammed directly into the transcriptCompressedSpans table, but of course it must still recompute the cumulative hash and verify it against the span record).

compressing the existing data

The trickiest part is when/how to compress the pre-existing records, since it will take so long. It won't affect the consensus state, though, so it can be done in the background. The commit point is important, though: we can't interrupt the real work, and we can't accidentally cause the real work to get committed too early.

One option is to use a separate DB connection to perform the compression work, like we do for makeSwingStoreExporter. But it's critical that we don't break the real work, which is done on a transaction that is opened (in IMMEDIATE mode) as soon as the first user of hostStorage or kernelStorage calls an API that needs to write to the DB. Once that txn is opened, no other connection can open a write txn. The window of contention starts when the host calls a device input (which probably changes device state, or pushes something onto the run queue), and ends after controller.run() finishes and the host calls hostStorage.commit(). So a separate DB connection could only be used between blocks (i.e. in the "voting time"), and it would need to finish up (and commit) before the next work cycle began.

Let's avoid that (we probably want to stick with a single writable connection anyways, just for simplicity).

Instead, we could do the work with the same DB connection, as a method on hostStorage (next to commit). I'm thinking we add p = hostStorage.doBackgroundWork(pollfn). The pollfn would behave a lot like the runPolicy you can pass to controller.run(): it gets consulted after each chunk of work to see whether it ought to continue or not. The Promise returned by doBackgroundWork fires shortly after pollfn returned false or the swingstore discovered there is no work left to do.

The host could do something like:

...
await swingStore.commit(); // finish previous block
let keepGoing = true;
let stopAfter = Date.now() + duration; // maybe 3 seconds, to fit in 5s voting time
const pollfn = () => keepGoing && Date.now() < stopAfter;
const backgroundDone = hostStorage.doBackgroundWork(pollfn); // note: no await
await goDoVotingStuff(); // takes arbitrary time, but usually 5s
// voting is done, stop background work
keepGoing = false;
await backgroundDone; // should fire soon, since no new work will be started
processInputs();
await controller.run();
await swingStore.commit(); // finish new block
... repeats

I bet @mhofman has some ideas here.

I'm sure the details are tricky. It would be great if we could somehow tell whether we're catching up with the chain or not (compare current blockTime against wallclock time?), and do less or no background work until we're caught up. Maybe, for any given launch of the host, set a flag that says "we're probably still catching up", after each block is done we start a 2s timer, if the timer expires before the next block starts then clear the flag (enabling background work), if the block starts before the timer expires then cancel the timer and leave the flag set.

Another option is to only perform the work during a chain upgrade, and wait for it to complete before executing the first post-upgrade block. I'll see if I can estimate how much CPU time we're talking about here.. gzipping a 15GB file would take several minutes at best, even on an infinitely fast/parallelized CPU, because that's a lot of data to pull in off the disk (and it's probably scattered pretty badly over the surface). On the plus side, we could probably build an API to provide decent progress information (count the number of non-inUse spans, count the number of compressed spans, assume they all need compression, add a progressCallback option that is called with a few numbers after every chunk of work). If we're going to ask validators to tolerate a 5-minute delay at upgrade time, we should give them a progress bar. Also, we should commit every once in a while (so compression progress isn't lost if they get impatient and reboot).

If we go this way, we could also consider providing a standalone tool to perform the compression, if/when the validator decided the recovered disk space was worth some downtime. Such a tool could also do a VACUUM of the DB afterwards, because the actual file on disk won't shrink without one (freeing up all that space will give SQLite a huge bunch of free pages, so the file wouldn't grow beyond the starting 17.6 GB size for months or years, until all those free pages were consumed). Even if we do the compression work automatically or in the background, we might want to let validators know how to run a VACUUM afterwards, to concretely/immediately benefit from the reduced space.

We could also consider never attempting to catch up: compress new spans, but leave the old ones unpacked. That would reduce the swingstore growth rate significantly (probably 10x), but wouldn't reduce the size of existing data. But.. a state-sync import would start out with compressed spans. So instead of figuring out a background compression scheme, we just tell validators who want that 17GB back to restore from a state-sync snapshot. And that skips the need for a VACUUM too, since they'll be creating a brand-new SQLite DB, so it's container file will be minimally sized.

I think I like this last option, it would be waaay easier.

Security Considerations

Compression doesn't invalidate the cumulative hash we use, and the compressed spans are hash-verified to match the expectations established by the (trusted) export-data records during import. So nothing about compression should threaten the integrity properties of the old transcript items.

We must make sure our "background" scheduling of the compression work doesn't threaten real work: getting that wrong could cause a validator to crash as they try to use a DB that's locked.

The compression code we choose (Node's built-in zlib module, which almost certainly wraps the standard libzlib) will be processing the contents of transcript items, which include attacker-controlled strings (method arguments or vatstoreSet values). If zlib has a memory vulnerability, we'd be exposing that to the attacker. However, we're already doing that by virtue of compressing heap snapshots (which also contain attacker-controlled Uint8Array buffers and strings), so I don't think this is any more vulnerable than before.

Scaling Considerations

This is all about improving the scalability: reducing the storage requirements for validators and follower nodes by perhaps 25% (and the other 75% is from cosmos and tendermint, and can be reduced by periodic state-sync or block/txn pruning settings). The tradeoff is somewhat increased execution time (to compress recently-generated data as it falls out of the current span), and the need for a large chunk of time to catch up on the compression of all that pre-existing data.

Test Plan

Swingstore will have unit tests to demonstrate that:

the current span gets compressed after rolloverSpan() creates a new one
access to historical spans works the same whether the span is compressed or not
isComplete returns the same value whether a span is compressed or not
export gets the same artifact contents and hashes whether the span is compressed or not
import creates compressed (historical) spans, not unpacked items
whatever "catch up on compression" API we choose works, and is limited by the pollFn thing
corrupt historical artifacts are rejected during import, even though we're no longer storing the split items

Some of these tests may require manually constructing the DB to read from (i.e. non-compressed historical spans, which the new code would not normally generate).

Then we'll need tests on the cosmic-swingset -side code which drives the catch-up work, if we choose to build it.

Upgrade Considerations

The swing-store SQL schema will change: the new store will have an extra table, or perhaps an extra column on the transcriptSpans table. If we go with an extra table, we can use our existing CREATE TABLE IF NOT EXISTS approach to automatically add the table on the first launch with the new version of @agoric/swing-store. If we use a new column on the existing table, we must first implement the #8089 schema upgrade/migration plan.

The schema changes are one-way: after running with the newer version, you cannot revert to an older version. The older code would not look in the new transcriptCompressedSpans table, so it would think that the current-incarnation historical spans are missing entirely, which would cause state-sync exports to fail (the isComplete check would throw). And, if we do wind up performing xsnap upgrades with full-incarnation replays, those replays would fail to find the early transcript entries they need.

### Depends Upon
- [ ] #8089

warner commented 1 year ago

I'm escalating the priority of this one. I recomputed the current sizes and growth rates on mainnet. We're currently adding 200MB per day of transcript items (this would reduce that to maybe 20MB/day), and as of 21-sep-2023 we're up to 18.3GB of transcript items. I'm hopeful we can get this into the upgrade12 chain-software upgrade.

warner commented 1 year ago

If necessary, we could implement this without #8089. We would add a new table for the compressed spans (with a IF NOT EXISTS clause in the creation code), keep all the other tables the same, and make sure that "not compressed" means "not present in (the new table)".

Since we already made a schema change in #8075, without proper versioning/upgrading flows, we currently have two different variants (in the field) for the implicitly-named "version 1" of the SQL schema (v1a has the CHECK constraint on snapshots, v1b does not). The #8089 plan is to start with an upgrader that merges these two variants, so that DBs are always upgraded to a v2 (which always lacks the CHECK constraint). Upgrading v1b to v2 effectively only adds the version field. Updating v1a to v2 removes the CHECK constraint. Both upgrade processes must do the whole copy-to-temp-table-and-rename dance, but the resulting rows will be the same.

Implementing span compression without #8089 would introduce a third variant: v1c would have the extra table. Then when we do implement #8089, we'd add an upgrader that merges all three variants, so v2 always has the extra table, and never has the CHECK constraint.

mhofman commented 1 year ago

If necessary, we could implement this without #8089

There is another alternative that doesn't require versioning: on open, always attempt an opportunistic migration: check if table exists, if it doesn't create and move historical spans to it.

warner commented 1 year ago

Yeah, that works for the same basic reason that creating another variant of v1 would: these variant schema all interpret existing data the same way. We could get away with v1b because only the import code cares about the lack of the CHECK constraint (runtime operations always add complete records, it's only during import that we populate the hash first and the data later). It's not easy to sense the difference between v1a and v1b: we could read the CREATE TABLE statement's text from the SQLite meta-table, and grep for CHECK, but it would be fragile. That's why my plan for v2 is to not bother sensing the difference and just always do the migration.

The real driver for proper schema migration will be when the same row needs to be interpreted different ways depending on the version, or when we need to use different SQL statements to work with the different versions. Adding a compressed-span table is pretty simple compared to that, and a single SQL statement would suffice to add the table regardless of initial state. But, we would wind up with something like v2a (if we added the compressed-span table to a v1a DB that had the CHECK constraint), and v2b (if we added it to a v1b that lacked the constraint). So we'd be kicking the "merge existing variants into a single fully-defined schema" down the road by one upgrade step.

mhofman commented 1 year ago

For CHECK I agree, but for my suggested change we are actually performing a migration, just not one driven by a schema version, but by the presence of a table or not (or it could be the presence of a column if we can easily test the schema for a column).

I agree were kicking the problem of the CHECK down the road, but until we need to solve it (e.g. for background compression of heap snapshots) we don't need to require versioning for migrations. At least I don't think we do in this case.

warner commented 1 year ago

or it could be the presence of a column if we can easily test the schema for a column

(for future reference): our two options for that test are:

1: access the SQlite meta-table, retrieve the CREATE TABLE statement text from it, parse it according to SQL rules and/or grep for the column name
2: attempt to do a small SELECT which references the column name, catch the error (false), or ignore a successful result (true)

Agoric / agoric-sdk