**Open** · warner opened this issue 1 year ago
I'm escalating the priority of this one. I recomputed the current sizes and growth rates on mainnet. We're currently adding 200MB per day of transcript items (this would reduce that to maybe 20MB/day), and as of 21-sep-2023 we're up to 18.3GB of transcript items. I'm hopeful we can get this into the upgrade12 chain-software upgrade.
If necessary, we could implement this without #8089. We would add a new table for the compressed spans (with an `IF NOT EXISTS` clause in the creation code, e.g. the sketch below), keep all the other tables the same, and make sure that "not compressed" means "not present in the new table".
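For concreteness, a minimal sketch of what that idempotent creation could look like, using the table and column names proposed later in this issue (the exact definition, including the choice of primary key, is an assumption here):

```sql
-- hypothetical: one compressed blob per historical span, keyed the same way
-- as transcriptSpans; endPos is kept alongside for convenience
CREATE TABLE IF NOT EXISTS transcriptCompressedSpans (
  vatID TEXT,
  startPos INTEGER,
  endPos INTEGER,
  compressedSpan BLOB,
  PRIMARY KEY (vatID, startPos)
);
```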
Since we already made a schema change in #8075, without proper versioning/upgrading flows, we currently have two different variants (in the field) for the implicitly-named "version 1" of the SQL schema (`v1a` has the CHECK constraint on `snapshots`, `v1b` does not). The #8089 plan is to start with an upgrader that merges these two variants, so that DBs are always upgraded to a `v2` (which always lacks the CHECK constraint). Upgrading `v1b` to `v2` effectively only adds the `version` field. Upgrading `v1a` to `v2` removes the CHECK constraint. Both upgrade processes must do the whole copy-to-temp-table-and-rename dance, but the resulting rows will be the same.

Implementing span compression without #8089 would introduce a third variant: `v1c` would have the extra table. Then, when we do implement #8089, we'd add an upgrader that merges all three variants, so that `v2` always has the extra table and never has the CHECK constraint.
> If necessary, we could implement this without #8089

There is another alternative that doesn't require versioning: on open, always attempt an opportunistic migration. Check whether the table exists; if it doesn't, create it and move the historical spans into it.
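That opportunistic check could look roughly like this (a sketch only, assuming a better-sqlite3-style connection like the one swing-store uses; `maybeMigrateCompressedSpans` is a made-up name):

```js
// Hypothetical open-time check: if the compressed-span table is missing,
// create it and (opportunistically) migrate the historical spans into it.
function maybeMigrateCompressedSpans(db) {
  const present = db
    .prepare(
      `SELECT 1 FROM sqlite_master WHERE type='table' AND name='transcriptCompressedSpans'`,
    )
    .get();
  if (present) return; // already migrated: nothing to do

  db.exec(`CREATE TABLE transcriptCompressedSpans (
    vatID TEXT, startPos INTEGER, endPos INTEGER, compressedSpan BLOB,
    PRIMARY KEY (vatID, startPos)
  )`); // same shape as the sketch above
  // ...then walk the non-current spans and move their items into the new
  // table (the compression step itself is sketched under "Description of
  // the Design" below)
}
```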
Yeah, that works for the same basic reason that creating another variant of `v1` would: these variant schemas all interpret existing data the same way. We could get away with `v1b` because only the import code cares about the lack of the CHECK constraint (runtime operations always add complete records; it's only during import that we populate the hash first and the data later). It's not easy to sense the difference between `v1a` and `v1b`: we could read the `CREATE TABLE` statement's text from the SQLite meta-table and grep for `CHECK`, but it would be fragile. That's why my plan for `v2` is to not bother sensing the difference and just always do the migration.

The real driver for proper schema migration will be when the same row needs to be interpreted different ways depending on the version, or when we need to use different SQL statements to work with the different versions. Adding a compressed-span table is pretty simple compared to that, and a single SQL statement would suffice to add the table regardless of initial state. But we would wind up with something like `v2a` (if we added the compressed-span table to a `v1a` DB that had the CHECK constraint) and `v2b` (if we added it to a `v1b` that lacked the constraint). So we'd be kicking the "merge existing variants into a single fully-defined schema" problem down the road by one upgrade step.
For CHECK I agree, but for my suggested change we are actually performing a migration, just not one driven by a schema version: it's driven by the presence or absence of a table (or it could be the presence of a column, if we can easily test the schema for a column).

I agree we're kicking the problem of the CHECK down the road, but until we need to solve it (e.g. for background compression of heap snapshots) we don't need to require versioning for migrations. At least I don't think we do in this case.
> or it could be the presence of a column if we can easily test the schema for a column

(for future reference) Our two options for that test are:

* read the `CREATE TABLE` statement text from the SQLite meta-table, then parse it according to SQL rules and/or grep for the column name
* perform a `SELECT` which references the column name, catch the error (false), or ignore a successful result (true)
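Both tests are easy to sketch against a better-sqlite3-style connection (illustrative helpers, not real swing-store code):

```js
// Option 1: inspect the stored CREATE TABLE text and grep for the column name.
// Fragile, since it depends on how the statement happened to be written.
function hasColumnByGrep(db, table, column) {
  const { sql } = db
    .prepare(`SELECT sql FROM sqlite_master WHERE type='table' AND name=?`)
    .get(table);
  return sql.includes(column);
}

// Option 2: try a SELECT that references the column and see whether it throws.
function hasColumnByProbe(db, table, column) {
  try {
    db.prepare(`SELECT ${column} FROM ${table} LIMIT 1`).get();
    return true; // successful result: the column exists
  } catch (err) {
    return false; // "no such column" error: it does not
  }
}
```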
## What is the Problem Being Solved?
I ran some statistics on a recent copy of the mainnet `swingstore.sqlite` with a few SQL statements. As of blockHeight 11415809 (around 29-aug-2023), the transcript items make up the bulk of the contents, and the whole file is 17.6 GB in size, so we've also got something like 3.5 GB of SQL index/b-tree overhead.
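The exact queries aren't reproduced above, but the kind of measurement involved looks roughly like this (illustrative only; it assumes `transcriptItems` stores one item per row in an `item` column and `transcriptSpans` marks live spans with `isCurrent`):

```sql
-- total count and raw byte size of transcript items
SELECT COUNT(*) AS itemCount, SUM(LENGTH(item)) AS itemBytes
  FROM transcriptItems;
-- how many spans exist, and how many are still current
SELECT COUNT(*) AS spanCount FROM transcriptSpans;
SELECT COUNT(*) AS currentSpans FROM transcriptSpans WHERE isCurrent = 1;
```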
That's awfully big. The cosmos state is large too (about 2x that size), but still.
We're keeping current-incarnation-but-historical-span transcript items around just in case we need to do an xsnap upgrade by replaying the whole incarnation, but normal operation doesn't need random access to them. Back when we moved everything into SQLite, I remember wanting `rolloverSpan()` to concatenate and compress all the old span's items into a single DB row (either a column in the `transcriptSpans` table, or a new table called `compressedTranscriptSpans`), but we decided it wasn't necessary at the time.

I think we should reconsider that. The transcript items are really fluffy, and I bet we'd get 10x compression on them. Plus, we only have 24_201 spans (of which only 56 are current: one per vat), so we'd remove nearly five million rows from the DB, which would shrink the overhead considerably. This wouldn't speed up normal operation. It would slow down `rolloverSpan()` slightly (concatenating and compressing takes time, unless we cooked up a scheme to do it in the background). It would slow down the xsnap replay case by some tiny amount (probably swamped by the actual xsnap execution time).

## Description of the Design
The simplest approach would be to compress during `rolloverSpan()`, and make a separate table for the compressed spans (having done a couple of hasty and perhaps unwise `sqlite3 -box swingstore.sqlite 'SELECT * FROM tablename'` or `sqlite3 swingstore.sqlite .dump tablename` commands while debugging something, I'm starting to value putting the really large blobs/strings in their own table):

* a `transcriptCompressedSpans` table, with `(vatID, startPos, endPos, compressedSpan)` columns
* when `rolloverSpan()` causes a `[startPos, endPos)` range of items to fall out of `inUse` (see the sketch after this list):
  * `SELECT * FROM transcriptItems WHERE vatID=? AND position >= ? AND position < ? ORDER BY position`
  * concatenate and compress those items, then `INSERT` the compressed blob `INTO transcriptCompressedSpans`
  * `DELETE FROM transcriptItems WHERE vatID=? AND position >= ? AND position < ?`
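Pulling those three steps together, a condensed sketch of the rollover path might look like this (hypothetical helper; it leans on Node's built-in `zlib`, mentioned under Security Considerations below, and omits the hash/span-record bookkeeping the real code must do):

```js
import { gzipSync } from 'node:zlib';

// Hypothetical: compress the items of a span that just fell out of use.
function compressSpan(db, vatID, startPos, endPos) {
  const items = db
    .prepare(
      `SELECT item FROM transcriptItems
         WHERE vatID=? AND position >= ? AND position < ?
         ORDER BY position`,
    )
    .all(vatID, startPos, endPos);
  // span artifacts are newline-delimited items, so reuse that format here
  const artifact = items.map(({ item }) => `${item}\n`).join('');
  const compressedSpan = gzipSync(artifact);
  db.prepare(
    `INSERT INTO transcriptCompressedSpans
       (vatID, startPos, endPos, compressedSpan) VALUES (?,?,?,?)`,
  ).run(vatID, startPos, endPos, compressedSpan);
  db.prepare(
    `DELETE FROM transcriptItems
       WHERE vatID=? AND position >= ? AND position < ?`,
  ).run(vatID, startPos, endPos);
}
```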
Then change `readSpan` to check `transcriptCompressedSpans` for a compressed record: if present, decompress it, split on newlines, and iterate through the result. If not present, look for the (loose) items in `transcriptItems`. Any given span might be compressed, uncompressed, or pruned, and all APIs must handle all cases.

We'd also need to change `assertComplete` to accept the presence of a compressed span, rather than insisting on the items being present in `transcriptItems`. (We already have code to build the newline-concatenated string, since that's what the span artifacts look like.)
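The corresponding read path, again as a hypothetical helper that ignores the pruned case and the existing hash checks, might become:

```js
import { gunzipSync } from 'node:zlib';

// Hypothetical: yield the items of one span, whether compressed or loose.
function* readSpanItems(db, vatID, startPos, endPos) {
  const compressed = db
    .prepare(
      `SELECT compressedSpan FROM transcriptCompressedSpans
         WHERE vatID=? AND startPos=?`,
    )
    .get(vatID, startPos);
  if (compressed) {
    const text = gunzipSync(compressed.compressedSpan).toString('utf8');
    // items are newline-delimited; drop the trailing empty string
    yield* text.split('\n').filter(item => item !== '');
    return;
  }
  // fall back to the loose items in transcriptItems
  const rows = db
    .prepare(
      `SELECT item FROM transcriptItems
         WHERE vatID=? AND position >= ? AND position < ?
         ORDER BY position`,
    )
    .all(vatID, startPos, endPos);
  yield* rows.map(({ item }) => item);
}
```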
We'd need to change the export and import code as well: exports of compressed spans will do a decompression but not any concatenation, and imports of non-current spans will do a compression in addition to a split (the compressed data gets jammed directly into the `transcriptCompressedSpans` table, but of course it must still recompute the cumulative hash and verify it against the span record).

### compressing the existing data
The trickiest part is when/how to compress the pre-existing records, since it will take so long. It won't affect the consensus state, so it can be done in the background. The commit point is important, though: we can't interrupt the real work, and we can't accidentally cause the real work to get committed too early.
One option is to use a separate DB connection to perform the compression work, like we do for `makeSwingStoreExporter`. But it's critical that we don't break the real work, which is done on a transaction that is opened (in IMMEDIATE mode) as soon as the first user of `hostStorage` or `kernelStorage` calls an API that needs to write to the DB. Once that txn is opened, no other connection can open a write txn. The window of contention starts when the host calls a device input (which probably changes device state, or pushes something onto the run queue), and ends after `controller.run()` finishes and the host calls `hostStorage.commit()`. So a separate DB connection could only be used between blocks (i.e. in the "voting time"), and it would need to finish up (and commit) before the next work cycle began.

Let's avoid that (we probably want to stick with a single writable connection anyway, just for simplicity).
Instead, we could do the work with the same DB connection, as a method on `hostStorage` (next to `commit`). I'm thinking we add `p = hostStorage.doBackgroundWork(pollfn)`. The `pollfn` would behave a lot like the `runPolicy` you can pass to `controller.run()`: it gets consulted after each chunk of work to see whether it ought to continue or not. The `Promise` returned by `doBackgroundWork` fires shortly after `pollfn` returns `false`, or after the swingstore discovers there is no work left to do.

The host could do something like:
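The original snippet isn't reproduced above, but a plausible shape for it, using the proposed `doBackgroundWork(pollfn)` API plus a made-up time budget, would be:

```js
// Hypothetical host-side loop: after committing a block, spend a bounded
// slice of idle time on background span compression.
async function afterBlockCommit(hostStorage) {
  const deadline = Date.now() + 50; // ms of idle time we're willing to spend
  const pollfn = () => Date.now() < deadline; // keep going while under budget
  // resolves once pollfn returns false or no compression work remains
  await hostStorage.doBackgroundWork(pollfn);
}
```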
I bet @mhofman has some ideas here.
I'm sure the details are tricky. It would be great if we could somehow tell whether we're catching up with the chain or not (compare the current blockTime against wallclock time?), and do less or no background work until we're caught up. Maybe, for any given launch of the host: set a flag that says "we're probably still catching up"; after each block is done, start a 2s timer; if the timer expires before the next block starts, clear the flag (enabling background work); if the block starts before the timer expires, cancel the timer and leave the flag set.
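That heuristic is small enough to sketch directly (the names and the 2-second quiet period are just placeholders):

```js
// Hypothetical catch-up detector: assume we're replaying old blocks until a
// 2-second gap appears between blocks, which suggests we're at the chain head.
function makeCatchupDetector(quietMs = 2000) {
  let catchingUp = true; // start pessimistic on every launch
  let timer;
  return {
    onBlockEnd() {
      clearTimeout(timer);
      timer = setTimeout(() => {
        catchingUp = false; // no new block for a while: we've caught up
      }, quietMs);
    },
    onBlockStart() {
      clearTimeout(timer); // block arrived quickly: still catching up
    },
    backgroundWorkAllowed: () => !catchingUp,
  };
}
```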
Another option is to only perform the work during a chain upgrade, and wait for it to complete before executing the first post-upgrade block. I'll see if I can estimate how much CPU time we're talking about here: gzipping a 15GB file would take several minutes at best, even on an infinitely fast/parallelized CPU, because that's a lot of data to pull in off the disk (and it's probably scattered pretty badly over the surface). On the plus side, we could probably build an API to provide decent progress information (count the number of non-`inUse` spans, count the number of compressed spans, assume they all need compression, and add a `progressCallback` option that is called with a few numbers after every chunk of work). If we're going to ask validators to tolerate a 5-minute delay at upgrade time, we should give them a progress bar. Also, we should commit every once in a while (so compression progress isn't lost if they get impatient and reboot).

If we go this way, we could also consider providing a standalone tool to perform the compression, for use if/when a validator decides the recovered disk space is worth some downtime. Such a tool could also do a VACUUM of the DB afterwards, because the actual file on disk won't shrink without one (freeing up all that space will give SQLite a huge bunch of free pages, so the file wouldn't grow beyond the starting 17.6 GB size for months or years, until all those free pages were consumed). Even if we do the compression work automatically or in the background, we might want to let validators know how to run a VACUUM afterwards, to concretely/immediately benefit from the reduced space.
We could also consider never attempting to catch up: compress new spans, but leave the old ones unpacked. That would reduce the swingstore growth rate significantly (probably 10x), but wouldn't reduce the size of existing data. But... a state-sync import would start out with compressed spans. So instead of figuring out a background compression scheme, we just tell validators who want that 17GB back to restore from a state-sync snapshot. And that skips the need for a VACUUM too, since they'll be creating a brand-new SQLite DB, so its container file will be minimally sized.
I think I like this last option, it would be waaay easier.
## Security Considerations
Compression doesn't invalidate the cumulative hash we use, and the compressed spans are hash-verified to match the expectations established by the (trusted) export-data records during import. So nothing about compression should threaten the integrity properties of the old transcript items.
We must make sure our "background" scheduling of the compression work doesn't threaten real work: getting that wrong could cause a validator to crash as they try to use a DB that's locked.
The compression code we choose (Node's built-in `zlib` module, which almost certainly wraps the standard `libzlib`) will be processing the contents of transcript items, which include attacker-controlled strings (method arguments or `vatstoreSet` values). If `zlib` has a memory vulnerability, we'd be exposing that to the attacker. However, we're already doing that by virtue of compressing heap snapshots (which also contain attacker-controlled `Uint8Array` buffers and strings), so I don't think this is any more vulnerable than before.

## Scaling Considerations
This is all about improving the scalability: reducing the storage requirements for validators and follower nodes by perhaps 25% (and the other 75% is from cosmos and tendermint, and can be reduced by periodic state-sync or block/txn pruning settings). The tradeoff is somewhat increased execution time (to compress recently-generated data as it falls out of the current span), and the need for a large chunk of time to catch up on the compression of all that pre-existing data.
## Test Plan
Swingstore will have unit tests to demonstrate that:

* `rolloverSpan()` compresses the old span and creates a new one
* `isComplete` returns the same value whether a span is compressed or not
* the `pollFn` thing works

Some of these tests may require manually constructing the DB to read from (i.e. non-compressed historical spans, which the new code would not normally generate).
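As a rough illustration of the first two bullets, a unit test might look something like this (AVA is what the existing swing-store tests use; the `transcriptStore` method names below are from memory and should be checked against the real API):

```js
import test from 'ava';
import tmp from 'tmp';
import { openSwingStore } from '@agoric/swing-store';

test('rolloverSpan compresses the old span and starts a new one', t => {
  const { name: dir } = tmp.dirSync({ unsafeCleanup: true });
  const { hostStorage, kernelStorage } = openSwingStore(dir);
  const ts = kernelStorage.transcriptStore;

  ts.initTranscript('v1');
  ts.addItem('v1', 'item-0');
  ts.addItem('v1', 'item-1');
  ts.rolloverSpan('v1'); // old span should now be stored compressed
  ts.addItem('v1', 'item-2');
  hostStorage.commit();

  // reading the historical span must yield the same items as before
  t.deepEqual([...ts.readSpan('v1', 0)], ['item-0', 'item-1']);
});
```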
Then we'll need tests on the cosmic-swingset-side code which drives the catch-up work, if we choose to build it.
## Upgrade Considerations
The swing-store SQL schema will change: the new store will have an extra table, or perhaps an extra column on the `transcriptSpans` table. If we go with an extra table, we can use our existing `CREATE TABLE IF NOT EXISTS` approach to automatically add the table on the first launch with the new version of `@agoric/swing-store`. If we use a new column on the existing table, we must first implement the #8089 schema upgrade/migration plan.

The schema changes are one-way: after running with the newer version, you cannot revert to an older version. The older code would not look in the new `transcriptCompressedSpans` table, so it would think that the current-incarnation historical spans are missing entirely, which would cause state-sync exports to fail (the `isComplete` check would throw). And, if we do wind up performing `xsnap` upgrades with full-incarnation replays, those replays would fail to find the early transcript entries they need.