Implementing snapshotters of consensus data for SwingSet state sync

What is the Problem Being Solved?

Refs: #3769

We need a simple but upgradable mechanism to publish snapshots of SwingSet state that are rolled up into the AppHash so that they can be verified within consensus.

Description of the Design

The ExtensionSnapshotter is an interface for external data to be snapshotted and published. CosmWasm's wasmd uses it for external zip files.

There are three ExtensionSnapshotters we will partition the job into:

VatTranscriptSnapshotter
- serves a snapshot with each vat's transcript in a separate chunk, with a format like (part-length part-data cumulative-hash)*
- stores the previous block's offset and cumulative hash value in the multistore indexed by vat ID
- restores a snapshot by processing and verifying the cumulative hash values as they stream in
KernelSnapshotter
- serves a snapshot with (to an approximation) each vat's data in a separate chunk
- streams its per-committed-crank changes to a Merklizer starting at BEGIN_BLOCK
  - sends a 'commit and read hash' message in COMMIT_BLOCK, with an optional flag to snapshot the content
  - writes the committed hash into the multistore before beginning new work in the next BEGIN_BLOCK
- restores a snapshot by repopulating both an empty Merklizer and an empty LMDB, verifying content proofs as they go, as well as the committed hash against the multistore
VatHeapSnapshotter (a later optimisation phase)
- serves a snapshot with chunks corresponding to the previous block's snapshot for each vat
- stores the content hash of the previous block's snapshots in the multistore indexed by vat ID
- restores a snapshot by writing the content to a content-addressable filename, verifying the hash in the multistore

This design can be implemented (albeit inefficiently) by using the multistore as the Merklizer. A later optimisation would be to have a (potentially external) Merklizer such as a pipe to a separate process containing Penumbra's Jellyfish Merkle Tree.

A key element of the design is to update the multistore only with hashes from the prior block. This takes advantage of available concurrency during the (slow) commit_timeout and voting process, as well as being reasonably prompt. Ideally, there would be very little additional overhead to generate and save the prior block's hashes, and the main cost would be opt-in to save the corresponding state snapshot periodically.

Until this design is fully implemented and tested, it should be kept behind an agd start command-line option.

Security Considerations

Improved security by putting more of our data under the aegis of consensus.

Test Plan

Performance testing will be key.

I was thinking about this today, before rereading the design above, and wondered if the following would work (maybe simpler, maybe less efficient).

My understanding of the State-Sync protocol

(from https://docs.tendermint.com/master/spec/abci/apps.html#state-sync and https://docs.tendermint.com/master/spec/abci/abci.html#state-sync)

the state-sync protocol is implemented by Tendermint, with support by the ABCI application
a "snapshot" is a list of sub-16MB chunks (bytestrings), plus a "snapshot record"
a "snapshot index" is a block height and a format ID/number
a "snapshot record" is (height, format, numChunks, "hash", and a metadata bytestring
- the "hash" is for the benefit of the app; tendermint doesn't do anything special with it
- the whole record must serialize to less than 4MB, which limits the metadata size
nodes aren't obligated to create/offer snapshots for any particular index, but any that do should offer the same one (there should be an imaginary global table mapping index to snapshot, with exactly one entry per index, and each node either hosts that one or it doesn't)
when a new node comes up, its Tendermint (if enabled) asks existing nodes for snapshots
- in existing nodes, that causes the Tendermint layer to send ListSnapshots over ABCI to the app, which returns a list of snapshots it has to offer. Then TM sends these offers back to the new node.
- in the new node, Tendermint gathers these offers, sorts them by internal preference criteria, and offers them over ABCI to the new app one at a time with OfferSnapshot
- the new app can accept or reject each one (with varying levels of prejudice)
- OfferSnapshot includes the verified AppHash, in case it helps with the decision
- if the app rejects a snapshot, it might not get another: it doesn't get the full list ahead of time
once the app accepts a snapshot, the new Tendermint asks all old nodes that hold that snapshot for chunks (by number)
- the old nodes' Tendermint uses LoadSnapshotChunk(snapshotIndex, chunkNum) over ABCI to fetch chunks from the old nodes' app layer
- the new node's Tendermint assembles the chunks into a tempdir, and then feeds them into the new app (in sequential order, no gaps) with ApplySnapshotChunk
- Tendermint does no validation of the chunks, leaving that up to the app
after the new app receives the last chunk, Tendermint uses the ABCI Info request to ask the app for LastBlockAppHash, which it compares against the light-client-verified chain value
at this point the app ought to be ready for new transactions, starting with the block height after the snapshot was taken

I don't see anything in the docs suggesting that the snapshot-record hash field (nor anything in the snapshot data) is compared against chain data by Tendermint. The config.toml file has a [statesync] section with a trust_hash= field: I'm guessing that means Tendermint will only accept snapshots with a matching hash. And it has rpc_servers= and trust_height= fields, which tells me that it doesn't "ask around", it only asks the named servers, and only for snapshots at the specific height. So I think state-sync can only be safely initiated manually by the owner of the new node, and their integrity relies upon whoever they chose to get a trust_hash= from.

Which.. makes our job somewhat easier: we don't need to find a way to put the snapshot hash into the verified chain state (which would 1: require all validators to build snapshots, at the same heights, so they have a consensus opinion about the contents, and 2: either require post-snapshot blocks to wait for the snapshot to be built, or do something funky like insert the hash of snapshot-for-block-100 into the state of block-200 by assuming 100 blocks is enough time for everyone to compute it, or else).

For integrity, trust_hash must cover the entire snapshot data contents. To prevent one bad chunk from poisoning the whole snapshot, we must be able to validate each chunk as it arrives (so ApplySnapshotChunk can reject it, so Tendermint can find a better copy). One option is for the chunks to include merkle proofs that trace up to the trust_hash. Another is for the snapshot record's metadata field to hold a flat list of hashes, one per chunk, and then trust_hash is a hash of that metadata field.

Applications could treat the snapshot as one big slab of data (with some internal structure to be parsed out), split into 10MB-ish chunks for transport efficiency and to meet the < 16MB serialized size rule. Or they could use different chunks for semantically different units of state.

cosmos-sdk certainly has opinions about these choices already, and is probably optimized for managing its IAVL tree. So our extensions must fit within the structures cosmos is already using.

Snapshotting Package

The snapshotting code either lives in cosmic-swingset or in a new cosmic-swing-store-snapshotter package
SwingSet does not know about snapshots. We enhance the packages/swing-store APIs to support snapshotting but it doesn't know about snapshots either.
The snapshotting code is given a persistent directory to work with. This contains subdirectories for each complete snapshot, and for each in-progress snapshot. The library can return a list of complete snapshots. The subdir names are chosen by the client (and should just be a block height).
When instructed to create a snapshot with a given name, the snapshotting code makes a tempdir and populates it with some set of files. Once complete, the tempdir is renamed to the requested name. At that point, the snapshotting code will report the existence of the snapshot, and can return numbered chunks and metadata fields on demand.

Simplest Possible Design

I think the simplest thing that could possibly work would do the following at snapshot creation time:

cosmic-swing-store-snapshotter creates snapshots synchronously, blocking the caller until complete
it iterates through all three portions of swing-store (kvStore, snapStore, streamStore) in turn
each iterator emits a stream of key/value pairs, both strings
- for kvStore, we skip the local.* keys that are specifically excluded from consensus
- for snapStore the key is the hash-based ID of the snapshot file, and the value is the uncompressed contents of the snapshot, since the on-disk format is gzip-compressed, and gzip adds a non-deterministic local timestamp
- for streamStore, the key is a composite of the vatID and the transcript index (starting at 0)
the key/value pair is turned into a pair of netstrings, which are appended to the snapshot file
a third netstring is appended with a simple command: one of "add to kvStore, add to snapStore, add to streamStore"
- this forms a simple stack-based DSL, sort of, where all operations take exactly two arguments
once complete, the snapshot file is broken up into 10MB chunks
- either we store each chunk into a separate file (and the chunk number is used to form its name), or we use a single huge file (and the chunk number indicates a chunkNum * 10MB offset within it)
each chunk is hashed, and a (relative chunk number, hash) pair is prepared
the swing-store chunks occupy some contiguous set of chunk numbers in the overall (cosmos-managed) snapshot
the swing-store per-chunk hash records are included in the overall snapshot metadata record
the metadata must also remember what range of chunks are for swing-store, and which go to cosmos

On snapshot-loading time (on a new empty node), cosmic-swing-store-snapshotter is first given the metadata hash records and an empty directory to build the new kernel's swing-store instance (specifically ~/.agoric/data/ag-cosmos-chain/`). Then it is fed chunks, in order. For each chunk, it hashes the chunk data and verifies it against the metadata. If this fails, the chunk is ignored, and the "please get a better one" error is returned (so Tendermint can try someone else for that chunk, without yet abandoning the entire snapshot).

If the chunk is accepted, the stream of netstrings are fed into a parser (remembering that individual netstrings might be split between chunks). As the individual key/value/command blobs emerge, they are processed (hold key, hold value, write to DB, forget both). When the last chunk is written, the caller should indicate completion, whereupon cosmic-swing-store-snapshotter flushes the DB writes and closes its handles to the DB. The caller can then use openSwingStore as if the data had been present the whole time.

Simple Improvements

The stream of netstrings should be deterministically compressed before being split into chunks. Both kvStore and streamStore are really fluffy text records that usually compress by about 3x, snapStore uncompressed XS heap snapshots compress by about 4.5x. This could use the standard zlib library (not /usr/bin/gzip, because timestamps, although gzip --no-name appears to suppress that, which would make debugging easier).

The first netstring record should be some sort of format indicator. From a DSL point of view, it is an opcode that instructs the entire machine to behave like a v1 machine, leaving the "behave like a v2 machine" opcode to be defined and implemented later.

The key/value records can be processed more efficiently by not including any type information (the output of decodeNetstring is exactly the bytes of the key or the value, not a "this is a key" type byte followed by the actual key, which might require an extra copy, or maybe JS TypedArray views can avoid that). This limits our v1 machine to strict triples of [key, value, opcode].

We might include a final netstring record to finalize the new swing-store, rather than relying upon the caller to tell us when the last chunk has arrived. Given the previous restriction, this would have to be encoded as ['', '', finalizeOpcode].

Drawbacks

iterating the entire DB might take too long to perform between blocks: the next block might be delayed by an uncomfortable period
- OTOH, it won't be validators creating snapshots, it will be a dedicated service, so maybe it's ok if they fall behind for a few minutes while they build the snapshot
creating a later snapshot will take as much time as creating the first one; no information or work is shared between them
clients can only install a full snapshot: the format does not enable e.g. fetching+validating just a single vat's worth of data

These don't sound too bad to me. The blocking time could be reduced by making a clever copy of the data quickly, then doing the dump slowly later (maybe an OS-provided copy-on-write filesystem scheme, or we do something with LMDB to get a read snapshot transaction and hold onto it until the kvStore scan is complete). Snapshots are usually taken far enough apart (once a day?) that there won't be a huge amount of overlap between them, reducing the potential rewards of a more incremental approach. And if/when we figure out how to take advantage of more fine-grain-edly-verifiable data, we can create a new snapshot format that breaks the merkle tree out into a better shape.

XS Heap Snapshots

To make this worthwhile, we really need to include XS heap snapshots in the state-sync snapshot. That means:

move the local.snapshot.${snapshotID} = [vatsUsingThatSnapshot] keys out of local. and into consensus
move the local.${vatID}.lastSnapshot = { snapshotID, startPos } keys out of local. and into consensus
hope that XS heap snapshots are really deterministic now

I just read through the ADR you linked and some of the comments around the design. I see that the cosmos-sdk hooks for registering snapshotters is how cosmos "expresses its opinion" about the format of the Tendermint snapshot data. I'm still trying to get my head around how that interacts with the IAVL tree and/or the multistore, and how the swingset data wants (needs?) to have a hash commitment in the cosmos state vector.

Does the ADR's "for determinism we require that the hash of the external data should be posted in the IAVL tree" statement mean that all nodes (validators in particular) must compute the snapshot components, so they can have consensus on the contents? And does that imply that those components are continually hashed into the IAVL state, so that there isn't a big synchronous delay when a snapshot is made? And does that statement suggest that they expect clients of state-sync to not need to rely upon the trust_hash= provider for the integrity of their snapshot data? (that'd be a pleasant surprise)

I did a little reading in case we need to go for a non-IAVL tree approach and instead need to generate a blob of SwingSet data for which we include only the hashes in the IAVL tree.

The premise is that we'd have a SwingSet snapshot config in consensus (unlike regular cosmos snapshots which are per-validator):

SwingSetSnapshotInterval: number of blocks between full snapshots
SwingSetSnapshotDelay: number of blocks between snapshot initiation and snapshot promotion

The idea is that cosmic-swingset would request a snapshot be initiated at multiples of SwingSetSnapshotInterval, but not finalize the content of the snapshot until SwingSetSnapshotInterval*i + SwingSetSnapshotDelay. That leaves the opportunity for SwingSet to create a dump of its state at the requested block without stopping the world. This relies on the fact that virtually all DBs support a read transaction mechanism that can run concurrently with write transactions without being affected by them.

To account for changes after the snapshot was initiated, the final dump will also need to contain a log for all the changes to the state dump since it was initiated, the same way Write Ahead Logs work for DBs, with the difference that all this data needs to be deterministic.

Once finalized the hashes of the state dump + log can be included in the IAVL tree for state sync consumers. We can then generate incremental write logs for every subsequent blocks, and adding their hash to the IAVL tree, so that any block can become a state sync source. During the time where a new snapshot has been requested but not yet finalized, these incremental logs for the "current" snapshot would keep being generated.

There is unfortunately a wrinkle introduced by a crash happening between when a swingset snapshot was requested and before it's finalized. When restarting, a new read/dump transaction would start at the last committed block instead of the previously requested block, yet the node is still responsible to produce a verifiable snapshot (or at least it must include its hash in the IAVL tree). It may be possible to handle this case through explicit SQLite snapshots, which are basically a way to reference an existing read transaction, and apparently can be recovered after an unexpected close. In particular it seems that the sqlite_snapshot struct is really an opaque copy of WalIndexHdr, which seem to contain only stateless data (since the idea is to be able to use this snapshot struct with another connection in another thread/process).

The actual DB dump can be generated a variety of ways (.dump, enumerate all entries manually, etc.), but it may be faster and more reliable to start with a VACUUM INTO operation to limit the amount of time the read transaction stays open.

Re-reading this issue and comments with the updated understanding I now have of state-sync.

And it has rpc_servers= and trust_height= fields, which tells me that it doesn't "ask around", it only asks the named servers, and only for snapshots at the specific height. So I think state-sync can only be safely initiated manually by the owner of the new node, and their integrity relies upon whoever they chose to get a trust_hash= from.

My understanding is that this is not fully accurate. The trust_height and trust_hash are the blockHeight and app hash for that block. I believe these are only used as a trust anchor and don't need to correspond to the height that will be state synced, and the height that will be synced may be a newer block. I think the RPC servers are used to verify any newer block's info.

Given that, the need for the snapshotted data to be verifiable, and snapshots being a per-node configuration, I do not believe a synchronous, stop the world when making a state-sync snapshot approach would work.

The approach we're now considering is somewhat close to @michaelfig's original post, which I will detail in a follow-up comment.

First, I am not yet familiar enough with ExtensionSnapshotters or the tendermint state sync protocol, so the following may be slightly inaccurate. I'll ask that @michaelfig or @JimLarson correct any mistake.

I also did not fully understand @michaelfig's detailed approach in the original post, in particular how the hashed data would only be included in a subsequent block, as from what I understand, we would then have a Swingset state lagging by a block when state-syncing.

The Swingset data falls roughly into 3 categories:

lots of small data that partially mutates every block (kvStore, aka kernel + vat stores)
data that is appended only, with occasional truncation (streamStore, aka vat transcripts)
a few large data chunks that are less frequently updated (snapStore, aka XS heap snapshots)

The following approach details how Swingset state can be split between a main artifact of a merkelized BD snapshot, and related blob artifacts that can be generated on demand, all while verifying the restored content through the root of the merkelized DB which would contain necessary hashes to verify the related blobs.

kvStore

For the first category, the kvStore, the primary option is to shadow the writes into a merkelized DB which would be part of the multistore under its own module/component, so there is no extra complexity in verifying the data and generating snapshots. We believe the performance impact of doing so will be acceptable, and #5934 is the issue to track verification of this assumption. The alternative is somewhat more complicated and partially detailed earlier in https://github.com/Agoric/agoric-sdk/issues/5542#issuecomment-1308023356.

The writes into this cosmos level DB can either be buffered until finish of END_BLOCK or streamed intermittently after every committed crank. In either case, we'll need an API to the swingStore to get the list of kvStore changes (sets/deletes) since the "last time" (#6562).

Restoring from this cosmos DB would simply mean enumerating all keys/values and recreating entries in the kvStore from them. This mapping may become somewhat more complicated if/when the kvStore changes to contain more structured data that is not longer simple key/value pairs, or when we get separate vat stores and kernel stores.

snapStore

We will include the hashes of the XS heap snapshots in this merkelized cosmos DB, and share each heap snapshot independently as verifiable external data (with possible chunking support in case these snapshots are larger than the 20MB state-sync chunk size limit).

XS heap snapshots are expected to be stored as blobs in SQLite (#6742), with each entry in the table containing the vatId, upgrade/incarnation number, start position in the transcript, compressed blob of the snapshot, and hash of the uncompressed snapshot. I envision pruning from this table to be a node specific configuration, with some keeping only the latest snapshot of each vat, others keeping all historical snapshots, and possibly any other variation in between.

At the end of every block, the heap snapshot hashes would be included in the cosmos side DB, but not the snapshot blobs themselves. We can decide on 3 different levels of hashes kept in the DB: keep all historical hashes, keep all hashes since the last upgrade of the vat, keep only the most recent hash of each vat. Regardless, we will need an API from swingStore to receive the new snapshots generated since "last time".

When a state-sync snapshot is requested for generation, the necessary snapshot blobs would be extracted from the swingStore. If the snapstore is implemented in SQLite, a read transaction could be used to generate the artifacts while Swingset continues making progress. This would work regardless of the pruning configuration of the snapstore (aka a host can be configured to keep only the last heap snapshot, and the state-sync artifacts can still be generated even if the snapstore entry gets deleted in the mean time).

Restoring would simply recreate the snapStore table from the snapshot artifact and data included in the cosmos DB (hash, vatId, incarnation number, startPos). Validation would simply mean decompressing the snapshot artifact and comparing its hash with the one recorded.

One issue is the state-sync artifact needs to itself hash consistently between nodes. For this we will need to ensure all nodes compress the snapshots in a deterministic way.

streamStore

Since the transcripts are append-only, we do not need to include their full content in the Merkelized DB, but instead can use a hash chain of new transcript entries per vat. Similarly to the snapStore data, the artifacts themselves can be generated on demand.

Transcripts are stored in the unified SQLite DB used by Swingset. Each entry records the vatID, incarnation/upgrade num, delivery num since upgrade, delivery made, syscalls received, and each syscall results. While only the deliveries since the last snapshot taken are necessary to get back up, I expect nodes to store in this DB all transcript entries for the latest incarnation of a vat, so that all nodes may be able to participate in a manchurian style upgrade (#1691). It's also possible some node may keep the full transcript of all previous incarnations for archival purposes.

Every time an entry is added to the transcript for a vat, a new hash is computed, composed of the previous hash and the normalized data in the transcript entry. At the end of the block, the latest hash is recorded in the cosmos DB. This is similar in how the activityhash works, but restricted to transcript entries for each vat. Similar to the snapStore, we have 3 levels of "starting points" for this chained hash: the first incarnation of the vat, the last upgrade of the vat, the last snapshot of the vat.

The most flexible approach is to start a new hash chain (segment) for every snapshot taken, but store in the cosmos DB the hash of previous transcript segments. Each transcript segment would be a separate state-sync artifact that can be independently verified. We can then easily change if all transcript entries since upgrade are synced, or only the last segment since the latest snapshot. We need an API from the swingStore to get all segment hashes modified since the "last time".

When the node requests a state-sync snapshot, we generate the actual segments from the streamStore. If the node truncates the streamStore on vat upgrades, we may need a read transaction on the streamStore to generate the state-sync artifact while allowing forward progress by Swingset.

Restoring would mean recreating the streamStore table from the different segments. Verifying the artifact is valid however requires the chain hash to be recomputed from the individual entries in the segment, then compared to the chain hash of record.

Agoric / agoric-sdk