Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

swing-store export/restore API (for state-sync) #6773

Closed warner closed 1 year ago

warner commented 1 year ago

What is the Problem Being Solved?

To support agd "state-sync" (#3769, #5542, #5934), we need the cosmos-side IAVL tree to contain enough information to restore a copy of the Swingset swing-store DB. For some data, we can store an exact copy in the IAVL tree: this uses extra disk space for the redundant copy, but we get state-sync distribution and validation for free (cosmos already knows how to publish and validate the IAVL contents). For other data, we can store a hash in the IAVL tree, and supply a hashed artifact later.

To make this work cleanly with the swingset/swing-store architecture, we're talking about an "export/restore" API for swing-store. Just like how SQLite has a .dump command that exports the entire DB in a simple text format (SQL commands which can be restored by .read), swing-store will have an API that lets you export the contents in a format that is convenient for the host application to store in a different DB, and/or distribute with artifacts later.

If speed/space performance were no concern, the design would be swingStore.export(directoryWriter), which would be given an authority to write arbitrary files to a specified directory tree. This directory full of files would contain the complete contents of the kvStore, the streamStore, and the snapStore. Then initSwingStore(dirPath) would grow a companion API named importSwingStore(dirPath, directoryReader), that would take the directory of files and produce a new (but fully populated) swingStore DB from the previously-exported contents.

(if we didn't care about determinism either, we could just run sqlite3 swingstore.sqlite .dump >export.sql and call it a day)

We might still include .export for testing or other use cases, but our chain imposes some constraints which would make that impractical for normal use. The cosmos state-sync design requires modules to commit to their contents for every block, even though state-sync export happens on a much slower schedule (perhaps once per day). The state-sync contents must be fully validatable from the block headers (i.e. the IAVL root hash). For chains that keep all of their state in IAVL, this happens automatically, but when modules maintain data outside of IAVL, they're generally required to record a hash of that data into IAVL and then be prepared to produce an artifact (blob) that can be validated by that hash. These artifacts are requested right away, immediately after the IAVL commit, but the module is allowed to take a long time to produce them, and the production work runs in parallel with ongoing block production, so the chain does not slow down. Each state-sync provider is allowed to produce these snapshots on their own schedule, outside of consensus.

In addition, our cosmic-swingset layer calls the swing-store commit() method outside of the context of a block (after it calls the endBlock() method), so it can no longer perform IAVL writes at that point. This rules out the most natural approach, where cosmic-swingset would do:

  await swingStore.commit();
  const artifacts = swingStore.export(); // or equivalent
  iavl.write(artifacts);

Data Validation

IAVL is a Merkle tree, and constantly updates its root hash. This means everything in IAVL can be easily validated against the root hash, which is included in each block header as the AppHash (more or less). So state-sync clients can fetch a copy of the IAVL data from an untrusted provider, populate a new IAVL instance with it, compute the root hash from that data, and then compare it against the consensus-verified AppHash. Clients do not proceed unless that hash matches, at which point they can rely upon the IAVL contents.

Cosmos state-sync provides a way for modules to publish additional artifacts in the state-sync snapshots (#5542). The requirement is that clients will be able to validate alleged copies of these artifacts upon receipt. Clients will fetch both the IAVL data and the other artifacts, then they'll verify the IAVL root hash, then they must verify the other artifacts against data stored in the IAVL tree. Note that this doesn't have to be an exact hash of the artifact: the artifact might be compressed, or formatted differently, and the client may need to unpack or rearrange it before verification can happen (and before it can be used). The real requirement is that the unpacked form matches the data approved by consensus, so that an attacker cannot inject invalid data by supplying a malicious artifact.

For example, the swingset transcript store (streamStore) is constantly appending entries to the most recent span (the deliveries made since the last XS heap snapshot). We'll maintain a rolling hash of these entries:

const hash_0 = '';
const hash_1 = sha256(hash_0 + sha256(delivery_0));
const hash_2 = sha256(hash_1 + sha256(delivery_1));
const hash_3 = sha256(hash_2 + sha256(delivery_2));

This takes constant time to update, whereas a hash of the entire (growing) span would cost O(N). The IAVL tree records the most recent hash_N value, updated on every block. The artifact is the list of [delivery_0, delivery_1, ..] entries. The client receives the IAVL tree (and validates it), then examines the delivery-list artifact. The client computes hash_0, hash_1, .. hash_N (which takes O(N) time), and compares the result against the IAVL-provided value. At the end of the process, the client knows that all delivery_0, .. delivery_N entries are correct, even though nothing computed sha256(delivery_0 + delivery_1 + ..).
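
A minimal sketch of that client-side check, assuming Node's crypto module and hex-encoded SHA-256 digests (the actual encoding and hash construction are up to swing-store):

  import { createHash } from 'crypto';

  const sha256 = data => createHash('sha256').update(data).digest('hex');

  // Recompute the rolling hash over the delivery-list artifact and compare it
  // against the hash_N value recorded in the (already validated) IAVL tree.
  function verifyTranscriptSpan(deliveries, expectedHash) {
    let hash = '';
    for (const delivery of deliveries) {
      hash = sha256(hash + sha256(delivery));
    }
    if (hash !== expectedHash) {
      throw Error('transcript span does not match consensus-approved hash');
    }
  }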

The swingset XS heap snapshot store (snapStore) records tuples of (vatID, startPos, heapSnapshotData, snapshotID), where the data is a compressed blob, and the ID is a hash of the uncompressed blob. We'll record the (vatID, startPos, snapshotID) tuples in the IAVL tree, and we'll use the compressed blob as the published state-sync artifact, published using the ID as a filename/artifact-name. The client must determine which snapshots are expected (one per vat, with the highest startPos), and for each of those, it should look for an artifact with the matching name. It must then decompress the artifact, hash the results, and validate that the results match the expected ID. Then it should recompress the decompressed data (to remove any sneaky attacks or variance arising from the compression format), and store the newly compressed data into the snapStore.
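
A rough sketch of that client-side procedure, assuming gzip compression and hex SHA-256 snapshot IDs (the real snapStore formats may differ):

  import { createHash } from 'crypto';
  import { gunzipSync, gzipSync } from 'zlib';

  function validateAndRecompress(compressedArtifact, expectedSnapshotID) {
    const uncompressed = gunzipSync(compressedArtifact);
    const actualID = createHash('sha256').update(uncompressed).digest('hex');
    if (actualID !== expectedSnapshotID) {
      throw Error(`snapshot hash mismatch: ${actualID} != ${expectedSnapshotID}`);
    }
    // recompress locally so we never store attacker-supplied compressed bytes
    return gzipSync(uncompressed);
  }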

Description of the Design

Swingstore will be responsible for providing "exports" of its contents, when requested, at boundaries that correspond to blocks (one sampling point per commit() call). It will define an "export directory format", which is how the contents can be expressed as a directory full of files. This format is entirely up to swingstore (opaque to outsiders). It should be versioned: later versions of swingstore are not obligated to accept older exports, but they should error out cleanly, without risk of corrupted/confused data.

Swingstore will also define an "incremental export format", with similar opaqueness/versioning characteristics. This format will consist of the "export key-value pairs" and a set of named artifact blobs. The "export directory format" should be trivially convertible to/from this incremental format. The export keys are likely to be derived from the kvStore/snapStore/etc keys, e.g. kvStore.set('foo', 'bar') might result in an export key name of kvStore.foo. However, the swingstore is free to use whatever key names it likes, and it is likely to produce a lot of keys that do not directly correspond to single entries in the various swingStore components, such as validation hashes and metadata about which artifacts are required.

To write the export directory format, a new swingstore.export(exportPath) API will be added (or something that takes a suitably-limited write authority). This can be called at the appropriate time (outside of the "window") and the swingstore will immediately (and perhaps synchronously, TBD) write out the contents.

To import the directory format, a new importSwingStore(ssPath, exportPath) API will be added as a module-level export (a sibling of openSwingStore/initSwingStore).

Incremental Export

When opening a swingstore (openSwingStore/initSwingStore), we'll add an option that enables the export feature, which allows the store to avoid work if the feature is not enabled. It also turns on an assertion to make sure the contents are not changed outside of the open/close window described below (to ensure that all changes are included in the export data). The value of this option will be a host-application-provided callback function (working name dataExportCallback) that receives the incremental export data.

Then we'll add a pair of swingstore APIs, working names are openExportWindow and closeExportWindow (but obviously TBD). The host application is required to sandwich their swingstore and kernel usage in the following pattern:

  • ss.openExportWindow()
  • interact with devices, push things on run loop, make one or more calls to controller.run(), etc
  • ss.closeExportWindow()
  • ss.commit()
  • optional call to ss.getExportArtifacts()

The open call simply sets the internal flag which says "modifications are allowed now". All normal swingstore APIs (kvStore.set, etc) will assert that this flag is true, with an error message pointing the user to call openExportWindow.

While changes are being made, swingstore might call dataExportCallback(pairs) with a list of [key, value] or [key, undefined] pairs (we use undefined to delete a key). Both key and value will be strings. These represent the "export key-value pairs" from the swingstore incremental export format. The host app is responsible for accumulating all these pairs (deleting when appropriate) and including them in the state-sync data.
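
For example, a host might accumulate the pairs like this (a sketch; dataExportCallback is only the working name from above):

  const exportData = new Map();
  function dataExportCallback(pairs) {
    for (const [key, value] of pairs) {
      if (value === undefined) {
        exportData.delete(key); // [key, undefined] means "delete this key"
      } else {
        exportData.set(key, value);
      }
    }
  }
  // exportData now holds the current "export key-value pairs" to be included
  // in the state-sync data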

The closeExportWindow() function clears the internal flag, and may be an opportunity for swingstore to finalize internal data structures, performing some last calls to dataExportCallback. The kernel is not allowed to make swingstore calls after closeExportWindow is invoked, and the swingstore must not invoke dataExportCallback after closeExportWindow returns.

If the host app determines that this particular block is the right time to produce a state-sync artifact, it will call ss.getExportArtifacts() after ss.commit(). It is obligated to call getExportArtifacts before the next call to openExportWindow, otherwise swingstore is free to discard data, making it impossible to recover those artifact blobs. Likewise getExportArtifacts must be called before the process terminates: swingstore is allowed to hold export pointers in ephemeral RAM that are not included in the durable database. In general, swingstore will seek to discard data as aggressively as possible, and getExportArtifacts is how the host-app signals that it needs some data to be retained long enough to be put into a state-sync artifact.

getExportArtifacts will return (TBD) an iterator of [name, dataIterator] pairs. The name corresponds to a filename in the export directory format, and the dataIterator should yield a binary blob (in reasonably-sized chunks) whose contents should be written to that file. The host application is not obligated to record these artifacts in that fashion; however, the restore process will expect them to appear in a directory in this format.
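
A sketch of consuming that result and writing the export directory format (ss and exportPath are assumed from the surrounding context; the iterator shape is the TBD sketch above, not a final signature):

  import fs from 'fs';
  import path from 'path';

  for (const [name, dataIterator] of ss.getExportArtifacts()) {
    const fd = fs.openSync(path.join(exportPath, name), 'w');
    for (const chunk of dataIterator) {
      fs.writeSync(fd, chunk); // each chunk is a reasonably-sized binary blob
    }
    fs.closeSync(fd);
  }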

The artifacts will include one entry for each vat which holds the most recent heap snapshot (validated by an export-key entry with the hash of the uncompressed contents, plus vatID/endPos metadata), plus an entry for the span of that vat's transcript entries since the snapshot point (validated by the cumulative hash of entries within the span). It might also hold an entry for each earlier span; we're still TBD about whether to include the historical ones or not, a tradeoff between state-sync size / new-client startup time, versus retaining the ability to do larger-scale replay-based upgrades without first consulting an archive node (possibly unavailable) to fetch the missing-but-hashed spans.

Import Time

When a new validator wants to start from a state-sync snapshot, the cosmos side will fetch the IAVL tree and all artifact blobs that were created at export time. It will then call into per-module hooks to offer them access to these blobs. The x/swingset module hook needs to populate an export-directory with all of this data, then call importSwingstore() pointing at the directory. Once complete, we should have a fully-populated swingstore, and we can launch a kernel against it (skipping initializeSwingset(), also skipping the bootstrap block).

Alternatively, we might build an incremental import API, to match the incremental export API. In this approach, importSwingstoreIncrementally() might be given an iterator of export-key-value entries to start with. It would drain the iterator, populating kvStore, but also accumulating a list of other data that it needs, including a list of transcript and heap-snapshot blobs. The API would also be given a callback which swingstore could use to ask for artifact blob contents. The cosmos-side x/swingset and/or the cosmic-swingset JS code would expect swingstore to pull all the blobs that were referenced by export-key-value entries, and to perform validation of those blobs before writing them into the streamStore and snapStore tables. (A lot of this depends upon how exactly the cosmos state-sync hooks work.) Having an incremental API would remove the need for some code outside of swingstore to understand enough about the directory format to populate one, preserving the opacity of that format.

Security Considerations

The most important consideration is that the data written into the swingstore is fully validated against the root AppHash which was verified against the chain (approved by voting of the right set of validators). The state-sync artifacts contain alleged copies of this data, with various formatting changes, but the import process is responsible for verifying the contents before they are written into SQLite.

There are lots of opportunities to fail here: plenty of systems have accidentally trusted a complicated transfer format without realizing what sorts of attack vectors they've opened up, especially because a missing validation check doesn't cause visible functional problems. There's no good way to automatically test for this: it requires careful design and careful auditing.

The swingstore import API should return a Promise that rejects if any validation check failed. It should probably also delete the partially-initialized DB. In addition, we should not write anything into SQLite until after that piece has been validated.
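
A sketch of what that contract looks like to the host application, using the importSwingStore(ssPath, exportPath) shape proposed above:

  try {
    await importSwingStore(ssPath, exportPath);
  } catch (err) {
    // a validation check failed: the partially-initialized DB should already
    // have been deleted, so nothing unvalidated remains on disk
    console.error('swing-store import failed validation', err);
  }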

Test Plan

Lots of unit tests, mostly in packages/swingstore, but also in cosmic-swingset.

mhofman commented 1 year ago

The swingset XS heap snapshot store (snapStore) records tuples of (vatID, startPos, heapSnapshotData, snapshotID)

We'll need to include an upgradeNum in there as well since I believe startPos gets reset on vat upgrade.

Swingstore will be responsible for providing "exports" of its contents, when requested, at boundaries that correspond to blocks (one sampling point per commit() call).

There may be multiple exports per commit() call, for example in the bootstrap block, or when executing swingset between blocks. I think in general we should not attach more semantics to commit beyond "make sure you've saved your data and clean up discarded data as needed".

The host application is required to sandwich their swingstore and kernel usage in the following pattern:

  • ss.openExportWindow()
  • interact with devices, push things on run loop, make one or more calls to controller.run(), etc
  • ss.closeExportWindow()
  • ss.commit()
  • optional call to ss.getExportArtifacts()

It is obligated to call getExportArtifacts before the next call to openExportWindow, otherwise swingstore is free to discard data, making it impossible to recover those artifact blobs.

Again I think these restrictions are too restrictive.

the swingstore must not invoke dataExportCallback after closeExportWindow returns.

Being pedantic here, but after closeExportWindow()'s return promise settles is probably more accurate.

warner commented 1 year ago

BTW the directory form of this could probably be used to build the vaguely-editable "genesis block export" data structure that @arirubinstein has asked about, for situations where we need to halt chain1, export its state to a giant JSON file, modify that somehow, then launch chain2 from the edited version. You could modify some kvStore keys, or some scalar vatstore data, without too much fuss. Changing object references within virtual data could be done very very carefully (updating refcounts and c-list entries to match). Changing anything about the xsnap heap snapshots (and thus the ephemeral vat state) is probably impossible, but hey anything is possible if you're desperate enough.

warner commented 1 year ago

cc @FUDCo

mhofman commented 1 year ago

Likely duplicate of #6562

warner commented 1 year ago

Likely duplicate of #6562

Ok maybe, but please retain the swing-store-centric "how do I propagate my state to a new swing-store" perspective from this ticket (also the API sketch). The first swing-store needs to provide enough data to the first host application, to allow the second swing-store to request enough data from the second host application, to populate enough swing-store state, to allow the second kernel to "resume" from the snapshot. The other ticket's title "API to get summary of swingstore block changes" treats the host application as the principal, whereas I think it's helpful to think of swing-store as the instigator and the host-app as a dumb carrier of data.

mhofman commented 1 year ago

please retain the swing-store-centric "how do I propagate my state to a new swing-store"

Very much the plan. Actually the API is mostly based on state export, and the incremental export of KV data is orthogonal.

The other ticket's title "API to get summary of swingstore block changes" treats the host application as the principal

I agree that we can consider the other issue as a subset of this one.

also the API sketch

So we decided to eschew the openExportWindow / closeExportWindow as we didn't see a benefit. The swingStore will just always invoke the callback when it generates new data for each crank. If needed we may add a beforeCommit or something like that which would allow the swingStore to generate kv data that was not automatically generated during cranks but that must be included in the block. However we don't see a need for it at this point.

The current design I have is the following:

diff --git a/packages/swing-store/src/swingStore.js b/packages/swing-store/src/swingStore.js
index 94630f935..4ce0a2a61 100644
--- a/packages/swing-store/src/swingStore.js
+++ b/packages/swing-store/src/swingStore.js
@@ -62,6 +62,8 @@ export function makeSnapStoreIO() {
  *   commit: () => Promise<void>,  // commit changes made since the last commit
  *   close: () => Promise<void>,   // shutdown the store, abandoning any uncommitted changes
  *   diskUsage?: () => number, // optional stats method
+ *   setKVDataExportCallback: (callback: (newData: KVDataEntry[]) => void) => void, // Set a `callback` invoked by swingStore when new serializable data is available for export
+ *   getExporter(): SwingStoreExporter, // Creates an exporter of the swingStore content from the most recent commit point
  * }} SwingStoreHostStorage
  *
  * @typedef {{
@@ -82,6 +84,57 @@ export function makeSnapStoreIO() {
  * }} SwingStore
  */

+/**
+ * @typedef {[
+ *   key: string,
+ *   value: string,
+ * ]} KVDataEntry
+ *
+ * @typedef {object} SwingStoreExporter
+ * Allows to export data from a swingStore as a fixed view onto the content as
+ * of the most recent commit point when the exporter was created.
+ * The exporter may be used while another SwingStore instance is active for the
+ * same DB, possibly in another thread or process.
+ * It guarantees that regardless of the concurrent activity of other swingStore
+ * instances, the data representing the commit point will stay consistent and
+ * available.
+ *
+ * @property {() => AsyncIterator<KVDataEntry>} getKVData
+ * Get a full dump of KV data from the swingStore. This represents both the
+ * KVStore (excluding host and local prefixes), as well as any data needed to
+ * validate all artifacts, both current and historical. As such it represents
+ * the root of trust for the application.
+ * Likely content of validation data (with supporting entries for indexing):
+ * - lastStartPos.${vatID} = ${startPos}
+ * - transcript.${vatID}.${startPos} = ${endPos}-${rollingHash}
+ * - heap-snapshot.${vatID}.${startPos} = ${hash}
+ *
+ * @property {(options: {includeHistorical: boolean}) => AsyncIterator<string>} getArtifactNames
+ * Get a list of the names of artifacts available from the swingStore.
+ * A name returned by this method guarantees that a call to `getArtifact` on
+ * the same exporter instance will succeed. Options control the filtering of
+ * the artifact names yielded.
+ * Likely artifact names:
+ * - transcript-${vatID}-${startPos}-${endPos}
+ * - heap-snapshot-${vatID}-${startPos}
+ *
+ * @property {(name: string) => Promise<ArrayBuffer>} getArtifact
+ * Retrieve an artifact by name. May throw if the artifact is not available,
+ * which may occur if the artifact is historical and wasn't preserved.
+ *
+ * @property {() => Promise<void>} close
+ * Dispose of all resources held by this exporter. Any further operation on
+ * this exporter or its outstanding iterators will fail.
+ */
+
+/**
+ * Function used to create a new swingStore from an object implementing the
+ * exporter API. The exporter API may be provided by a swingStore instance, or
+ * implemented by a host to restore data that was previously exported.
+ *
+ * @typedef {(exporter: SwingStoreExporter) => Promise<SwingStore>} ImportSwingStore
+ */
+
 /**
  * A swing store holds the state of a swingset instance.  This "store" is
  * actually several different stores of different types that travel as a flock

It uses a unified exporter interface for the host to get data from a swingStore to generate state-sync snapshots, as well as to create a new swingStore from either an existing swingStore, or from restored state-sync snapshot artifacts. The consumer drives the consumption of artifacts, deciding which artifacts are needed depending on the kind of usage (state-sync, shallow restore, full restore). The iterator approach that was previously considered would have forced the consumer to evaluate each artifact offered by the exporter, and decide whether it was needed or not for the use case.

I did not want to expose file system concerns in this API and preferred keeping it at the level of kv-data, artifact names, and opaque data. It's fairly straightforward to implement a consumer that uses the exporter to write files to disk, or to implement the exporter API based on reading data from a directory.

The getArtifactNames method takes options which can be extended in the future to filter the artifacts needed. For example we could imagine requesting all transcript artifacts for a given vat to extract the data needed for a Manchurian style upgrade. Similarly we could imagine adding an API to an existing swingStore which takes an exporter allowing it to query and load historical transcripts (or even heap-snapshots).

warner commented 1 year ago

That sounds pretty good. I'm guessing that hostStorage.getExporter().getKVData() wouldn't be used in our situation, because we've been grabbing incremental entries from setKVDataExportCallback the whole time instead?

In discussion with @FUDCo, I've been describing this key/value dataset as the "shadow table", both to avoid confusion with SwingStoreKernelStorage.kvStore, and to emphasize that it serves as a root of trust. The swingstore tells the host application "you are responsible for safely delivering the contents of my shadow table to my successor", and "oh BTW the table is just a bunch of string key/value pairs". The swing-store is responsible for filling it with everything it needs to 1: repopulate the kvStore, 2: figure out what artifacts are needed (i.e. restore will fail unless all of these were available and valid), and 3: validate any artifacts it receives.

The successor swing-store gets the full contents of the shadow table first (and it is allowed to rely upon the contents being accurate and complete). Then it gets to work on artifacts, and must compare each alleged artifact against some hashes kept in the shadow table. Between the shadow table and the artifacts, the successor must be able to repopulate all the required data.

So a big chunk of the shadow table will be filled with kvStore entries, because we can't efficiently put them into an artifact (if our kvStore were merkleized like IAVL, it might be a different story). But those will live in a special part of the shadow table, maybe with all keys prefixed by kv. or something. And the rest of the shadow table will have hashes and metadata about what non-kvStore things to expect.
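
As a hypothetical illustration of such a layout (the kv. prefix is the guess from above, and the validation keys follow the "likely content" sketch in the earlier typedef; none of these names are final):

  kv.${kvStoreKey} = ${kvStoreValue}                           // shadowed kvStore entry
  lastStartPos.${vatID} = ${startPos}                          // index of the current span
  transcript.${vatID}.${startPos} = ${endPos}-${rollingHash}   // span validation data
  heap-snapshot.${vatID}.${startPos} = ${hash}                 // snapshot validation data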

Chip and I figured that we'd need a shadow-table entry for every historical transcript span, forever, so that we retain the ability to validate spans pulled from an archive node in the case of a sleeper-agent upgrade situation. And we might retain an entry for every heap snapshot, just in case. Then there's an entry with the cumulative hash of the active transcript span for each vat (overwritten after every delivery), and likewise for the most recent heap snapshot for each vat.

As we add more tables to swing-store, we either shadow their contents into the shadow table, or we maintain a hash of their contents in the shadow table and prepare an artifact with the contents upon request.

Make sure there's a comment in swing-store with a schema for this shadow table, like the one for kvStore at the top of kernelKeeper.js, to keep track of what all the prefixes are. It's important that the kernel not be able to cause kvStore writes that collide with non-kvStore pieces of the shadow table.

FUDCo commented 1 year ago

A question arising during implementation: instead of or in addition to the setKVDataExportCallback method (and we should definitely devise a better name for that), would it make sense to accept the callback in the options bag passed to makeSwingStore?

mhofman commented 1 year ago

I think it's acceptable yes, and probably better. I put it on the host facet since it felt like it fit there alongside getExporter(), but since the host will never call setKVDataExportCallback and getExporter on the same swingStore instance, I'm ok removing them both from there.

warner commented 1 year ago

Here's some theory-of-operation documentation that I should have written when we started this effort. Finally writing it down is helping to clarify my thinking about the API in PR #7026, so I figured I'd do it before diving into the review. If it survives discussion, I'll make an additional PR to include the docs in the export API changes (creating packages/swing-store/docs/data-export.md and a supporting images/ subdir).

SwingStore Data Import/Export

The "SwingStore" package provides the database-backed storage component that each SwingSet kernel uses to hold all necessary state. This includes message queues, c-list tables, XS heap snapshots, and vat delivery transcripts. The host application is responsible for creating a swingstore instance and passing it to the new kernel, and for committing the store's database at the appropriate point in the execution cycle.

Some applications may want to record their state changes in a way that can be cloned, to create new instances of the application. For example, a blockchain may consist of many "validators", each of which holds a replica of (hopefully) identical SwingSet kernel state, and we need a way to launch new validators and bring them quickly and cheaply up-to-date with the existing ones. We want the old validators to publish their SwingSet state, and for a prospective new validator node to be able to download this state as a starting point, rather than needing to replay the entire transaction/transcript history of the chain from the beginning. This data may follow an untrusted path, so the new node must be able to rely upon (or validate) the data it receives. Typically there is a "block root hash" which they use as a starting point (which they either accept on faith from their operator, or which they somehow test against chain voting rules), then they can validate additional data against this root hash.

Blockchain platforms like cosmos-sdk have tools to implement "state-sync", so the library will handle data formatting and distribution. But at the application layer, we must provide the SwingStore state to this library in a suitable format. The cosmos-sdk state-sync tools require that 1: every block includes a commitment to the entire state of the application, and 2: every once in a while (perhaps once per day) the application will be asked for a set of "export artifacts". The combination of the current block's commitment and the export artifacts should be sufficient for a new participant to receive a state vector that can be safely validated against the current chain state.

Each SwingStore instance provides methods to facilitate this state export, and then to build a new SwingStore from the exported dataset. There is one set of methods to perform one-off full exports of the state. To facilitate consensus machines, a second set is provided to perform incremental export of just the validation data, allowing the (large) remaining data to be exported only on rare occasions.

Two Stages: Export Data and Export Artifacts

The SwingStore export protocol defines two stages (effectively two datasets). The contents of both are private to the SwingStore (the host application should make no assumptions about their contents or semantics). The first stage is called the "export data", and contains a set of key-value pairs (both strings, TODO blobs?). The second is called the "export artifacts", each of which has a name (a string), and contains a blob of bytes. In general, the artifact blobs are much larger than the first-stage export data values, and take more time to generate. Host applications will typically not access the second-stage export artifacts until after the swingstore commit() is complete.

(image 1) swing-store export - Frame 1

Each time a SwingStore API is used to modify the state somehow (e.g. adding/changing/deleting a kvStore entry, or pushing a new item on to a transcript), the contents of both datasets may change. New first-stage entries can be created, existing ones may be modified or deleted. And the set of second-stage artifacts may change.

These export data/artifact changes can happen when calling into the kernel (e.g. invoking the external API of a device, causing the device code to change its own state or push messages onto the run-queue), or by normal kernel operations as it runs (any time controller.run() is executing). When the kernel is idle (after controller.run() has completed), the kernel will not make any changes to the SwingStore, and both datasets will be stable.

Among other things, the swing-store records a transcript of deliveries for each vat. The collection of all deliveries to a particular vat since its last heap snapshot was written is called the "current span". The first-stage export data will include a single record for each vat that remembers the extent and the hash of the current span. This record then refers to a second-stage export artifact that contains the actual transcript contents.

(image 2a) swing-store export - Frame 2a

When a delivery is made, a new entry is appended to the end of the current span. This updates (replaces) the record in the first-stage export data: the new record has a longer extent (the endPos value is higher), and the span contents have a new hash. The second-stage export artifact is replaced as well: the name remains the same, but the contents are now different.

(image 2b) swing-store export - Frame 2b
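
A hypothetical example of that first-stage update, reusing the key/value shapes sketched in the exporter typedef earlier (the real schema is internal to SwingStore):

  // before the delivery
  transcript.v6.120 = "133-a91c…"
  // after the delivery: same key, higher endPos, new rolling hash
  transcript.v6.120 = "134-7f02…"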

To clone a SwingStore, the host application must extract both stages from the source copy, and somehow deliver them to a new instance of the application, which can feed both datasets into a new SwingStore. When complete, the destination SwingStore will have the same contents as the original, or at least enough to continue execution from the moment of copy (it may be lacking optional/historical data, like non-current vat transcripts from before the most recent heap snapshot).

The host application is responsible for delivering both datasets, but it is only responsible for maintaining the integrity of the first stage export data. This table contains enough information to validate the contents of the export artifacts. The new clone is entirely reliant upon the contents of the first stage: if someone can manage to corrupt its contents, the new clone may be undetectably and arbitrarily corrupted. But as long as the first stage was delivered correctly, any changes to the second stage export artifacts will be discovered by the new SwingStore, and the import process will abort with an error. This split reduces the cost of supporting occasional state-sync export operations, as described below.

Full Export

The simplest (albeit more expensive) way to use SwingStore data export is by creating an "exporter" and asking it to perform a one-off full export operation.

The exporter is created by calling makeSwingStoreExporter(dirpath), passing it the same directory pathname that was used to make your SwingStore instance. This API allows the exporter to use a separate SQLite database connection, so the original can continue executing deliveries and moving the application forward, while the exporter continues in the background. The exporter creates a new read-only SQLite transaction, which allows it to read from the old DB state even though new changes are being made on top of that checkpoint. In addition, the exporter can run in a thread or child process, so the export process can run in parallel with ongoing application work. This gives you as much time as you want to perform the export, without halting operations.

After calling hostStorage.commit(), the host application can extract the first-stage export data, and then the second-stage export artifacts:

   const dirPath = '.../swing-store';
   const swingStore = openSwingStore(dirPath);
   ...
   await controller.run();
   hostStorage.commit();
   // spawn a child process

  // child process does:
  const exporter = makeSwingStoreExporter(dirPath);
  // exporter now has a txn, parent process is free to proceed forward
  const exportData = new Map();
  for (const [key, value] of exporter.getExportData()) {
    if (value) {
      exportData.set(key, value);
    } else {
      exportData.delete(key);
    }
  }
  const exportArtifacts = new Map();
  for (const name of exporter.getArtifactNames()) {
    exportArtifacts.set(name, exporter.getArtifact(name));
  }
  // export is 'exportData' and 'exportArtifacts'

(image 3) swing-store export - Frame 3

When doing a complete export, the getExportData() iterator will only announce each first-stage key once. (TODO) However, for completeness, the host application should be prepared to observe multiple assignments (and even deletions) of each key: the last update should win. Deletions are indicated by the value of a key-value pair being undefined.

Note that the new DB transaction is created during the execution of makeSwingStoreExporter(). If the exporter is run in a child process, the parent must ensure that it does not invoke the next hostStorage.commit() before the child reports that makeSwingStoreExporter() has completed. The export will capture the state of the SwingStore as of some particular commit, and we don't want to have a race between the parent finishing the next block, and the child establishing a transactional anchor on the state from the previous block.

Incremental Export

The full export can be useful for backing up a "solo" swingset kernel, where consensus among multiple nodes is not required. However the more common (and complicated) use case is in a consensus machine, where multiple replicas are trying to maintain the same state. SwingStore offers an "incremental export" mode that is designed to work with the cosmos-sdk state-sync protocol.

In this protocol, every block must contain enough information (hashes) to validate the entire state-sync dataset, even though most blocks are not used for state-sync (and only a very few replicas will volunteer to create state-sync data). All validators vote on the block hashes, and these blocks are widely reported by block explorers and follower/archive nodes, so it is fairly easy to answer the question "is this the correct root hash?" for an arbitrary block height.

When someone wants to launch a new validator, they ask around for an available state-sync snapshot. This will typically come from an archiving node, which produces a new snapshot each day. The archive node will report back the block height of their latest state-sync snapshot. The new validator operator must acquire a valid block header for that height, doing their own due diligence on the correctness of that header (checking its hash against public sources, etc). Then they can instruct their application to proceed with the state-sync download, which fetches the contents of the state-sync snapshot and compares them against the approved block header root hash.

So, to include SwingStore data in this state-sync snapshot, we need a way to get the first-stage export data (including its validation hashes) into every block, as cheaply as possible. We defer the more expensive second-stage export until a state-sync producing node decides it is time to make a snapshot.

To support this, SwingStore has an "incremental export" mode. This is activated when the host application supplies an "export callback" option to the SwingStore instance constructor. Instead of retrieving the entire first-stage export data at the end of the block, the host application will be continuously notified about changes to this data as the kernel executes. The host application can then incorporate those entries into an existing hashed Merkle tree (e.g. the cosmos-sdk IAVL tree), whose root hash is included in the consensus block hash. Every time the callback is given (key, value), the host should add a new (or modify some existing) IAVL entry, using an IAVL key within some range dedicated to the SwingStore first-stage export data. When the callback receives (key, undefined), it should delete the entry. In this way, the IAVL tree maintains a "shadow copy" of the first-stage export data at all times, making the contents both covered by the consensus hash, and automatically included in the cosmos-sdk IAVL tree where it will become available to the new validator as it begins to reconstruct the SwingStore.

All validator nodes use this export callback, even if they never perform the rest of the export process, to ensure that the consensus state includes the entire first-stage dataset. (Note that the first stage data is generally smaller than the full dataset, making this relatively inexpensive).

Then, on the few occasions when the application needs to build a full state-sync snapshot, it can ask the SwingStore (after block commit) for the full set of artifacts that match the most recent commit.

(image 4) swing-store export - Frame 4

   const dirPath = '.../swing-store';
   const iavl = ...;
   function exportCallback(key, value) {
     const iavlKey = `ssed.${key}`; // 'ssed' is short for SwingStoreExportData
     if (value) {
       iavl.set(iavlKey, value);
     } else {
       iavl.delete(iavlKey); // value===undefined means delete
     }
   }
   const swingStore = openSwingStore(dirPath, { exportCallback });
   ...
   await controller.run();
   hostStorage.commit();

   // now, if the validator is configured to publish state-sync snapshots, 
   // and if this block height is one of the publishing points,
   // do the following:

  // spawn a child process

  // child process does:
  const exporter = makeSwingStoreExporter(dirPath);
  // note: no exporter.getExportData(), the first-stage data is already in IAVL
  const artifacts = new Map();
  for (const name of exporter.getArtifactNames()) {
    artifacts.set(name, exporter.getArtifact(name));
  }
  // instruct cosmos-sdk to include 'artifacts' in the state-sync snapshot

Import

On the other end of the export process is an importer. This is a new host application, which wants to start from the contents of the export, rather than initializing a brand new (empty) kernel state.

When starting a brand new instance, host applications would normally call openSwingStore(dirPath) to create a new (empty) SwingStore, then call SwingSet's initializeSwingset(config, .., kernelStorage) to let the kernel initialize the DB with a config-dependent starting state:

// this is done only the first time an instance is created:

import { openSwingStore } from '@agoric/swing-store';
import { initializeSwingset } from '@agoric/swingset-vat';
const dirPath = './swing-store';
const { hostStorage, kernelStorage } = openSwingStore(dirPath);
await initializeSwingset(config, argv, kernelStorage);

Once the initial state is created, each time the application is launched, it will build a controller around the existing state:

import { openSwingStore } from '@agoric/swing-store';
import { makeSwingsetController } from '@agoric/swingset-vat';
const dirPath = './swing-store';
const { hostStorage, kernelStorage } = openSwingStore(dirPath);
const controller = await makeSwingsetController(kernelStorage);
// ... now do things like controller.run(), etc

When cloning an existing kernel, the initialization step is replaced with importSwingStore. The host application should feed the importer with the export data and artifacts, by passing an object that has the same API as the SwingStore's exporter:

import { importSwingStore } from '@agoric/swing-store';
const dirPath = './swing-store';
const exporter = {
  getExportData() { // return iterator of [key,value] pairs },
  getArtifactNames() { // return iterator of names },
  getArtifact(name) { // return blob of artifact data },
};
const { hostStorage } = importSwingStore(exporter, dirPath);
hostStorage.commit();
// now the swingstore is fully populated

Once the new SwingStore is fully populated with the previously-exported data, the host application can use makeSwingsetController() to build a kernel that will start from the exported state.

Optional / Historical Data

Some of the data maintained by SwingStore is not strictly necessary for kernel execution, at least under normal circumstances. For example, once a vat worker performs a heap snapshot, we no longer need the transcript entries from before the snapshot was taken, since vat replay will start from the snapshot point. We split each vat's transcript into "spans", delimited by heap snapshot events, and the "current span" is the most recent one (still growing), whereas the "historical spans" are all closed and immutable. Likewise, we only really need the most recent heap snapshot for each vat: older snapshots might be interesting for experiments that replay old transcripts with different versions of the XS engine, but no normal kernel will ever need them.

Most validators would prefer to prune this data, to reduce their storage needs. But we can imagine some extreme upgrade scenarios that would require access to these historical transcript spans. Our compromise is to record validation data for these historical spans in the export data, but omit the spans themselves from the export artifacts. Validators can delete the old spans at will, and if we ever need them in the future, we can add code that will fetch copies from an archive service, validate them against the export data hashes, and re-insert the relevant entries into the SwingStore.

The getArtifactNames() API includes an option named includeHistorical. If true, all available historical artifacts will be included in the export. If false, none will be included. Note that the "export data" is necessarily unaffected: if we ever want to validate this optional data, the hashes are mandatory. But the getArtifactNames() list will be smaller if you set includeHistorical = false. Also, re-exporting from a pruned copy will lack the old data, even if the re-export uses includeHistorical = true, because the second SwingStore cannot magically reconstruct the missing data.
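
For example (a sketch, using the AsyncIterator shape from the exporter typedef above):

  // list only the artifacts needed to resume execution, skipping historical
  // transcript spans and old heap snapshots
  for await (const name of exporter.getArtifactNames({ includeHistorical: false })) {
    console.log('required artifact:', name);
  }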

In the future, we will arrange the SwingStore SQLite tables to provide easy sqlite3 CLI commands that will delete the old data, as well as options to openSwingStore() that will automatically delete old data as it becomes old. Validators who care about minimizing their disk usage will want to set this option, and/or periodically use the CLI command to prune the old data.

Implementation Details

SwingStore contains components to accommodate all the various kinds of state that the SwingSet kernel needs to store. This currently consists of three portions:

  • kvStore: a general-purpose string key-value store (message queues, c-list tables, and other kernel state)
  • transcript store (streamStore): the per-vat delivery transcripts, split into spans
  • snapStore: compressed XS heap snapshots for each vat

Currently, the SwingStore treats transcript spans and heap snapshots as export artifacts, with hashes recorded in the export data for validation (and to remember exactly which artifacts are necessary). The kvStore is copied one-to-one into the export data (i.e. we keep a full shadow copy in IAVL), because that is the fastest way to ensure the kvStore data is fully available and validated.

If some day we implement an IAVL-like Merkle tree inside SwingStore, and use it to automatically generate a root hash for the kvStore at the end of each block, we will replace this (large) shadow copy with a single kvStoreRootHash entry, and add a new export artifact to contain the full contents of the kvStore. This would reduce the size of the IAVL tree, as well as the rate of IAVL updates during block execution, at the cost of increased CPU and complexity within SwingStore.

mhofman commented 1 year ago

Thanks for writing this up.

We should add a section about the preferred determinism for the artifacts data themselves. It's not mandatory for this scheme to work, but the underlying cosmos-sdk and tendermint protocol work a lot better if the data is the same across validators (state-sync snapshot chunks can be fetched from any tendermint node).

Notes:

the "export data", and contains a set of key-value pairs (both strings, TODO blobs?)

We decided that strings were sufficient for now. The vstorage API currently only supports strings (the underlying DB used by cosmos actually supports bytes if we need in the future).

Host applications will typically not access the second-stage export artifacts until after the swingstore commit() is complete.

The sentence reads out of place given we haven't gotten into the scheduling details. Also "typically" is too weak. The host application will never see or request artifacts before they're committed.

  const exportData = new Map();
  for (const [key, value] of exporter.getExportData()) {
    if (value) {
      exportData.set(key, value);
    } else {
      exportData.delete(key);
    }
  }
  const exportArtifacts = new Map();
  for (const name of exporter.getArtifactNames()) {
    exportArtifacts.set(name, exporter.getArtifact(name));
  }
  // export is 'exportData' and 'exportArtifacts'

Nit: .getExportData, .getArtifactNames and .getArtifact all return AsyncIterables to support streaming, and must be consumed by the time the exporter is closed, so this is not quite correct.

  const exportData = new Map();
  for await (const [key, value] of exporter.getExportData()) {
    if (value) {
      exportData.set(key, value);
    } else {
      exportData.delete(key);
    }
  }
  const exportArtifacts = new Map();
  for await (const name of exporter.getArtifactNames()) {
    const artifactData = await buffer(exporter.getArtifact(name));
    exportArtifacts.set(name, artifactData);
  }
  // export is 'exportData' and 'exportArtifacts'

(TODO) However, for completeness, the host application should be prepared to observe multiple assignments (and even deletions) of each key: the last update should win. Deletions are indicated by the value of a key-value pair being undefined.

Is it ok for such an exporter to carry these multiple assignments and deletions, and feed them as-is into the importer?

This will typically come from an archiving node, which produces a new snapshot each day.

I'm not sure about the correctness of this assumption, but it's not material anyway. I don't know which type of nodes are configured to carry state-sync snapshots, but "archive nodes" seem restrictive.

The archive node will report back the block height of their latest state-sync snapshot. The new validator operator must acquire a valid block header for that height, doing their own due diligence on the correctness of that header (checking its hash against public sources, etc).

Again, not quite. The new node must only configure a trusted height and app hash, but that does not need to be the same height as existing state-sync snapshots, just a "root of trust". It needs to be a past height after which it's ok to receive a state-sync snapshot. Once a client discovers a state-sync snapshot through tendermint, it retrieves and validates the app hash for that snapshot height using the configured RPC server and the trusted height app hash. The effect is the same: the app hash for the snapshot height is explicitly trusted (through launch configuration), but the state-sync snapshot discovery is more flexible (uses tendermint p2p connections).

This is activated when the host application supplies an "export callback" option to the SwingStore instance constructor.

Currently it's not a constructor option but an explicit method call, but I'm ok either way

(image 4)

Nit: transcript-v4 in export data has entries with the same endPos but different hashes (forgot to increment the last entry)

  // child process does:
  const exporter = makeSwingStoreExporter(dirPath);
  // note: no exporter.getExportData(), the first-stage data is already in IAVL
  const artifacts = new Map();
  for (const name of exporter.getArtifactNames()) {
    artifacts.set(name, exporter.getArtifact(name));
  }
  // instruct cosmos-sdk to include 'artifacts' in the state-sync snapshot

Async iterators again here.

passing an object that has the same API as the SwingStore's exporter:

import { importSwingStore } from '@agoric/swing-store';
const dirPath = './swing-store';
const exporter = {
  getExportData() { // return iterator of [key,value] pairs },
  getArtifactNames() { // return iterator of names },
  getArtifact(name) { // return blob of artifact data },
};
const { hostStorage } = importSwingStore(exporter, dirPath);
hostStorage.commit();
// now the swingstore is fully populated

And here:

import { importSwingStore } from '@agoric/swing-store';
const dirPath = './swing-store';
const exporter = {
  getExportData() {}, // return async iterator of [key,value] pairs
  getArtifactNames() {}, // return async iterator of names
  getArtifact(name) {}, // return stream (async iterator of chunks) of artifact data
};
const { hostStorage } = await importSwingStore(exporter, dirPath);
hostStorage.commit();
// now the swingstore is fully populated

The kvStore is copied one-to-one into the export data (i.e. we keep a full shadow copy in IAVL)

We need to be explicit about local. and host. sections. These should not be part of the export.