warner opened this issue 1 week ago
The comms protocol must accommodate each peer falling arbitrarily far behind. When we look at a single direction (e.g. Alice->Bob), the recipient cannot be ahead of the sender (the message hasn't been created yet), but it might have acked the most recent message, or only the previous one, or the one before that, etc, or perhaps it hasn't acked any messages at all. The table of possible states looks like this, where the X axis is which message Alice has emitted, and the Y axis is which message Bob has acked:
|    | a0 | a1 | a2 | a3 |
|----|----|----|----|----|
| b0 | *  | *  | *  | *  |
| b1 |    | *  | *  | *  |
| b2 |    |    | *  | *  |
| b3 |    |    |    | *  |
In contrast, the cosmos/kernel protocol cannot fall more than one step behind: the kernel will not be given work for block 2 until cosmos has committed block 1. There are only two databases, and cosmos strictly commits a given block to IAVL after swingstore has committed the same to SQL. So either both have committed the same thing, or SQL has committed and IAVL has not.
|    | k0 | k1 | k2 | k3 |
|----|----|----|----|----|
| c0 | *  | *  |    |    |
| c1 |    | *  | *  |    |
| c2 |    |    | *  | *  |
| c3 |    |    |    | *  |
This relaxes the "embargoed durable sends" rule of the Waterken protocol. Our `chainSends` must be durable because we might have fallen behind: if the kernel commits block 2, and cosmos has not committed block 2, then when we restart, cosmos needs to re-execute the block, and for that it needs all the `chainSends` from block 2, even though the kernel is not going to re-execute as well. The block-2 sends must be recorded in swingstore, else they might not be available for a restart. And they must remain there until we can't possibly need them anymore, which happens when IAVL commits block-2. It is then easiest to retain them until the kernel starts block-3, or until just before it commits block-3. While we could add an extra step that removes the now-unneeded `chainSends` from swingstore and commit the txn, that would mean three total commits (swingstore, cosmos, swingstore), introducing yet another crash location to consider (IAVL commits, but then we crash before committing the swingstore chainSend deletion, vs crashing after the chainSend deletion but before the end-of-block swingstore commit), and the only benefit would be removing a temporary spike in SQL space usage.
But we don't need the sends to be embargoed, because the recipient won't remember them (i.e. commit IAVL) unless the kernel reaches its own commit point. We're allowed to reveal the bridge-device actions as soon as they happen, just as long as we remember them (for the benefit of cosmos, not the kernel) at least until we start the next block.
The "recipient won't commit early" property is what allows our bridge device messages to have answers. The comms protocol doesn't have this luxury, so all comms messages are async (send, no response).
We basically have three kinds of messages:

- device IO writes (`VBANK_GIVE`, or a chain-storage write)
- device IO reads (`VBANK_GET_BALANCE`)
- export-data writes

The first two are coming from the bridge or mailbox device, and either write to or read from some portion of IAVL. These are the real device IO pathway for the kernel.
The third, export-data writes, is not driven by the kernel; it's driven by swingstore. While the export-data needs to be written into IAVL, and the changes need to be durable (against the "SQL committed but IAVL didn't" case), we could afford to use a different pathway than the device IO messages.
For example, one option would be to accumulate all the export-data callbacks into RAM, and then store them directly into their own portion of `hostStorage` around the `hostStorage.commit` time, and then send them to the cosmos side (sending them to the chain, but not tracking them in `chainSends`). Given the overlap between the keys modified by each crank, this would result in a smaller total volume of changes (e.g. `kvStore.set('crankNumber', ..)` happens once per crank, but really only needs to cause one IAVL change per block). However they would all be delivered at the same time, which might make for a larger message than we'd like, and they would all be processed during the "critical window" between SQL commit and IAVL commit, which would increase the probability of a crash requiring a replay-chainSends recovery.
From swing-store's point of view, export-data is a straightforward part of exports, and export-data callbacks are clearly necessary to allow host applications to incrementally commit to the export contents. When we combine this with the Waterken durable-output IO model, it becomes clear that any host app which uses a second DB and export-data callbacks will need to write the export-data deltas to `hostStorage.kvStore`.
So, what if we baked this into swing-store? Instead of having the host app build its own mechanism around `hostStorage.kvStore` (like `chainSends`), we could have swing-store record the export-data deltas directly into a new section of `hostStorage`, and retrieve them later:
```js
hostStorage.commit();
const deltas = hostStorage.getExportDeltas();
```
Assuming we can tolerate getting the export-data deltas late in the block cycle, then instead of using an immediate `exportCallback`, the swing-store could leave everything in `pendingExports`. Then, during `hostStorage.commit()`, it clears a new table named `exportDeltas`, copies everything from `pendingExports` into `exportDeltas`, and then finally performs the SQL `COMMIT` (or, use just one table and include a generation-number column to distinguish between new and old, or something). A new `hostStorage.getExportDeltas()` function would return (or iterate through) the deltas from that table.
The host app would finish `hostStorage.commit()`, and only then grab the export deltas and feed them into the second DB (IAVL). As described above, this would be less work overall, but might do more of it at an inconvenient or inefficient time.
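To make the shape concrete, here is a hypothetical sketch of that handoff, with the two SQL tables modeled as in-memory Maps. The names `pendingExports`, `exportDeltas`, and `getExportDeltas()` follow the proposal above; none of this is the current swing-store API.

```javascript
// Hypothetical sketch: the two SQL tables are modeled as Maps, and
// commit() stands in for the real SQL COMMIT that would move the rows.
function makeMockSwingStore() {
  const pendingExports = new Map(); // deltas accumulated during the block
  let exportDeltas = new Map(); // snapshot taken at commit time

  return {
    // stands in for the internal noteExport() on every kvStore change
    noteExport(key, valueOrNull) {
      // later writes to a key supersede earlier ones, so a dozen
      // per-crank changes collapse into one delta per block
      pendingExports.set(key, valueOrNull);
    },
    commit() {
      // clear last block's deltas, copy this block's over, then "COMMIT"
      exportDeltas = new Map(pendingExports);
      pendingExports.clear();
    },
    getExportDeltas() {
      return [...exportDeltas.entries()];
    },
  };
}

// usage: the host commits first, then feeds the deltas to the second DB
const hs = makeMockSwingStore();
hs.noteExport('crankNumber', '1');
hs.noteExport('crankNumber', '2'); // supersedes the first write
hs.commit();
const deltas = hs.getExportDeltas();
```

Note how the per-key overwrite gives the delta-merging benefit described above: only the final value of `crankNumber` survives to the commit-time snapshot.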
The host app would still need something like `chainSends` for the device-IO messages (`VBANK_GIVE`/etc), but since those only happen during a `controller.run()`, it can be confident that they've all been performed by the time it considers doing a commit. But this would safely allow e.g. `upgradeSwingset` to create export-data deltas and have them delivered into IAVL, even without a crank being performed.
If we decided that the immediate callback provided performance benefits, we could do both. `openSwingstore(path, { exportCallback })` would still arrange for the callback to be invoked quickly, at the end of each crank, but the callback would no longer be obligated to remember the details (in `chainSends`). Instead, the durability requirement would be satisfied by swingstore automatically updating the `exportDeltas` table, and the recovery path would use `getExportDeltas` at restart time to acquire the list of changes that need to be "replayed", with something like:
```js
function recover() {
  for (const msg of JSON.parse(hostStorage.kvStore.get('chainSends'))) {
    sendToChain(msg);
  }
  for (const [key, valueOrNull] of hostStorage.getExportDeltas()) {
    sendToChain({ type: 'swingsetExportData', key, value: valueOrNull });
  }
}
```
I reviewed the kernel code, and `controller.run()` always ends with a `kernelKeeper.endCrank()`, even if there's no work to do (the `try` block encloses the `if (!message) break` clause, and its `finally` does the `endCrank`, so the only pathways that could skip the `endCrank` are already-panicking and not-`start()`-ed, both of which would be host-app misbehaviors).
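That control flow can be sketched as follows. This is a simplified stand-in for the kernel.js run loop, not the real code; only the try/finally shape is the point.

```javascript
// Simplified sketch of the kernel.js run-loop shape described above: the
// try block encloses the "no work" break, and the finally clause performs
// endCrank() on every pathway out of the loop body, including the
// empty-queue case.
function makeRunLoop(kernelKeeper, getNextMessage, deliver) {
  function run() {
    for (;;) {
      try {
        const message = getNextMessage();
        if (!message) break; // still reaches the finally below
        deliver(message);
      } finally {
        kernelKeeper.endCrank(); // runs even on the break pathway
      }
    }
  }
  return { run };
}

// usage: even with no work at all, endCrank() fires once
let endCrankCalls = 0;
const kernelKeeper = { endCrank: () => { endCrankCalls += 1; } };
const queue = [];
makeRunLoop(kernelKeeper, () => queue.shift(), () => {}).run();
```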
So the swingstore's semi-accidental "we'll buffer export-data callbacks until `endCrank` or `commit`" behavior means the host app doesn't really need to handle callbacks during `openSwingstore`, and the kernel's "`controller.run()` always calls `endCrank`" behavior means the host app doesn't really need to handle callbacks during `commit`.
So I think we don't currently have a vulnerable crash window.
To guard against changes in either swingstore or the kernel, we could enhance cosmic-swingset to assert that the `exportCallback` was not called in the windows for which it wasn't really ready:

- set the `exportCallback` to a function that includes an `assert(exportCallbackReady, 'export-data callback invoked while unprepared')` check
- start with `exportCallbackReady = false` in `openSwingstore`; after we're sure we're no longer in a hangover situation (and the chain-sends buffer has been cleared), set it to `true`, before `upgradeSwingset` or other before-an-END_BLOCK code
- in `saveOutsideState()`, set it to `false` just before the `clearChainSends` (which is what causes us to stop being ready to handle callbacks)
- this would catch a change in swingstore that flushes callbacks somewhere other than `endCrank`, or fails to call `endCrank`, or a change in the host app that does a state-changing kernel call without also managing to trigger `endCrank`
- set it back to `true` .. somewhere? maybe the same `= true` we added above will handle both ready points

I'd like to allow swingstore to flush its callbacks any time, but the current situation will work safely, and the guards above would catch behavioral changes which would violate the assumptions we're currently relying upon.
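A sketch of that guard, with hypothetical helper names (`makeGuardedExportCallback`, `markReady`, `markUnready`); only the assertion pattern matters, and the real change would live in cosmic-swingset's launch-chain.js:

```javascript
// Sketch of the proposed exportCallbackReady guard. The helper names are
// illustrative; only the wrap-and-assert pattern is being proposed.
let exportCallbackReady = false;

function assertReady(cond, msg) {
  if (!cond) throw new Error(msg);
}

// wrap the real callback so any flush in an unprepared window throws
function makeGuardedExportCallback(realCallback) {
  return updates => {
    assertReady(exportCallbackReady, 'export-data callback invoked while unprepared');
    realCallback(updates);
  };
}

// after openSwingstore + hangover recovery (chain-sends buffer cleared)
const markReady = () => { exportCallbackReady = true; };
// in saveOutsideState(), just before clearChainSends()
const markUnready = () => { exportCallbackReady = false; };

// usage
const seen = [];
const guarded = makeGuardedExportCallback(updates => seen.push(updates));
let threw = false;
try {
  guarded([['k', 'v']]); // too early: not ready yet
} catch (e) {
  threw = true;
}
markReady();
guarded([['k', 'v']]); // now fine
markUnready();
```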
Possibly related: https://github.com/Agoric/agoric-sdk/issues/8652
@mhofman and I were analyzing the way cosmic-swingset handles the swingstore `export-data`, and think we identified a small window of vulnerability, during which an application crash would leave the two databases (swingstore's SQLite, and the cosmos IAVL/LevelDB) in an inconsistent state.

I've been looking for the right place to perform the `upgradeSwingset()` call introduced by #9169. This needs to happen after the swingstore is created with `openSwingStore()`, but before `makeSwingsetController()` is called to use it. And the upgrade process will cause changes to the export-data, which might cause the `exportCallback` to be invoked (or might not, depending upon how swingstore buffers and flushes those changes, see below), which means the host must be ready for those callbacks.

Really, the swingstore design assumes that the host is ready for those callbacks from the moment swingstore is given the callback function: it should be legal for swingstore to invoke `options.exportCallback` even before `openSwingstore()` has returned, so that internal upgrades (eg #8089) can get their changes covered by the export-data hashes. But, looking at launch-chain.js, it seems like we can't actually accommodate that.

It's working now because swingstore actually buffers the export-data until a `kernelStorage.endCrank()` (which doesn't happen during startup, only during a `controller.run()`), or a `hostStorage.commit()`, both of which happen late enough that the callbacks are prepared to work.

## Waterken-style Output Embargo
In general, we base our execution model upon the (closely related and cross-inspirational) E and Waterken systems. We start with the E vat model.
To make this model robust against crashes, each "atomicity domain" has a reliable database, and we define a commit/checkpoint model which ensures that no matter where a given domain crashes, when it is restarted from the most recent DB checkpoint, it will proceed forwards in a way that is consistent with what happened before the crash.
We try to follow the Waterken system's "hangover inconsistency prevention" logic in our IO design. The basic idea is that output messages are embargoed until after the end-of-crank commit. During a crank, when a component (vat, or kernel overall) wants to send a message, we instead store it in a table of "pending outbound messages". At the commit point, this table is committed along with all the internal state changes (as well as the beginning-of-crank step that removed a delivery from the input queue). Then, only after commit, the surrounding code can release the outbound messages to some remote system (with a different DB and atomicity domain).
(if we were to release the outbound messages earlier, our host might crash before the commit point, and then the new post-reboot execution might do something different, which is the failure mode we call "hangover inconsistency")
To achieve at-least-once delivery reliability of messages, the IO system needs:

- durable recording of each outbound message until it is acknowledged
- retransmission of unacknowledged messages after a reconnect or restart
- acks which let the sender retire delivered messages
- suppression of duplicate inbound messages on the receiving side
The Waterken server recorded all of its outbound messages as files in a directory, and attempted to re-send all of them each time a TCP (HTTP) connection was established to that peer. Old messages were retired (deleted) upon receipt of an ack, which was always a "high-water mark" (so `ACK: 24` would retire all messages from 0 to 24).

The swingset comms protocol (and device-mailbox) is built around this principle, and provides reliable cross-kernel messaging (which we aren't currently using in cosmic-swingset, but which might get more attention as we look at running multiple communication chains).
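That retirement rule can be sketched as follows. The queue shape and sequence numbering are illustrative, not the real comms-protocol data structures; in the real system the pending messages would be durable.

```javascript
// Sketch of Waterken-style high-water-mark retirement: every outbound
// message stays queued until an ack covers its sequence number, and a
// reconnect re-sends everything still pending.
function makeOutbox() {
  const pending = new Map(); // seqnum -> message
  let nextSeqnum = 0;
  return {
    send(message) {
      pending.set(nextSeqnum, message);
      nextSeqnum += 1;
    },
    // on a new connection, retransmit everything not yet acked
    resendAll(transmit) {
      for (const [seqnum, message] of pending) transmit(seqnum, message);
    },
    // "ACK: 24" retires all messages from 0 through 24
    ack(highWaterMark) {
      for (const seqnum of [...pending.keys()]) {
        if (seqnum <= highWaterMark) pending.delete(seqnum);
      }
    },
    pendingCount: () => pending.size,
  };
}

// usage
const outbox = makeOutbox();
outbox.send('m0');
outbox.send('m1');
outbox.send('m2');
outbox.ack(1); // retires m0 and m1, leaves m2 pending
```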
## Cosmos vs Swingset Atomicity Domains
Unfortunately, Cosmos and Swingset use separate databases. Very early on, we explored keeping all Swingset state in the Cosmos IAVL tree, but concluded that it would be too slow, and would prevent us from taking advantage of SQL features like `CREATE INDEX` that would provide a better data model (we'd have to encode everything into the IAVL key space, with a single lexicographic iteration primitive).

That means we have two atomicity domains, one per database. At end-of-block, we use `hostStorage.commit()` to perform a SQL `COMMIT` of the swingstore, and then a moment later, cosmos does a commit of the IAVL tree (i.e. a LevelDB `commit()`) to record all the cosmos-side state changes.

Assuming both systems are reliable on their own, this is safe against crashes that occur before the swingstore commit, or after the IAVL commit. When the application is rebooted, it will resume from a mutually-consistent state.
But, of course, we must also survive a crash which occurs between the two commit points. To achieve this, we apply a form of the Waterken protocol to the connection between Cosmos and Swingset. It's not quite the same, but it has similar attributes.
On each block, the cosmos/golang side sends directives into the kernel side, like `AG_COSMOS_INIT` / `BEGIN_BLOCK` / `END_BLOCK` / `COMMIT_BLOCK` (these are processed in launch-chain.js). Each one contains the block height that the cosmos side is currently executing. Upon receipt of `END_BLOCK`, the kernel side does a bunch of execution, and then commits its state, including a note about the block height that it just executed. As that execution runs, the kernel will perform device IO calls (mostly device-bridge messages like `VBANK_GIVE` and `VBANK_GET_BALANCE`), which are sent over to the cosmos side for processing.

We capture those device IO calls (and their results) in a list named `savedChainSends`. The list is stored durably.

Then, when the kernel is given a directive (`BEGIN_BLOCK` in particular), it compares the incoming block height against the last block it remembers executing and committing. In the normal case, the incoming height will be one higher than the one we remember, e.g. "please `BEGIN_BLOCK` of block 12" plus "I remember executing block 11".

But it might be equal (e.g. both 12), which means the kernel is being asked to execute a block that it already executed. This is the "ignore duplicate inbound messages" portion of the Waterken protocol. It means that we previously crashed after the swingstore commit (which captured the state changes of block 12), but before Cosmos managed to commit the block-12 IAVL changes. There are device IO calls that cosmos does not remember getting, and it needs them to achieve the correct block-12 state.
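The height comparison might be sketched like this. The function is illustrative; launch-chain.js spreads this logic across several handlers rather than a single helper.

```javascript
// Sketch of the duplicate-block check described above: decide between
// normal execution and a savedChainSends replay based on the incoming
// height versus the last height swingstore committed.
function decideBlockAction(incomingHeight, lastCommittedHeight) {
  if (incomingHeight === lastCommittedHeight + 1) {
    return 'execute'; // normal case: the next block in sequence
  }
  if (incomingHeight === lastCommittedHeight) {
    // we crashed after the swingstore commit but before the IAVL commit:
    // replay savedChainSends instead of re-executing the block
    return 'replay-chainSends';
  }
  // anything else means the two stores are more than one step apart,
  // which the commit ordering is supposed to make impossible
  throw new Error(
    `cannot handle block ${incomingHeight} after ${lastCommittedHeight}`,
  );
}
```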
So in this case, we replay the `savedChainSends` back to cosmos instead of performing the swingset execution. This is very much like the way that vat workers are brought back up-to-date (from a heap snapshot) by replaying a transcript. In this case, we don't really care about the return values of the `VBANK_GET_BALANCE` calls: the kernel doesn't need them, although we might assert that cosmos returns the same values it did the first time around.

We need to save these chainSends durably so they'll survive the host crash, and we need to save them in the same atomicity domain (eg same DB) as the kernel state. But they're outside of the kernel's data model, so they must not interfere with `kernelStorage`. This is what `swingStore.hostStorage` was created for: a separate chunk of the `kvStore` table that can be used by the host app to track whatever it needs, but which gets committed at exactly the same time as everything in `kernelStorage`.

A function named `clearChainSends` empties the in-RAM list of chainSends and returns an Array, so the data can be stored durably into `hostStorage`. The `saveOutsideState` function does this, and stores `blockHeight`, and then calls `hostStorage.commit()`. This is called when processing the `COMMIT_BLOCK` directive.

## Swingstore Export-Data
Swingstore's `export-data` is a list of key-value pairs which provide integrity validation for the swingstore export artifacts: when you want to export the state of a swingstore, you get the `export-data` list and you get a list of named artifacts; when you want to import it, you feed it the `export-data` list and then you feed it the artifacts. The importer relies upon the accuracy of the `export-data`, but it does not rely upon the accuracy of the artifacts.

This drastically improves the performance of validated exports: for each block, we update a little bit of export-data and store it into the cosmos IAVL state, but we do not need to emit all the artifacts. Artifacts are not generated until a cosmos state-sync export is requested, which might only be once per day (depending on how the node is configured). The cosmos/CometBFT state-sync import process will reliably populate IAVL completely, and verify the IAVL root against the chain's consensus state. We leverage that to provide validation of the swingstore artifacts.
So, as the kernel processes a block, it makes changes to the swingstore state, which causes changes to the export-data records. The swingstore API allows the host application to configure an `exportCallback`, which is called with these export-data deltas. The host is expected to apply these deltas to its own durable state (eg IAVL), so that a future state-sync export can include them, so that a future state-sync import can provide them to `importSwingStore`, so they can be used to validate the otherwise-untrusted swingstore artifact blobs.

Because the swingstore `kvStore` is not itself a Merkle tree, we shadow each kvStore entry into its own export-data entry (transcripts and snapshots and bundles are handled more efficiently). So the stream of export-data deltas will include a lot of overlap: a given `kvStore` key might be changed a dozen times over the course of a block. For performance, we batch the export-data deltas, in the hopes of removing these redundant changes, and only delivering one delta per key.

Internally, the swingstore records the export-data deltas in a SQL table named `pendingExports`. An internal function named `flushPendingExports()` removes the deltas from the table and feeds them to the `exportCallback`. This function is currently called in two places. The first is `kernelStorage.endCrank()`, which is called at the end of each crank. The second is `hostStorage.commit()`, which first does `flushPendingExports()`, and then does the SQL `COMMIT` statement that finally checkpoints the swingstore DB.

Note that `hostStorage` does not feed export-data (swingStore.js: `hostKVStore.set` lacks the `noteExport` calls which `kernelKVStore.set` has). The host application is responsible for populating `hostStorage` during state-sync import to the correct values. This probably also means we can't (safely) produce a state-sync export from a DB that is in the split-commit scenario, but that's ok because we only ever try to make an export after the IAVL commit finishes.

## launch-chain.js tries to do async export-data callbacks
In the name of performance, launch-chain.js takes the synchronous `exportCallback()` invocations and attempts to process them in parallel with kernel execution. The export-data deltas need to be persisted in the IAVL tree, which means they need to be sent over the wall to the golang/cosmos side of the application, which means they travel the `chainSend()` pathway (in a message named `swingStoreUpdateExportData`). The cross-language context switch makes that relatively slow, so we didn't want the kernel to block waiting for it to complete.

So, the swingstore is given `swingStoreExportSyncCallback`, which pushes the changes into a queue, and does some work to initiate a `chainSend`, but does not wait for it to complete. Some sort of async callback is arranged to react to its completion by pulling the next item from the queue and `chainSend`ing that. And then there's a flush operation that waits for everything to complete.

This gives the golang side more time to process most of the export-data deltas. To support this, we'd like as many of the deltas as possible to be delivered early: if the only
`flushPendingExports()` happened during `hostStorage.commit()`, then there wouldn't be much opportunity for parallelization. That's why we added a flush to `endCrank()`: it's a reasonable tradeoff between parallelizability and delta-merging.

@mhofman and I agreed that this code is really hard to follow and is in dire need of refactoring. We think the async aspect of it increases the potential for bugs, doesn't match the intention of the swingstore `exportCallback` API, and might kick us into that "two systems, reliable on their own, but now interact in surprising ways" category.

## Problem 1: Export-Data is a chainSend, chainSend is stored in hostStorage
This finally brings us to the problems. The first is about how export-data deltas are committed.
Each `controller.step()` or `controller.run()` ends with a call to `kernelKeeper.endCrank()`. This is fortunate, because it means we do a `flushPendingExports` before returning control to the host application.

If any other swingstore changes were made, they would still be stuck in the `pendingExports` table when the `COMMIT_BLOCK` directive was processed and `saveOutsideState()` reached the call to `commit()`, at which point they'd be flushed, which would put them in `chainSends`, which would be hanging out in RAM, and which might or might not make it into IAVL before the cosmos-side IAVL commit (I'm not sure how the interlock works there). If the application crashed, they would be lost. If the application survives long enough to get to the next block, they'd be committed in that next block. Overall this would probably cause divergence between nodes that did the commit early and nodes that did it late, and recovering from a split-commit scenario would be dubious.

The swingstore design, and the documented API contract it has with the caller, should make it perfectly acceptable for `commit()` to write out any lingering export-data deltas. It invokes the `exportCallback` before the SQL commit, so it has satisfied its end of the contract. In fact the presence of a `flushPendingExports` inside `endCrank` should be merely an optimization, to benefit parallelization. So I feel it's a bug that we're actually dependent upon that `endCrank` flush for correctness.

To protect against future changes, we need a different design. One idea we discussed was to expose the currently-private `flushPendingExports` as `hostStorage.flushPendingExports()`, and require that the host call it manually, before calling `hostStorage.commit()`. With this approach, we'd make `saveOutsideState` perform the flush explicitly, just before the commit.

We'd probably also make `commit()` assert that the callbacks table is empty. This is a drag: it increases the host's API burden (exposing a previously-hidden implementation detail, and requiring them to engage in a more complex protocol). But it would ensure that the deltas were really flushed by the time the host uses them, and ensure that no additional deltas will appear before the `hostStorage.commit()` call is made.

That could work, but I'd prefer to find a way that doesn't expose the internal details like that.
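Under that idea, the contract could be sketched like this. Maps stand in for the SQL tables, and `noteExport` plus the assertion message are illustrative; none of this is the current swing-store API.

```javascript
// Sketch of the proposed explicit-flush contract: flushPendingExports()
// becomes public, and commit() asserts that nothing is left pending.
function makeHostStorage(exportCallback) {
  const pendingExports = new Map();
  let committed = false;
  return {
    noteExport(key, valueOrNull) {
      pendingExports.set(key, valueOrNull);
    },
    flushPendingExports() {
      for (const [key, valueOrNull] of pendingExports) {
        exportCallback(key, valueOrNull);
      }
      pendingExports.clear();
    },
    commit() {
      if (pendingExports.size > 0) {
        throw new Error('commit() called with unflushed export-data deltas');
      }
      committed = true; // stands in for the SQL COMMIT
    },
    wasCommitted: () => committed,
  };
}

// the host's saveOutsideState would then have to do both steps, in order
function saveOutsideState(hostStorage) {
  hostStorage.flushPendingExports(); // required before commit()
  hostStorage.commit();
}
```

The assertion is what closes the hole: a host which forgets the flush fails loudly at commit time instead of silently deferring deltas into the next block.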
## (interlude: chainSends as Waterken output embargo)
If we look at this in terms of a Waterken IO design, we're using the `hostStorage` `chainSends` as the embargoed-output table. We aren't actually embargoing the outputs (`chainSend()` sends them immediately to the golang side, in addition to recording them). We don't need to embargo because the cosmos side does not have a completely independent lifetime (unlike remote kernels): the IAVL tree is only committed after the SQL DB commits. So in the two-by-two table of (IAVL, swingstore) times (committed, not), we can never be in the corner where IAVL has committed but SQL has not. So we must record the chainSends, but we don't need to hide them.

This is good, because otherwise we couldn't do `VBANK_GET_BALANCE`, which requires a return value. A deep design decision in our platform is whether components can do blocking reads or not. For now, the kernel can do a blocking read of the cosmos side (`VBANK_GET_BALANCE`), and vats can do a blocking read of a device node (`result = syscall.callNow(device_vref, methargs)`). But we need to remove those from vats before we can make parallel execution work (https://github.com/Agoric/agoric-sdk/issues/6447#issuecomment-1291423757).

So `chainSends` are more like a vat's transcript, where we perform the syscalls immediately, but record them just in case we need to play them back in a future restart (bringing a vat worker online). And the trick is that we need this kernel-level transcript-like data to be committed in the same atomicity domain as the kernel's own state, which requires `hostStorage`.

`chainSends` currently have a somewhat tortuous path from `exportCallback` to the `chainSends` array, to `clearChainSends()`, to the `hostStorage.kvStore.set()`. And the async/parallel nature of that path does not give me confidence that every delta has made it to the end by the time we commit.

It might be more direct if `exportCallback` immediately (synchronously) stored its delta in hostStorage, so that `commit()` could be called at any moment, and could do a `flushExportCallbacks` internally, and we'd still always be up-to-date. However the volume of the deltas means that we probably can't afford to update the one hostStorage `chainSends` key on every callback.

So, I'm still looking for an approach that is more satisfying than exposing `flushExportCallbacks` and obligating the caller to use it.

Note that our `mailboxStorage` approach has similar issues, but pre-dates `chainSends` and export-data, and we haven't looked at it for a long time.

## Problem 2: When is upgradeSwingset safe?
This sets up the more immediate problem that I set out to address: when is it safe for us to call `upgradeSwingset`?

The new `upgradeSwingset` call must happen before we build the kernel, because the kernel asserts that the internal kvStore schema version is up-to-date. And if `upgradeSwingset` actually upgrades anything, it will change kvStore values, which will produce export-data deltas.

This needs to happen before we build the kernel, so before `buildVatController()`, which makes it pretty early, well before we're processing any blocks. The swingstore API assumes that `exportCallback` can be called instantly, even during `openSwingstore` (for #8089 upgrade purposes). But the whole `chainSends` getting looped into `hostStorage` thing means that our host app is not actually ready that early (modulo buffering). At the very least, we must get `hostStorage` back from `openSwingstore` before a "cleaner" (blocking/synchronous) `exportCallback` could possibly write anything into it.

We have two levels of buffering which help out (and maybe a third). The first is that swingstore internally buffers the deltas in the `pendingExports` table (so in the pending/uncommitted SQL transaction, on disk, but not durable). They remain there until the first `endCrank` or `commit`.

The second is that cosmic-swingset holds `chainSends` in RAM. The maybe-third is the async queue thing, which (maybe?) allows some `exportCallback` calls to not actually run until later, with a flush mechanism to ensure they actually do run before we call `commit()`.

So I think we can call `upgradeSwingset` early, just after `openSwingstore`. The export-data changes will hang out in the `pendingExports` table until the first `controller.run()` hits an `endCrank`, at which point the `flushExportCallbacks` will start them on their journey to `chainSends`. They should finish that journey by the time we call `commit`, and they'll get committed in the first block after the chain upgrade.

But... what if we don't happen to hit an `endCrank`? I don't think this can happen: if `controller.run()` finds no work to be done, the `finally` block in kernel.js still invokes `endCrank`. But if somehow we avoided an `endCrank`, we'd wind up in the Problem 1 case above, which would spell trouble.
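The startup ordering argued for here can be sketched as follows. All names are stand-ins illustrating the sequence (including the example schema-version delta, which is hypothetical), not the real APIs.

```javascript
// Sketch of the proposed startup sequence: upgradeSwingset's export-data
// deltas stay buffered in pendingExports until the first run()'s endCrank.
function simulateStartup() {
  const events = [];
  const pendingExports = []; // swingstore's internal buffer
  let flushed = [];

  events.push('openSwingstore');
  events.push('upgradeSwingset');
  // hypothetical example delta produced by the upgrade
  pendingExports.push(['kvStoreSchemaVersion', '2']);
  events.push('makeSwingsetController');
  events.push('controller.run');
  // endCrank flushes whatever the upgrade buffered
  flushed = pendingExports.splice(0);
  events.push('endCrank');

  return { events, flushed };
}

const { events, flushed } = simulateStartup();
```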