warner opened this issue 1 week ago
The comms protocol must accommodate each peer falling arbitrarily far behind. When we look at a single direction (e.g. Alice->Bob), the recipient cannot be ahead of the sender (the message hasn't been created yet), but it might have acked the most recent message, or only the previous one, or the one before that, etc, or perhaps it hasn't acked any messages at all. The table of possible states looks like this, where the X axis is which message Alice has emitted, and the Y axis is which message Bob has acked:
|    | a0 | a1 | a2 | a3 |
|----|----|----|----|----|
| b0 | *  | *  | *  | *  |
| b1 |    | *  | *  | *  |
| b2 |    |    | *  | *  |
| b3 |    |    |    | *  |
In contrast, the cosmos/kernel protocol cannot fall more than one step behind: the kernel will not be given work for block 2 until cosmos has committed block 1. There are only two databases, and cosmos strictly commits a given block to IAVL after swingstore has committed the same to SQL. So either both have committed the same thing, or SQL has committed and IAVL has not.
|    | k0 | k1 | k2 | k3 |
|----|----|----|----|----|
| c0 | *  | *  |    |    |
| c1 |    | *  | *  |    |
| c2 |    |    | *  | *  |
| c3 |    |    |    | *  |
This relaxes the "embargoed durable sends" rule of the Waterken protocol. Our `chainSends` must be durable because we might have fallen behind: if the kernel commits block 2, and cosmos has not committed block 2, then when we restart, cosmos needs to re-execute the block, and for that it needs all the `chainSends` from block 2, even though the kernel is not going to re-execute as well. The block-2 sends must be recorded in swingstore, else they might not be available for a restart. And they must remain there until we can't possibly need them anymore, which happens when IAVL commits block-2. It is then easiest to retain them until the kernel starts block-3, or until just before it commits block-3. While we could add an extra step that removes the now-unneeded `chainSends` from swingstore and commit the txn, that would mean three total commits (swingstore, cosmos, swingstore), introducing yet another crash location to consider (IAVL commits, but then we crash before committing the swingstore chainSend deletion, vs crashing after the chainSend deletion but before the end-of-block swingstore commit), and the only benefit would be removing a temporary spike in SQL space usage.
But we don't need the sends to be embargoed, because the recipient won't remember them (i.e. commit IAVL) unless the kernel reaches its own commit point. We're allowed to reveal the bridge-device actions as soon as they happen, just as long as we remember them (for the benefit of cosmos, not the kernel) at least until we start the next block.
The "recipient won't commit early" property is what allows our bridge device messages to have answers. The comms protocol doesn't have this luxury, so all comms messages are async (send, no response).
We basically have three kinds of messages:

- device IO writes (`VBANK_GIVE`, or a chain-storage write)
- device IO reads (`VBANK_GET_BALANCE`)
- export-data writes

The first two are coming from the bridge or mailbox device, and either write to or read from some portion of IAVL. These are the real device IO pathway for the kernel.
The third, export-data writes, is not driven by the kernel; it's driven by swingstore. While the export-data needs to be written into IAVL, and the changes need to be durable (against the "SQL committed but IAVL didn't" case), we could afford to use a different pathway than the device IO messages.
For example, one option would be to accumulate all the export-data callbacks into RAM, and then store them directly into their own portion of `hostStorage` around the `hostStorage.commit` time, and then send them to the cosmos side (sending them to the chain, but not tracking them in `chainSends`). Given the overlap between the keys modified by each crank, this would result in a smaller total volume of changes (e.g. `kvStore.set('crankNumber', ..)` happens once per crank, but really only needs to cause one IAVL change per block). However they would all be delivered at the same time, which might make for a larger message than we'd like, and they would all be processed during the "critical window" between SQL commit and IAVL commit, which would increase the probability of a crash requiring a replay-chainSends recovery.
From swing-store's point of view, export-data is a straightforward part of exports, and export-data callbacks are clearly necessary to allow host applications to incrementally commit to the export contents. When we combine this with the Waterken durable-output IO model, it becomes clear that any host app which uses a second DB and export-data callbacks will need to write the export-data deltas to `hostStorage.kvStore`.
So, what if we baked this into swing-store? Instead of having the host app build its own mechanism around `hostStorage.kvStore` (like `chainSends`), we could have swing-store record the export-data deltas directly into a new section of `hostStorage`, and retrieve them later:
```js
hostStorage.commit();
const deltas = hostStorage.getExportDeltas();
```
Assuming we can tolerate getting the export-data deltas late in the block cycle, then instead of using an immediate `exportCallback`, the swing-store could leave everything in `pendingExports`. Then, during `hostStorage.commit()`, it clears a new table named `exportDeltas`, copies everything from `pendingExports` into `exportDeltas`, and then finally performs the SQL `COMMIT` (or, use just one table and include a generation-number column to distinguish between new and old, or something). A new `hostStorage.getExportDeltas()` function would return (or iterate through) the deltas from that table.
The host app would finish `hostStorage.commit()`, and only then grab the export deltas and feed them into the second DB (IAVL). As described above, this would be less work overall, but might do more of it at an inconvenient or inefficient time.
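To make the shape concrete, here is a hypothetical sketch of that handoff, with the two SQL tables modeled as in-memory Maps. The names `pendingExports`, `exportDeltas`, and `getExportDeltas()` follow the proposal above; none of this is the current swing-store API.

```javascript
// Hypothetical sketch: the two SQL tables are modeled as Maps, and
// commit() stands in for the real SQL COMMIT that would move the rows.
function makeMockSwingStore() {
  const pendingExports = new Map(); // deltas accumulated during the block
  let exportDeltas = new Map(); // snapshot taken at commit time

  return {
    // stands in for the internal noteExport() on every kvStore change
    noteExport(key, valueOrNull) {
      // later writes to a key supersede earlier ones, so a dozen
      // per-crank changes collapse into one delta per block
      pendingExports.set(key, valueOrNull);
    },
    commit() {
      // clear last block's deltas, copy this block's over, then "COMMIT"
      exportDeltas = new Map(pendingExports);
      pendingExports.clear();
    },
    getExportDeltas() {
      return [...exportDeltas.entries()];
    },
  };
}

// usage: the host commits first, then feeds the deltas to the second DB
const hs = makeMockSwingStore();
hs.noteExport('crankNumber', '1');
hs.noteExport('crankNumber', '2'); // supersedes the first write
hs.commit();
const deltas = hs.getExportDeltas();
```

Note how the per-key overwrite gives the delta-merging benefit described above: only the final value of `crankNumber` survives to the commit-time snapshot.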
The host app would still need something like `chainSends` for the device-IO messages (`VBANK_GIVE`/etc), but since those only happen during a `controller.run()`, it can be confident that they've all been performed by the time it considers doing a commit. But this would safely allow e.g. `upgradeSwingset` to create export-data deltas and have them delivered into IAVL, even without a crank being performed.
If we decided that the immediate callback provided performance benefits, we could do both. `openSwingstore(path, { exportCallback })` would still arrange for the callback to be invoked quickly, at the end of each crank, but the callback would no longer be obligated to remember the details (in `chainSends`). Instead, the durability requirement would be satisfied by swingstore automatically updating the `exportDeltas` table, and the recovery path would use `getExportDeltas` at restart time to acquire the list of changes that need to be "replayed", with something like:
```js
function recover() {
  for (const msg of JSON.parse(hostStorage.kvStore.get('chainSends'))) {
    sendToChain(msg);
  }
  for (const [key, valueOrNull] of hostStorage.getExportDeltas()) {
    sendToChain({ type: 'swingsetExportData', key, value: valueOrNull });
  }
}
```
I reviewed the kernel code, and `controller.run()` always ends with a `kernelKeeper.endCrank()`, even if there's no work to do (the `try` block encloses the `if (!message) break` clause, and its `finally` does the `endCrank`, so the only pathways that could skip the `endCrank` are already-panicking and not-`start()`-ed, both of which would be host-app misbehaviors).
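That control flow can be sketched as follows. This is a simplified stand-in for the kernel.js run loop, not the real code; only the try/finally shape is the point.

```javascript
// Simplified sketch of the kernel.js run-loop shape described above: the
// try block encloses the "no work" break, and the finally clause performs
// endCrank() on every pathway out of the loop body, including the
// empty-queue case.
function makeRunLoop(kernelKeeper, getNextMessage, deliver) {
  function run() {
    for (;;) {
      try {
        const message = getNextMessage();
        if (!message) break; // still reaches the finally below
        deliver(message);
      } finally {
        kernelKeeper.endCrank(); // runs even on the break pathway
      }
    }
  }
  return { run };
}

// usage: even with no work at all, endCrank() fires once
let endCrankCalls = 0;
const kernelKeeper = { endCrank: () => { endCrankCalls += 1; } };
const queue = [];
makeRunLoop(kernelKeeper, () => queue.shift(), () => {}).run();
```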
So the swingstore's semi-accidental "we'll buffer export-data callbacks until `endCrank` or `commit`" behavior means the host app doesn't really need to handle callbacks during `openSwingstore`, and the kernel's "`controller.run()` always calls `endCrank`" behavior means the host app doesn't really need to handle callbacks during `commit`.
So I think we don't currently have a vulnerable crash window.
To guard against changes in either swingstore or the kernel, we could enhance cosmic-swingset to assert that the `exportCallback` was not called in the windows for which it wasn't really ready:

- set the `exportCallback` to a function that includes an `assert(exportCallbackReady, 'export-data callback invoked while unprepared')` check
- start with `exportCallbackReady = false` in `openSwingstore`; after we're sure we're no longer in a hangover situation (and the chain-sends buffer has been cleared), set it to `true`, before `upgradeSwingset` or other before-an-END_BLOCK code
- in `saveOutsideState()`, set it to `false` just before the `clearChainSends` (which is what causes us to stop being ready to handle callbacks)
- this would catch a change in swingstore that flushes callbacks somewhere other than `endCrank`, or fails to call `endCrank`, or a change in the host app that does a state-changing kernel call without also managing to trigger `endCrank`
- set it back to `true` .. somewhere? maybe the same `= true` we added above will handle both ready points

I'd like to allow swingstore to flush its callbacks any time, but the current situation will work safely, and the guards above would catch behavioral changes which would violate the assumptions we're currently relying upon.
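A sketch of that guard, with hypothetical helper names (`makeGuardedExportCallback`, `markReady`, `markUnready`); only the assertion pattern matters, and the real change would live in cosmic-swingset's launch-chain.js:

```javascript
// Sketch of the proposed exportCallbackReady guard. The helper names are
// illustrative; only the wrap-and-assert pattern is being proposed.
let exportCallbackReady = false;

function assertReady(cond, msg) {
  if (!cond) throw new Error(msg);
}

// wrap the real callback so any flush in an unprepared window throws
function makeGuardedExportCallback(realCallback) {
  return updates => {
    assertReady(exportCallbackReady, 'export-data callback invoked while unprepared');
    realCallback(updates);
  };
}

// after openSwingstore + hangover recovery (chain-sends buffer cleared)
const markReady = () => { exportCallbackReady = true; };
// in saveOutsideState(), just before clearChainSends()
const markUnready = () => { exportCallbackReady = false; };

// usage
const seen = [];
const guarded = makeGuardedExportCallback(updates => seen.push(updates));
let threw = false;
try {
  guarded([['k', 'v']]); // too early: not ready yet
} catch (e) {
  threw = true;
}
markReady();
guarded([['k', 'v']]); // now fine
markUnready();
```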
Possibly related: https://github.com/Agoric/agoric-sdk/issues/8652
@mhofman and I were analyzing the way cosmic-swingset handles the swingstore `export-data`, and think we identified a small window of vulnerability, during which an application crash would leave the two databases (swingstore's SQLite, and the cosmos IAVL/LevelDB) in an inconsistent state.

I've been looking for the right place to perform the `upgradeSwingset()` call introduced by #9169. This needs to happen after the swingstore is created with `openSwingStore()`, but before `makeSwingsetController()` is called to use it. And the upgrade process will cause changes to the export-data, which might cause the `exportCallback` to be invoked (or might not, depending upon how swingstore buffers and flushes those changes, see below), which means the host must be ready for those callbacks.

Really, the swingstore design assumes that the host is ready for those callbacks from the moment swingstore is given the callback function: it should be legal for swingstore to invoke `options.exportCallback` even before `openSwingstore()` has returned, so that internal upgrades (eg #8089) can get their changes covered by the export-data hashes. But, looking at launch-chain.js, it seems like we can't actually accommodate that.

It's working now because swingstore actually buffers the export-data until a `kernelStorage.endCrank()` (which doesn't happen during startup, only during a `controller.run()`), or a `hostStorage.commit()`, both of which happen late enough that the callbacks are prepared to work.

## Waterken-style Output Embargo
In general, we base our execution model upon the (closely related and cross-inspirational) E and Waterken systems. We start with the E vat model.
To make this model robust against crashes, each "atomicity domain" has a reliable database, and we define a commit/checkpoint model which ensures that no matter where a given domain crashes, when it is restarted from the most recent DB checkpoint, it will proceed forwards in a way that is consistent with what happened before the crash.
We try to follow the Waterken system's "hangover inconsistency prevention" logic in our IO design. The basic idea is that output messages are embargoed until after the end-of-crank commit. During a crank, when a component (vat, or kernel overall) wants to send a message, we instead store it in a table of "pending outbound messages". At the commit point, this table is committed along with all the internal state changes (as well as the beginning-of-crank step that removed a delivery from the input queue). Then, only after commit, the surrounding code can release the outbound messages to some remote system (with a different DB and atomicity domain).
(if we were to release the outbound messages earlier, our host might crash before the commit point, and then the new post-reboot execution might do something different, which is the failure mode we call "hangover inconsistency")
To achieve at-least-once delivery reliability of messages, the IO system needs:

- durable recording of each outbound message until it is acknowledged
- retransmission of unacknowledged messages after a reconnect or restart
- acks which let the sender retire delivered messages
- suppression of duplicate inbound messages on the receiving side
The Waterken server recorded all of its outbound messages as files in a directory, and attempted to re-send all of them each time a TCP (HTTP) connection was established to that peer. Old messages were retired (deleted) upon receipt of an ack, which was always a "high-water mark" (so `ACK: 24` would retire all messages from 0 to 24).

The swingset comms protocol (and device-mailbox) is built around this principle, and provides reliable cross-kernel messaging (which we aren't currently using in cosmic-swingset, but which might get more attention as we look at running multiple communication chains).
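That retirement rule can be sketched as follows. The queue shape and sequence numbering are illustrative, not the real comms-protocol data structures; in the real system the pending messages would be durable.

```javascript
// Sketch of Waterken-style high-water-mark retirement: every outbound
// message stays queued until an ack covers its sequence number, and a
// reconnect re-sends everything still pending.
function makeOutbox() {
  const pending = new Map(); // seqnum -> message
  let nextSeqnum = 0;
  return {
    send(message) {
      pending.set(nextSeqnum, message);
      nextSeqnum += 1;
    },
    // on a new connection, retransmit everything not yet acked
    resendAll(transmit) {
      for (const [seqnum, message] of pending) transmit(seqnum, message);
    },
    // "ACK: 24" retires all messages from 0 through 24
    ack(highWaterMark) {
      for (const seqnum of [...pending.keys()]) {
        if (seqnum <= highWaterMark) pending.delete(seqnum);
      }
    },
    pendingCount: () => pending.size,
  };
}

// usage
const outbox = makeOutbox();
outbox.send('m0');
outbox.send('m1');
outbox.send('m2');
outbox.ack(1); // retires m0 and m1, leaves m2 pending
```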
## Cosmos vs Swingset Atomicity Domains
Unfortunately, Cosmos and Swingset use separate databases. Very early on, we explored keeping all Swingset state in the Cosmos IAVL tree, but concluded that it would be too slow, and would prevent us from taking advantage of SQL features like `CREATE INDEX` that would provide a better data model (we'd have to encode everything into the IAVL key space, with a single lexicographic iteration primitive).

That means we have two atomicity domains, one per database. At end-of-block, we use `hostStorage.commit()` to perform a SQL `COMMIT` of the swingstore, and then a moment later, cosmos does a commit of the IAVL tree (i.e. a LevelDB `commit()`) to record all the cosmos-side state changes.

Assuming both systems are reliable on their own, this is safe against crashes that occur before the swingstore commit, or after the IAVL commit. When the application is rebooted, it will resume from a mutually-consistent state.
But, of course, we must also survive a crash which occurs between the two commit points. To achieve this, we apply a form of the Waterken protocol to the connection between Cosmos and Swingset. It's not quite the same, but it has similar attributes.
On each block, the cosmos/golang side sends directives into the kernel side, like `AG_COSMOS_INIT` / `BEGIN_BLOCK` / `END_BLOCK` / `COMMIT_BLOCK` (these are processed in launch-chain.js). Each one contains the block height that the cosmos side is currently executing. Upon receipt of `END_BLOCK`, the kernel side does a bunch of execution, and then commits its state, including a note about the block height that it just executed. As that execution runs, the kernel will perform device IO calls (mostly device-bridge messages like `VBANK_GIVE` and `VBANK_GET_BALANCE`), which are sent over to the cosmos side for processing.

We capture those device IO calls (and their results) in a list named `savedChainSends`. The list is stored durably.

Then, when the kernel is given a directive (`BEGIN_BLOCK` in particular), it compares the incoming block height against the last block it remembers executing and committing. In the normal case, the incoming height will be one higher than the one we remember, e.g. "please `BEGIN_BLOCK` of block 12" plus "I remember executing block 11".

But it might be equal (e.g. both 12), which means the kernel is being asked to execute a block that it already executed. This is the "ignore duplicate inbound messages" portion of the Waterken protocol. It means that we previously crashed after the swingstore commit (which captured the state changes of block 12), but before Cosmos managed to commit the block-12 IAVL changes. There are device IO calls that cosmos does not remember getting, and it needs them to achieve the correct block-12 state.
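The height comparison might be sketched like this. The function is illustrative; launch-chain.js spreads this logic across several handlers rather than a single helper.

```javascript
// Sketch of the duplicate-block check described above: decide between
// normal execution and a savedChainSends replay based on the incoming
// height versus the last height swingstore committed.
function decideBlockAction(incomingHeight, lastCommittedHeight) {
  if (incomingHeight === lastCommittedHeight + 1) {
    return 'execute'; // normal case: the next block in sequence
  }
  if (incomingHeight === lastCommittedHeight) {
    // we crashed after the swingstore commit but before the IAVL commit:
    // replay savedChainSends instead of re-executing the block
    return 'replay-chainSends';
  }
  // anything else means the two stores are more than one step apart,
  // which the commit ordering is supposed to make impossible
  throw new Error(
    `cannot handle block ${incomingHeight} after ${lastCommittedHeight}`,
  );
}
```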
So in this case, we replay the `savedChainSends` back to cosmos instead of performing the swingset execution. This is very much like the way that vat workers are brought back up-to-date (from a heap snapshot) by replaying a transcript. In this case, we don't really care about the return values of the `VBANK_GET_BALANCE` calls: the kernel doesn't need them, although we might assert that cosmos returns the same values it did the first time around.

We need to save these chainSends durably so they'll survive the host crash, and we need to save them in the same atomicity domain (eg same DB) as the kernel state. But they're outside of the kernel's data model, so they must not interfere with `kernelStorage`. This is what `swingStore.hostStorage` was created for: a separate chunk of the `kvStore` table that can be used by the host app to track whatever it needs, but which gets committed at exactly the same time as everything in `kernelStorage`.

A function named `clearChainSends` empties the in-RAM list of chainSends and returns an Array, so the data can be stored durably into `hostStorage`. The `saveOutsideState` function does this, and stores `blockHeight`, and then calls `hostStorage.commit()`. This is called when processing the `COMMIT_BLOCK` directive.

## Swingstore Export-Data
Swingstore's `export-data` is a list of key-value pairs which provide integrity validation for the swingstore export artifacts: when you want to export the state of a swingstore, you get the `export-data` list and you get a list of named artifacts; when you want to import it, you feed it the `export-data` list and then you feed it the artifacts. The importer relies upon the accuracy of the `export-data`, but it does not rely upon the accuracy of the artifacts.

This drastically improves the performance of validated exports: for each block, we update a little bit of export-data and store it into the cosmos IAVL state, but we do not need to emit all the artifacts. Artifacts are not generated until a cosmos state-sync export is requested, which might only be once per day (depending on how the node is configured). The cosmos/CometBFT state-sync import process will reliably populate IAVL completely, and verify the IAVL root against the chain's consensus state. We leverage that to provide validation of the swingstore artifacts.
So, as the kernel processes a block, it makes changes to the swingstore state, which causes changes to the export-data records. The swingstore API allows the host application to configure an `exportCallback`, which is called with these export-data deltas. The host is expected to apply these deltas to its own durable state (eg IAVL), so that a future state-sync export can include them, so that a future state-sync import can provide them to `importSwingStore`, so they can be used to validate the otherwise-untrusted swingstore artifact blobs.

Because the swingstore `kvStore` is not itself a Merkle tree, we shadow each kvStore entry into its own export-data entry (transcripts and snapshots and bundles are handled more efficiently). So the stream of export-data deltas will include a lot of overlap: a given `kvStore` key might be changed a dozen times over the course of a block. For performance, we batch the export-data deltas, in the hopes of removing these redundant changes, and only delivering one delta per key.

Internally, the swingstore records the export-data deltas in a SQL table named `pendingExports`. An internal function named `flushPendingExports()` removes the deltas from the table and feeds them to the `exportCallback`. This function is currently called in two places. The first is `kernelStorage.endCrank()`, which is called at the end of each crank. The second is `hostStorage.commit()`, which first does `flushPendingExports()`, and then does the SQL `COMMIT` statement that finally checkpoints the swingstore DB.

Note that `hostStorage` does not feed export-data (swingStore.js: `hostKVStore.set` lacks the `noteExport` calls which `kernelKVStore.set` has). The host application is responsible for populating `hostStorage` during state-sync import to the correct values. This probably also means we can't (safely) produce a state-sync export from a DB that is in the split-commit scenario, but that's ok because we only ever try to make an export after the IAVL commit finishes.

## launch-chain.js tries to do async export-data callbacks
In the name of performance, launch-chain.js takes the synchronous `exportCallback()` invocations and attempts to process them in parallel with kernel execution. The export-data deltas need to be persisted in the IAVL tree, which means they need to be sent over the wall to the golang/cosmos side of the application, which means they travel the `chainSend()` pathway (in a message named `swingStoreUpdateExportData`). The cross-language context switch makes that relatively slow, so we didn't want the kernel to block waiting for it to complete.

So, the swingstore is given `swingStoreExportSyncCallback`, which pushes the changes into a queue, and does some work to initiate a `chainSend`, but does not wait for it to complete. Some sort of async callback is arranged to react to its completion by pulling the next item from the queue and `chainSend`ing that. And then there's a flush operation that waits for everything to complete.

This gives the golang side more time to process most of the export-data deltas. To support this, we'd like as many of the deltas as possible to be delivered early: if the only
`flushPendingExports()` happened during `hostStorage.commit()`, then there wouldn't be much opportunity for parallelization. That's why we added a flush to `endCrank()`: it's a reasonable tradeoff between parallelizability and delta-merging.

@mhofman and I agreed that this code is really hard to follow and is in dire need of refactoring. We think the async aspect of it increases the potential for bugs, doesn't match the intention of the swingstore `exportCallback` API, and might kick us into that "two systems, reliable on their own, but now interact in surprising ways" category.

## Problem 1: Export-Data is a chainSend, chainSend is stored in hostStorage
This finally brings us to the problems. The first is about how export-data deltas are committed.
Each `controller.step()` or `controller.run()` ends with a call to `kernelKeeper.endCrank()`. This is fortunate, because it means we do a `flushPendingExports` before returning control to the host application.

If any other swingstore changes were made, they would still be stuck in the `pendingExports` table when the `COMMIT_BLOCK` directive was processed and `saveOutsideState()` reached the call to `commit()`, at which point they'd be flushed, which would put them in `chainSends`, which would be hanging out in RAM, and which might or might not make it into IAVL before the cosmos-side IAVL commit (I'm not sure how the interlock works there). If the application crashed, they would be lost. If the application survives long enough to get to the next block, they'd be committed in that next block. Overall this would probably cause divergence between nodes that did the commit early and nodes that did it late, and recovering from a split-commit scenario would be dubious.

The swingstore design, and the documented API contract it has with the caller, should make it perfectly acceptable for `commit()` to write out any lingering export-data deltas. It invokes the `exportCallback` before the SQL commit, so it has satisfied its end of the contract. In fact the presence of a `flushPendingExports` inside `endCrank` should be merely an optimization, to benefit parallelization. So I feel it's a bug that we're actually dependent upon that `endCrank` flush for correctness.

To protect against future changes, we need a different design. One idea we discussed was to expose the currently-private `flushPendingExports` as `hostStorage.flushPendingExports()`, and require that the host call it manually, before calling `hostStorage.commit()`. With this approach, we'd make `saveOutsideState` perform the flush explicitly, just before the commit.

We'd probably also make `commit()` assert that the callbacks table is empty. This is a drag: it increases the host's API burden (exposing a previously-hidden implementation detail, and requiring them to engage in a more complex protocol). But it would ensure that the deltas were really flushed by the time the host uses them, and ensure that no additional deltas will appear before the `hostStorage.commit()` call is made.

That could work, but I'd prefer to find a way that doesn't expose the internal details like that.
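Under that idea, the contract could be sketched like this. Maps stand in for the SQL tables, and `noteExport` plus the assertion message are illustrative; none of this is the current swing-store API.

```javascript
// Sketch of the proposed explicit-flush contract: flushPendingExports()
// becomes public, and commit() asserts that nothing is left pending.
function makeHostStorage(exportCallback) {
  const pendingExports = new Map();
  let committed = false;
  return {
    noteExport(key, valueOrNull) {
      pendingExports.set(key, valueOrNull);
    },
    flushPendingExports() {
      for (const [key, valueOrNull] of pendingExports) {
        exportCallback(key, valueOrNull);
      }
      pendingExports.clear();
    },
    commit() {
      if (pendingExports.size > 0) {
        throw new Error('commit() called with unflushed export-data deltas');
      }
      committed = true; // stands in for the SQL COMMIT
    },
    wasCommitted: () => committed,
  };
}

// the host's saveOutsideState would then have to do both steps, in order
function saveOutsideState(hostStorage) {
  hostStorage.flushPendingExports(); // required before commit()
  hostStorage.commit();
}
```

The assertion is what closes the hole: a host which forgets the flush fails loudly at commit time instead of silently deferring deltas into the next block.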
## (interlude: chainSends as Waterken output embargo)
If we look at this in terms of a Waterken IO design, we're using the `hostStorage` `chainSends` as the embargoed-output table. We aren't actually embargoing the outputs (`chainSend()` sends them immediately to the golang side, in addition to recording them). We don't need to embargo because the cosmos side does not have a completely independent lifetime (unlike remote kernels): the IAVL tree is only committed after the SQL DB commits. So in the two-by-two table of (IAVL, swingstore) times (committed, not), we can never be in the corner where IAVL has committed but SQL has not. So we must record the chainSends, but we don't need to hide them.

This is good, because otherwise we couldn't do `VBANK_GET_BALANCE`, which requires a return value. A deep design decision in our platform is whether components can do blocking reads or not. For now, the kernel can do a blocking read of the cosmos side (`VBANK_GET_BALANCE`), and vats can do a blocking read of a device node (`result = syscall.callNow(device_vref, methargs)`). But we need to remove those from vats before we can make parallel execution work (https://github.com/Agoric/agoric-sdk/issues/6447#issuecomment-1291423757).

So `chainSends` are more like a vat's transcript, where we perform the syscalls immediately, but record them just in case we need to play them back in a future restart (bringing a vat worker online). And the trick is that we need this kernel-level transcript-like data to be committed in the same atomicity domain as the kernel's own state, which requires `hostStorage`.

`chainSends` currently have a somewhat tortuous path from `exportCallback` to the `chainSends` array, to `clearChainSends()`, to the `hostStorage.kvStore.set()`. And the async/parallel nature of that path does not give me confidence that every delta has made it to the end by the time we commit.

It might be more direct if `exportCallback` immediately (synchronously) stored its delta in hostStorage, so that `commit()` could be called at any moment, and could do a `flushExportCallbacks` internally, and we'd still always be up-to-date. However the volume of the deltas means that we probably can't afford to update the one hostStorage `chainSends` key on every callback.

So, I'm still looking for an approach that is more satisfying than exposing `flushExportCallbacks` and obligating the caller to use it.

Note that our `mailboxStorage` approach has similar issues, but pre-dates `chainSends` and export-data, and we haven't looked at it for a long time.

## Problem 2: When is upgradeSwingset safe?
This sets up the more immediate problem that I set out to address: when is it safe for us to call `upgradeSwingset`?

The new `upgradeSwingset` call must happen before we build the kernel, because the kernel asserts that the internal kvStore schema version is up-to-date. And if `upgradeSwingset` actually upgrades anything, it will change kvStore values, which will produce export-data deltas.

This needs to happen before we build the kernel, so before `buildVatController()`, which makes it pretty early, well before we're processing any blocks. The swingstore API assumes that `exportCallback` can be called instantly, even during `openSwingstore` (for #8089 upgrade purposes). But the whole `chainSends` getting looped into `hostStorage` thing means that our host app is not actually ready that early (modulo buffering). At the very least, we must get `hostStorage` back from `openSwingstore` before a "cleaner" (blocking/synchronous) `exportCallback` could possibly write anything into it.

We have two levels of buffering which help out (and maybe a third). The first is that swingstore internally buffers the deltas in the `pendingExports` table (so in the pending/uncommitted SQL transaction, on disk, but not durable). They remain there until the first `endCrank` or `commit`.

The second is that cosmic-swingset holds `chainSends` in RAM. The maybe-third is the async queue thing, which (maybe?) allows some `exportCallback` calls to not actually run until later, with a flush mechanism to ensure they actually do run before we call `commit()`.

So I think we can call `upgradeSwingset` early, just after `openSwingstore`. The export-data changes will hang out in the `pendingExports` table until the first `controller.run()` hits an `endCrank`, at which point the `flushExportCallbacks` will start them on their journey to `chainSends`. They should finish that journey by the time we call `commit`, and they'll get committed in the first block after the chain upgrade.

But... what if we don't happen to hit an `endCrank`? I don't think this can happen: if `controller.run()` finds no work to be done, the `finally` block in kernel.js still invokes `endCrank`. But if somehow we avoided an `endCrank`, we'd wind up in the Problem 1 case above, which would spell trouble.
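The startup ordering argued for here can be sketched as follows. All names are stand-ins illustrating the sequence (including the example schema-version delta, which is hypothetical), not the real APIs.

```javascript
// Sketch of the proposed startup sequence: upgradeSwingset's export-data
// deltas stay buffered in pendingExports until the first run()'s endCrank.
function simulateStartup() {
  const events = [];
  const pendingExports = []; // swingstore's internal buffer
  let flushed = [];

  events.push('openSwingstore');
  events.push('upgradeSwingset');
  // hypothetical example delta produced by the upgrade
  pendingExports.push(['kvStoreSchemaVersion', '2']);
  events.push('makeSwingsetController');
  events.push('controller.run');
  // endCrank flushes whatever the upgrade buffered
  flushed = pendingExports.splice(0);
  events.push('endCrank');

  return { events, flushed };
}

const { events, flushed } = simulateStartup();
```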