scheme for parallel/background vat execution #6447

Open warner opened 1 year ago

warner commented 1 year ago

This ticket documents a design that should allow swingset to execute deliveries on multiple vats in parallel, continuously (even while the kernel is not inside a controller.run(), such as during the 5 seconds that Tendermint/Cosmos-SDK spends doing voting/consensus work), while still maintaining a consensus order of deliveries. We'd like this to improve performance over our current scheme, which only executes one vat delivery at a time, single-threaded, and only within controller.run.

The first section defines the "CDP" model, upon which the rest is based. The second section describes how to rewrite the vat worker to fit this model, and how DB commits are arranged to avoid "hangover inconsistency".

CDPs: Communicating Deterministic Processes

This section defines a "CDP" (Communicating Deterministic Process). This is a refinement of CSP (where the S means "Sequential") and the Waterken model, designed to tolerate processes being terminated at unpredictable times, but also to make more opportunities for parallelism.

A CDP is an abstract process which evolves through a sequence of "steps", each of which accepts an input and produces an output. At its core, the CDP is defined by a completely deterministic transition function (of input and previous state) which produces an output and the new state.

[figure: CDP-1]
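As a rough illustration (the names here are invented for this sketch, not the real swingset API), the core of a CDP boils down to a pure function of previous state and input:

```js
// Illustrative sketch only: a CDP's core is a completely deterministic
// transition function from (previousState, input) to (newState, output).
// No clocks, no randomness, no I/O.
const transition = (previousState, input) => {
  const newState = { ...previousState, lastStepInput: input };
  const output = { echo: input };
  return { newState, output };
};
```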

The CDP might exist as a standalone process, or it might share a process with another CDP. Each CDP has its own commit points, and CDPs interact in such a way that one CDP can crash without undermining the assumptions and reliances that other CDPs might have on it.

All blockchains with finalization (e.g. Cosmos-SDK -based chains) behave like a CDP, in which each step is called a "block" or "blockHeight", and the inputs are the contents of the finalized block (mostly the set of transactions to be executed). In the Agoric system, the host application (cosmic-swingset / agd) acts as one CDP, the swingset kernel acts as a second (because of the separate kerneldb commit point), and there are additional CDPs for each of the N active vats.

State vs Runtime Incarnation

We split the CDP into "state" and "runtime". The state is long-lived, created when the CDP is initialized, and surviving until we really don't need it anymore. The state is recorded in a database, and each commit records one or more steps (never a partial step). The state can only be changed by an active runtime.

The runtime is a live OS process (either standalone or embedded in some other process, possibly shared with other CDPs). At any moment, each CDP has either exactly one active runtime (the CDP is currently "online"), or zero ("offline"). CDPs may be kept offline if they are not currently in use (because keeping one around costs some RAM even if nobody talks to it), or they may be preemptively brought online (because starting one costs some CPU time and latency, slowing down response time). All the API calls are made on a runtime, so offline CDPs cannot be manipulated.

Each time a runtime is created, we get a new "runtime incarnation". A CDP's current incarnation may remember slightly more than its persistent state, e.g. steps that it has performed but which have not yet been committed. When an incarnation is shut down or killed, the next incarnation may not remember those steps.

Inputs and Outputs

The submitInput(stepNum, input) method is used to submit a step input, and collectOutput(stepNum) is used to collect a step output. The type/shape of the inputs and outputs are specific to the CDP and its state transformation function.

submitInput() does not block, returns nothing, and is not obligated to execute the given step right away. It can be called multiple times (without intervening collectOutput calls) to fill the execution pipeline, allowing the CDP to perform work in the background or in parallel with other processes. Within any single incarnation (and barring a revert, described below), the sequence of submitInput() calls must use sequential stepNum values: no gaps, no repeats.

collectOutput(stepNum) is an async function whose return Promise does not resolve until the given step has been executed. If the CDP managed to process the requested step in the background, collectOutput may resolve immediately, otherwise the caller must wait until enough steps have been executed to produce the requested answer. It is an error to collect an output of a stepNum for which the CDP has never been given an input (although the input may have been given to an earlier incarnation). As with submitInput, within a single incarnation, the outputs must be collected from sequential stepNum values, with no gaps or repeats.

[figure: CDP-3]

In general, the caller should provide inputs as soon as they are known (and well before the outputs are needed), to keep the pipeline full and maximize the opportunities for efficient parallel execution. If each submitInput() is followed by an immediate await collectOutput() for the same stepNum, there will be few opportunities for parallelism.
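For illustration, a caller that wants parallelism might fill the pipeline like the sketch below (the `cdp` handle and helper names are hypothetical; only submitInput/collectOutput come from this design):

```js
// Sketch: submit several step inputs up front (submitInput never blocks),
// then collect the outputs in order. Each await only blocks if the CDP has
// not executed that far yet in the background.
const runPipelined = async (cdp, firstStepNum, inputs) => {
  inputs.forEach((input, i) => cdp.submitInput(firstStepNum + i, input));
  const outputs = [];
  for (let i = 0; i < inputs.length; i += 1) {
    outputs.push(await cdp.collectOutput(firstStepNum + i));
  }
  return outputs;
};
```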

In the Agoric system, the kernel CDP only performs one step (block) at a time, but vats may be able to execute many steps (deliveries) in parallel.

Replayed Steps

When a CDP runtime is brought online, its DB will remember some number of inputs, and some number of outputs, based upon which steps had been completed before the last commit() performed by its predecessor.

The caller is using a different database, with different commit points. So the caller may have submitted an input and collected an output for e.g. step 5, but then crashed before it was able to perform its own commit. When the caller starts back up again, it will re-submit the input for step 5. This is called a "replayed input", and CDPs must tolerate them.

There are two cases. In the first, the CDP received submitInput(5, input5) but crashed before a commit(). In this case, the second incarnation can trivially tolerate the replay, because it does not remember the original. We rely upon the caller submitting exactly the same input, to avoid confusion.

In the second, the CDP did commit() after receiving input5, and perhaps later inputs as well. We call the highest such committed input "highestSubmittedInputStep". We track another number named "highestRetiredStep" (retirement is described below).

When a CDP runtime is started, the new incarnation's first submitInput() call will accept any stepNum in a range from "highestRetiredStep" (exclusive) to "highestSubmittedInputStep" (inclusive). The second submitInput() call must use a stepNum exactly one higher than the first one, etc.

If the CDP detects that a submitInput() call is replaying an input, it refrains from actually executing the step. The "next expected stepNum" counter is incremented (or initialized), but no other state changes are made. The caller cannot tell the difference; the CDP merely "pretends" to perform the step.

A CDP should ideally have a way to compare the replayed inputs it receives against their originals, so it can throw an error if they diverge. This might help detect errors made by the caller. A single hash of the input would be sufficient to detect divergence and throw an error; to actually diagnose the problem, it would be better to remember the complete input (so the error message can display both the original and the faulty replay). However, given that a replay of an uncommitted original cannot be detected by the CDP, it's not worth putting too much energy or code into remembering the originals.

Likewise, the caller may ask for collectOutput(7) from one incarnation, then crash before persisting the results. When the caller restarts, it creates a new CDP runtime incarnation, and asks it (again) for collectOutput(7). The CDP must return the same outputs that its predecessor did. As above, the first collectOutput() call that the new incarnation receives is allowed to use a stepNum that meets highestRetiredStep < stepNum <= highestSubmittedInputStep, and all successive calls within that incarnation must use the next highest sequential stepNum. The CDP also tracks highestCollectedOutputStep.

This means that each time a step is executed, the CDP must store the full output into its DB, so it can report a consistent+stable value to the caller (now, soon, or from some future incarnation).
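A hedged sketch of the bookkeeping this implies, with illustrative names (a real implementation would execute steps in the background rather than eagerly, and would persist these counters in its DB):

```js
import assert from 'node:assert';

// Sketch: replay-tolerant submitInput/collectOutput. `store` persists
// highestRetiredStep, highestSubmittedInputStep, and the saved outputs;
// `executeStep` is the deterministic transition function.
const makeCdpRuntime = (store, executeStep) => {
  let nextExpectedInput; // undefined until the incarnation's first submitInput
  return {
    submitInput(stepNum, input) {
      if (nextExpectedInput === undefined) {
        // First call of this incarnation: anything in the window
        // (highestRetiredStep, highestSubmittedInputStep + 1] is accepted.
        assert(stepNum > store.highestRetiredStep);
        assert(stepNum <= store.highestSubmittedInputStep + 1);
      } else {
        assert.equal(stepNum, nextExpectedInput); // no gaps, no repeats
      }
      nextExpectedInput = stepNum + 1;
      if (stepNum <= store.highestSubmittedInputStep) {
        return; // replayed input: "pretend", make no other state changes
      }
      store.highestSubmittedInputStep = stepNum;
      store.saveOutput(stepNum, executeStep(stepNum, input)); // eager, for brevity
    },
    async collectOutput(stepNum) {
      assert(stepNum > store.highestRetiredStep);
      assert(stepNum <= store.highestSubmittedInputStep);
      store.highestCollectedOutputStep = stepNum;
      return store.loadOutput(stepNum); // same output, now or in a later incarnation
    },
  };
};
```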

Commit

To improve performance, the CDP is not obligated to commit its DB after every single step. It must never commit the partial states that occur in the middle of a step, but it is allowed to commit at any point between two complete steps.

But to enable the caller to rely upon the stability of the outputs, the caller can insist upon a commit point, by calling the CDP's commit() API. This async method will not resolve until the state and outputs of every step up to and including highestCollectedOutputStep have been committed. The CDP may have performed more work (whose output hasn't been collected yet): if so, these extra steps are committed too. But the caller is only allowed to rely upon the retention of the steps that it has collected.

In general, the CDP should use DB transactions or nested transactions to ensure that only complete steps might be committed, but refrain from doing an actual commit until requested by commit(). If the RAM requirements of the uncommitted base transaction become an issue, it can perform an unsolicited commit without causing problems, but minimizing commits (and their slow fsync() calls) will generally yield better performance.

Note: this is specifically intended to take advantage of SQLite's fast "WAL" mode, with PRAGMA synchronous = NORMAL. In this mode, the DB only records changes when a transaction is closed (preventing partial-step commits), but the DB is vulnerable to power failures and host OS crashes until a "checkpoint" occurs. These checkpoints happen spontaneously when the WAL grows large enough (so we must tolerate them), but can be forced with a PRAGMA wal_checkpoint(FULL). The CDP commit() API should force a checkpoint before returning.
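A sketch of what that looks like in practice, assuming a better-sqlite3 binding (the binding choice and helper names are assumptions, not part of this design):

```js
import Database from 'better-sqlite3';

// Sketch: WAL mode with relaxed syncing. Closing each per-step transaction
// records it in the WAL (never a partial step), but durability against power
// failure only comes from the checkpoint forced by commit().
const openCdpStore = path => {
  const db = new Database(path);
  db.pragma('journal_mode = WAL');
  db.pragma('synchronous = NORMAL');
  return {
    beginStep: () => db.prepare('BEGIN IMMEDIATE').run(),
    endStep: () => db.prepare('COMMIT').run(), // complete step, not yet durable
    commit: () => db.pragma('wal_checkpoint(FULL)'), // now durable
  };
};
```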

Retire

To keep the CDP's "old-outputs storage" obligation bounded, the caller tells the CDP when it is safe to "retire" a step.

Once the caller collects the outputs of e.g. step 7, it will perform some other work, and then eventually commit its own state. Once the caller commits, it can use retire(7) to inform the CDP that it will never ask for the output of step 7 again (nor will it attempt to submit the inputs for step 7). The CDP can then delete the saved output for that step, as well as whatever diagnostic information it retained about the inputs (perhaps a hash, perhaps the full input data). The caller cannot safely retire a step until it has itself committed, otherwise a crash (after the collectOutput(7) but before the caller's commit) would violate the caller's promise. The CDP won't necessarily commit its memory of the retirement promptly (it is not obligated to do so until the commit() API is called), but since it is allowed to commit earlier, the caller must not reveal its intention to retire() until it is really ready for the old data to go away.

The caller might collect the outputs of multiple steps (7, 8, 9) before it gets to a convenient commit point. In that case, it only has to call retire(9), and that will implicitly retire all earlier steps.

A call to retire(9) will also set highestRetiredStep = 9. This value lives in the DB, but is not deserving of its own commit. The value will be committed as a side effect of the next step-execution or commit() -triggered DB commit.

In general, a non-pipelining caller will execute in a loop like the following (suppose the recorded stepNum in the first pass is 6, so the first input submitted is for step 7):
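A hedged sketch of the shape of that loop (the `cdp` and `caller` handles are illustrative):

```js
// Sketch of a non-pipelining caller loop (illustrative names).
const runCallerLoop = async (cdp, caller, lastCommittedStepNum) => {
  let stepNum = lastCommittedStepNum; // e.g. 6 on the first pass
  for (;;) {
    stepNum += 1;                 // first input submitted is for step 7
    cdp.retire(stepNum - 1);      // safe: the caller committed it last pass
    cdp.submitInput(stepNum, await caller.gatherInput(stepNum));
    const output = await cdp.collectOutput(stepNum);
    await cdp.commit();           // CDP's state and output are now durable
    await caller.applyAndCommit(stepNum, output); // caller's own commit point
  }
};
```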

The caller/CDP might crash at various points in this loop:

By performing the retire() at the beginning of the loop, the caller will start each incarnation by clearing out the unnecessary data from its predecessor.

Revert

Some CDPs are one-way, but others can support the revert(backToStepNum) operation. If supported, the caller can instruct the CDP to revert its internal state back to that of some previous step, at which point the caller is allowed to submit different inputs (and can observe different outputs) than before.

revert is generally expensive, and should only be used for exceptional cases. In the Agoric system, the primary example is an important vat whose latest delivery causes an internal fault, which we think we can repair through some sort of vat upgrade process. By reverting the vat to the state just before the fatal delivery, we can arrange for the next delivery to be something which changes/upgrades the internal state (to fix the bug). Another possibility is a metering fault (too much computation), where we don't want the vat to get away with using more computrons than it paid for, so we unwind the expensive delivery and require the vat to pay their compute bills before allowing that delivery to be repeated (note there are some questionable economics here, but the example is still useful). An Agoric vat worker would implement revert by discarding the xsnap worker and building a new one from the most recent heap snapshot (which must have been made before the target step), and omitting a few deliveries from the end of the transcript it replays (everything past the target step). It would also delete the omitted transcript entries, so they don't accidentally reappear in some future incarnation.

There is a limit to how many steps can be reverted: backToStepNum > highestRetiredStep, where highestRetiredStep is updated by each retire() call. A revert-capable CDP must remember (or be capable of regenerating) all old states back to highestRetiredStep + 1. For the Agoric vat worker, that means we must retain at least one heap snapshot with stepNum <= highestRetiredStep + 1 (which might cause us to remember multiple heap snapshots for a single vat).

API Summary

The library that provides a CDP with a given transition function will provide the following API:
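A hedged JSDoc-style summary of the methods described above (shapes are illustrative):

```js
/**
 * Sketch of the CDP runtime API described above (illustrative shapes;
 * inputs/outputs are specific to the CDP's transition function).
 *
 * @typedef {object} CdpRuntime
 * @property {(stepNum: number, input: unknown) => void} submitInput
 *   Non-blocking; stepNums are sequential within an incarnation.
 * @property {(stepNum: number) => Promise<unknown>} collectOutput
 *   Resolves once the step has executed; stable across incarnations.
 * @property {() => Promise<void>} commit
 *   Resolves once everything up to highestCollectedOutputStep is durable.
 * @property {(stepNum: number) => void} retire
 *   Caller promises never to resubmit/recollect this step or earlier ones.
 * @property {(backToStepNum: number) => Promise<void>} [revert]
 *   Optional; only back to highestRetiredStep + 1.
 */
```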

Swingset Kernel as a CDP

Now that we have the CDP framework, how are the various Agoric/Swingset layers expressed?

The cosmic-swingset layer is already pretty well suited. As a block is processed, our cosmos-sdk Handlers push "actions" like bridgeInbound and deliverInbound (defined in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/action-types.js) into a queue on the Golang side. When we get to the END_BLOCK event (https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/launch-chain.js), the JS code submits these actions into the kernel, and then runs the kernel until it reaches the runPolicy limit. As the kernel runs, it emits responses (chainSend) like storage writes. Between chain-main.js and launch-chain.js, the cosmic-swingset package has code that tolerates one block's worth of replay, to handle the case where the kernel commits its DB but the process crashes before cosmos-sdk can commit its own.

We can thus define cosmic-swingset's kernel as a non-reverting CDP whose "steps" are blocks, whose inputs are the "actions", and whose outputs are the chainSend messages. The host application performs steps without any sort of pipelining: each submitInput is immediately followed by a matching await collectOutput(), then an await commit(). The retire method is omitted, and the kernel is implicitly allowed to forget everything beyond one step in the past.
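A hedged sketch of that per-block flow (function and parameter names are illustrative, not the actual launch-chain.js code):

```js
// Sketch: the host drives the kernel CDP one block at a time, no pipelining.
const doBlock = async (kernelCDP, blockHeight, actions, chainSend) => {
  kernelCDP.submitInput(blockHeight, actions);            // the block's actions
  const outputs = await kernelCDP.collectOutput(blockHeight);
  outputs.forEach(msg => chainSend(msg));                 // storage writes, etc.
  await kernelCDP.commit();                               // before cosmos-sdk commits
};
```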

One constraint is that the outputs/chainSend calls are not allowed to have return values. The most likely example would be a chain-storage read, however I don't think we currently do any of those (there might be a method to return the externally-queryable chain-storage IAVL/RPC path: we'd need to implement that in a different way).

Swingset Vat as a CDP

To take advantage of the parallelism offered by CDPs, we need to make some changes to the kernel/worker boundary:

With vatstoreGet/vatstoreGetNext/etc out of the way, syscall.callNow (i.e. device-node invocation) is the only remaining syscall with a non-trivial return value. In practice, we have two kinds of device-node interaction:

The chain-storage writes do not have a return value, so we really only need an answer for getBundle. I think we can manage this by following through on "blobcaps" (#46), which would be a new form of kref/vref (e.g. bNN and vb-NN), that refers to some sequence of immutable bytes. The kernel needs to pay attention to the vrefs being translated in a VatDeliveryObject and notice when a new one has been added to the c-list. That delivery should include a copy of the data, so the worker can stash it in its DB for later access. We'll probably need refcounts on these, or (given how infrequently we use them) we might just decide to declare that vat workers remember the blob forever.

Once that's done, we can express each vat as a CDP where:

To satisfy other compatibility/stability requirements, I think we'll wind up with all of the vat worker's components (endo/lockdown, supervisor, liveslots) living in a single swingset-vat-worker-xsnap-1 package. The kernel will no longer send the lockdown/supervisor bundles into the worker; instead they'll be bundled by the worker library itself and fed to xsnap as needed. The worker's DB will include any heap snapshots and transcripts that it needs to satisfy the CDP API, as well as additional calls to perform a "sleeper-agent" full-transcript-replay kind of upgrade (#1691).

The -1 suffix is a version identifier that combines everything which could influence the deterministic behavior. Once you've deployed a vat with the -1 worker, its behavior will remain stable despite any changes on the kernel side. No new versions of swingset-vat-worker-xsnap-1 will change that behavior: at most they will fix/improve things that do not threaten determinism. To make any significant changes (upgrade liveslots, new version of XS, etc), we must release a new package named swingset-vat-worker-xsnap-2, and (baggage-style) upgrade vats to use it. The first swingset kernel package will depend upon the -1 package, the second can depend upon both -1 and -2, the third can depend solely upon -2 (and all vat upgrades must be completed before you can upgrade a deployment from the second kernel package to the third).

Kernel Scheduler

Our improved scheduling plans call for each vat to have an input queue (of deliveries), and an output queue (of delivery results, including message-like syscalls like syscall.send and syscall.resolve). The kernel will also have a single inbound queue for device input events like bridge-inbound and deliver-inbound and something for the timer device. The kernel scheduler will decide which queue to service next, based on criteria like priority, how many messages a vat has emitted recently, how many messages have been delivered into the vat, etc. Hopefully we can find an algorithm that allows short multi-vat operations to complete fairly promptly, but larger ones to be scheduled fairly.

[figure: kernel-scheduler-1]

The scheduler will be invoked multiple times during the processing of a block. The input will be the size of all the queues (2*numVats+1) and the computron budget remaining. The output will be which queue to service, and how many items to process. If the output is "none", the block ends and control returns to the host.
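A hedged sketch of that decision function (the `policy` helper is hypothetical):

```js
// Sketch: one scheduler decision. `queueSizes` has 2*numVats+1 entries
// (per-vat input and output queues, plus the kernel's single inbound queue).
const schedule = (queueSizes, computronsRemaining, policy) => {
  if (computronsRemaining <= 0) {
    return { queue: 'none' };          // block ends, host regains control
  }
  const queue = policy.pickQueue(queueSizes); // priority, recent usage, etc.
  if (queue === undefined) {
    return { queue: 'none' };
  }
  return { queue, items: policy.itemBudget(queue, computronsRemaining) };
};
```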

When a vat's input queue is serviced, the kernel performs a "delivery crank", whose nominal behavior is to pull the VDO from the head of the input queue and deliver it to the vat, which produces zero or more output events, all of which are pushed onto the tail of the output queue. When the output queue is serviced, the kernel performs a "routing crank", which examines the item, routes any sends to some other vat's input queue (or enqueues them on a promise queue in the kernelDB), and processes any resolves by updating the kernel promise table and maybe enqueueing notify events onto input queues (as well as transferring any queued messages to the new target vat). When the kernel's singular inbound queue is serviced, it also performs a routing crank, to route the message to some device or handling vat.

[figure: kernel-scheduler-2]

The scheduler's decisions form a total ordering of all input/output events. This ordering is in-consensus, and each event should produce a hash (like the current crankHash) which can be folded into the shared state, so divergence is detected quickly. The total order is also broken up into blocks, and the kernel does not run (nor can it change state) between the blocks, which is when the consensus algorithm does its voting work.

[figure: kernel-scheduler-3]

That ordering is too constrained to allow much room for performance improvements, but fortunately it is merely the kernel's required order. We can achieve parallel execution between vats, and also perform execution during the voting time, by allowing each vat to run independently, and only observing a partial order that fits within the kernel-wide total order. The vat charges ahead and executes as many deliveries as it can (out of the ones it has been given so far), and the kernel merely refrains from collecting the results until the appointed moment.

To take advantage of this, as soon as the routing crank pushes a VDO onto the tail of vat-A's input queue, it should immediately call workerA.submitInput() with the VDO, instead of waiting for it to reach the head of that queue. That allows the worker to execute the crank in the background, unsynchronized with anything else, well in advance of when the scheduler actually decides to process that item on the input queue. We might fiddle with the nice settings on each worker to influence the OS kernel's execution priority, but in general we just want to fill each worker with work.

[figure: kernel-scheduler-4]

Suppose VDOs 1/2/3 are pushed onto the input queue. All three have been submitInput()ed to the worker, but none have been collected. At some point, the scheduler will process the input queue and "deliver" VDO-1 to the vat. What it really does at that point is to call await collectOutput(1), which tells the worker "ok now I really do need that execution", and blocks if the worker didn't manage to get that far yet. The output is pushed onto the output queue for later processing, however if we can process that output quickly (and submitInput() the results to some other vat quickly), that will keep the pipelines as full as possible.

There is a tradeoff between fairness (which wants to limit the servicing of a deep output-queue), and using parallelism to improve performance (by feeding more work to other vats). There is also a tradeoff between head-of-line latency (which is minimized by only doing one large-scale operation at a time) and performance (having lots of tasks available to keep the pipelines full).

At some point the kernel scheduler decides that we've done enough work for the block. The workers continue on in the background, but the kernel won't call collectOutput() again until the next block. Instead, the kernel calls commit() (in parallel) on all workers with which it interacted during this block. That ensures that all the previous submitInput and collectOutput data will not be forgotten by a worker which crashes in the meantime. Once all workers have completed their commit, the kernel can commit its own DB (which means it will never again submit or collect those steps). Then the kernel can provide results (chain-storage events, etc) back to the host application.
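A hedged sketch of that ordering (names are illustrative):

```js
// Sketch: end-of-block sequence. Worker commits happen first (in parallel),
// then the kernel commit, then results go back to the host.
const finishBlock = async (touchedWorkers, kernelStore, host) => {
  await Promise.all([...touchedWorkers].map(worker => worker.commit()));
  await kernelStore.commit();      // kernel will never re-submit/re-collect those steps
  host.deliverResults(kernelStore.takeBlockResults()); // chain-storage events, etc.
};
```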

When the kernel starts the next block (actually when it sends the first message of the block to any given vat), it can send a retire() to enable the worker to free up the deliveries that are now safely committed by the kernel. (The kernel could do this as soon as it finishes its own commit, but it saves time on the critical path to defer this until after the host app has regained control.)

Rewinding Deliveries

Sometimes, the kernel must be able to rewind a delivery. The main use case we have so far is the dispatch.stopVat just before a vat upgrade: the old vat is stopped, the new vat (with new code) is started, but if the new vat fails to launch, we want to revert the old vat to the point just before stopVat, as if the upgrade were never requested. We can also imagine some forms of metering faults or invariant checks that would justify unwinding a delivery.

The kernel won't know whether or not the delivery should be unwound until sometime after the delivery outputs are collected. But the kernel can refrain from calling retire() until it knows we no longer need to roll back that far.

The vat worker will support revert(backToStepNum) by retaining at least one heap snapshot with stepNum <= highestRetiredStep + 1. When asked to revert, the worker will kill off any current xsnap process, delete all inputs and outputs beyond backToStepNum from its DB, commit the changes, then start up normally. This will load the most recent heap snapshot into a fresh xsnap and resume execution of the saved inputs. Because this re-execution stops "early" (before the now-deleted inputs), we wind up with a worker state equivalent to the old step, ready to proceed in a different direction.
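A hedged sketch of that revert procedure (method names are illustrative):

```js
// Sketch: worker-side revert(backToStepNum).
const revertWorker = async (worker, backToStepNum) => {
  if (backToStepNum <= worker.store.highestRetiredStep) {
    throw Error('cannot revert past highestRetiredStep');
  }
  await worker.killXsnap();                                   // drop in-RAM state
  worker.store.deleteInputsAndOutputsAfter(backToStepNum);
  worker.store.deleteTranscriptEntriesAfter(backToStepNum);   // so they never reappear
  await worker.store.commit();
  // Normal startup: load the latest surviving heap snapshot, then replay the
  // remaining transcript, stopping "early" at backToStepNum.
  await worker.startFromLatestSnapshot();
};
```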

(Another option is to not restart the worker right away, and instead wait until the kernel scheduler decides to submit a new input or collect a previous output. However that might induce latency in a later block that could have been paid for during earlier idle time.)

Rewinding a worker is not cheap: it takes about as much work as evicting the worker (i.e. to limit concurrent memory consumption) and immediately restarting it. That expense must be considered when deciding how to implement higher-level fault management procedures.

Vat Upgrade, Worker Replacement

Our "baggage-style" vat upgrade process works by shutting down the old worker with a stopVat() delivery, to let it flush any remaining data to its vatstore DB, then launching a new worker with new code, and sending it a startVat() to spin everything back up. The two workers have independent transcripts and run different code: any of SES/xsnap/supervisor/liveslots/vat-bundle/contract-bundle can be changed across the upgrade. Only the vat's identity and the vat store survive (plus all the kernel-side data like c-lists and responsibility for exported kernel objects/promises).

To complicate matters, if either the stopVat or startVat fail, we want to abandon the upgrade, and rewind everything to the point just before the stopVat. So the upgrade process must retain enough data to support the rewind.

Within the CDP model, the CDP state is retained (with modifications), but the CDP runtime worker is replaced.

I think that means our worker must be aware of upgrade, and must be prepared to cooperate with a new worker version (swingset-vat-worker-xsnap-2). The two workers will share a SQLite DB, and the -2 version must not make any changes/migrations that would prevent the -1 from taking over, at least not until the new version is known to be successful.

This suggests:

We may also want an API to report back the transcripts of each upgraded version, concatenated together, to facilitate a sleeper-agent -style "deep replay" upgrade.

warner commented 1 year ago

Some issues raised in today's kernel meeting:

warner commented 1 year ago

Messages pipelined to a promise are queued into the deciding vat, but that vat might resolve the promise (to an object on some other vat) before the delivery finally executes. Currently we detect this just before the moment of delivery and requeue the VDO onto the target-object owning vat's queue. We've talked about proactively yanking all such VDOs out of the deciding vat's queue at the moment of resolution instead.

For this scheme, yanking it from the worker's queue is more difficult. One possibility is for the deciding vat to react to VDOs that target an unknown promise to just reflect them right back out into a syscall.send. The vat previously knew the vref (vpid), but forgot it when it did the resolve: we may have to revisit that once-synchronous logic in the face of async deliveries (just like we'll need to make the GC calls async-safe).

warner commented 1 year ago

vref/kref translation happens on the kernel side, independent of the time of delivery/syscall, which changes our synchronous execution model:

mhofman commented 1 year ago

In practice, we have two kinds of device-node interaction:

  • chain-storage writes
  • bundlecap getBundle reads

We actually have some other ones, like timer polls, which do seem to rely on a synchronous result.

I also see some devices like command, loopbox and plugin, but I'm having a hard time figuring out if/where those are used. Maybe @michaelfig has more context?

warner commented 1 year ago

Yeah this approach would require changes to our syscall/device model.

Bundlecaps would need to be rewritten as read-only immutable "blobcaps", described above, instead of using callNow().

The timer device is no longer intended for general access: only the timer vat should get a reference, and all userspace code should interact with the timer vat instead.

These other devices are pretty specialized and would need new approaches.

I think I wrote the first version of command at a hackathon and then @michaelfig took it over as he built out the ag-solo and REPL systems. loopbox is probably still my responsibility. @michaelfig wrote plugin and still owns it.

warner commented 1 year ago

Worker State

Let's look at what state each worker needs to record, and what features that state is supporting.

Delivery Execution

Vats live inside orthogonally-persistent JavaScript runtimes. The primary state is contained within the object graph (the JavaScript "heap") of a running XS engine (although see the "Vatstore" section below). To process a delivery like dispatch.deliver() or dispatch.notify(), we need an XS engine running which contains the vat bundle, the "liveslots" support code, a SES implementation, and a variety of connecting layers (the "supervisor"). This all lives in a separate program named xsnap. We have one running instance of xsnap per online vat worker, and its state evolves one delivery at a time, as it executes each one to completion.

[figure: deliveries and in-RAM xsnap worker state (Parallel execution - Frame 12)]

If this program could remain running forever, we might not need anything else. But sooner or later the host computer will get rebooted, and/or there might be too many simultaneous workers and we need to shut some down to make room for more active/important ones. So we need a way to preserve the xsnap internal state. Fortunately, xsnap can be instructed to write its internal state (the JavaScript heap / object-graph) to a file, and later we can launch a new xsnap from this "XS heap snapshot" instead of starting from zero.

Writing a heap snapshot is expensive, so we don't do it after every delivery. Instead, we record a "transcript" of what happens during execution, allowing us to snapshot only once every few hundred deliveries. When we need to reload the worker, we start from the most recent snapshot, then re-execute the transcript to bring the worker back up-to-date, relying on the strictly deterministic execution model of our vats. Each transcript entry allows a unidirectional transition from one xsnap state to the next (like P-frames from video compression), and the heap snapshot provides the starting point (like an I-frame).

The transcript entries include both the delivery and a record of all the syscalls made by the worker during its execution, plus their results. Syscalls (especially syscall.vatstoreGet(key)) can read data from sources that might change later, so it is important for the transcript to remember the original answers. When we replay a transcript entry, the syscalls are simulated, not executed. For syscall.vatstoreGet(key), the replay manager fetches the old answer from the transcript and returns it to the worker. The worker is unaware of the difference between a replay and the original.
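A hedged sketch of that replay path (the transcript-entry shape is illustrative):

```js
// Sketch: replay one transcript entry. Each entry records the delivery plus
// the syscalls it made and their original results.
const replayEntry = async (xsnapWorker, entry) => {
  let i = 0;
  const simulateSyscall = request => {
    const expected = entry.syscalls[i];
    i += 1;
    if (JSON.stringify(request) !== JSON.stringify(expected.request)) {
      // divergence from the original execution ("anachrophobia")
      throw Error('replayed syscall diverged from transcript');
    }
    return expected.response; // e.g. the original vatstoreGet answer
  };
  // The worker cannot tell this is a replay: syscalls are answered from the
  // transcript, never executed against live state.
  await xsnapWorker.deliver(entry.delivery, simulateSyscall);
};
```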

The heap snapshot interval drives a tradeoff between normal-runtime performance (more frequent snapshots slow down delivery execution) and restart performance (more transcript entries to re-execute).

[figure: deliveries, RAM worker, transcripts, heap snapshots (Parallel execution - Frame 13)]

We want to do as little replay work as possible, so we always start from the most recent snapshot available. That means each time we write out (and commit) a new heap snapshot, we can delete the older transcript entries, as well as any older snapshots.

"Sleeper Agent" replay-based upgrade

We plan to support two forms of vat upgrade. We have implemented the first, named "baggage-style", which uses the adminFacet~.upgrade(newBundleCap) method, and discards all non-durable state. But in case that doesn't work, we're trying to retain enough information to perform a "sleeper-agent" / "replay-style" upgrade (#1691), in which the entire history of the vat (across all baggage-style upgrades) is replayed. To support this, we need to retain the deliveries (VatDeliveryObjects) forever.

This is independent of retaining transcript entries for worker-reload/replay purposes. We do not retain syscalls or syscall results forever, because a future sleeper-agent -style replay can make different syscalls.

revert()

Sometimes, we need to roll back a vat to some recent state. There are two circumstances that might require this.

The first is a failed baggage-style upgrade. We currently implement these upgrades with a dispatch.stopVat() to the old version, then a dispatch.startVat() to a new worker running the new version. If either one fails somehow, we rewind everything back to the state of the old version, just before it received the stopVat. This is currently managed by the kernel.

The second would be if an arbitrary delivery causes some sort of invariant or metering failure, and we decide the best course of action is to pretend that delivery never happened. We don't have any way to express this right now without also terminating the vat: both metering failures and vats requesting self-termination (vatPowers.exitVatWithFailure(), causing syscall.exit()) will unwind the syscalls and then delete the vat entirely. But we can imagine a vatPowers.abortDelivery() that would revert to the previous state and stick around for later deliveries, rather than terminating the whole vat.

If we support this, we'll need to retain heap snapshots and transcript entries for longer than before. We wouldn't keep them around forever (some new retire() API would indicate when we don't need a given state anymore, allowing the worker to delete the old supporting data). But we'd need the heap snapshots back to the one just before the oldest revertable state, plus all transcript entries forward from that point.

We don't anticipate needing to revert a vat back more than a single delivery: the execution model becomes much more difficult to reason about otherwise. We might choose to disallow revert() altogether, except for the vat-upgrade case (and even then it wouldn't be a general-purpose facility: we might use a single shared DB transaction for both the stopVat and startVat, so unwinding the upgrade simply aborts that txn).

[figure: transcripts, snapshots, retirement schedule (Parallel execution - Frame 14)]

Vatstore

To reduce RAM requirements for vat workers, we try to keep all the inactive high-cardinality state on disk. Vats use "virtual objects" via APIs like defineKind(), and "virtual collections" (e.g. makeScalarBigMapStore()), as well as durable variants that survive a baggage-style vat upgrade. Programmers get a model in which these look like in-RAM objects, but behind the scenes we page them in when referenced, and out when released.

These live in a key/value backing store named the "vatstore", accessed with APIs syscall.vatstoreGet(key) and syscall.vatstoreSet(key, value). The vatstore also holds refcounts and import/export status of otherwise ephemeral references (e.g. "is there any virtual data which retains a reference to imported Presence o-4?"). Each vatstoreSet()/vatstoreDelete() causes an UPDATE to the vatstore DB during delivery execution, and these updates are gathered into a single DB transaction that can be committed at the end of a delivery. The vatstore DB thus always contains the state after the most recent delivery.

Worker replay (e.g. when bringing a worker back online) does not reference the vatstore: it uses the recorded vatstoreGet responses from the transcript instead (since the real DB contents may be too new to be accurate).

Reverting the vat to a previous state requires that the vatstore also be reverted to a previous state.

One approach would be a multi-version vatstore:

Another approach would be to record an undo buffer (thanks to @Tartuffo for the insight):

The multi-version vatstore adds more overhead to queries and modifications, and would take some experimentation to see how much that affects performance. The undo/redo buffer might be more efficient if the buffer remains short (we retire deliveries quickly, giving up the ability to revert() very far), and may provide better support for worker-state snapshots (see below).

[figures: vatstore, reverse deltas, retirement window (Parallel execution - Frames 15 and 16)]

Worker-state snapshots

The final constraint comes from the need for Tendermint/Cosmos "state-sync" snapshots. These enable a brand new validator to download recent data from other validators, check hashes to ensure its correctness, then initialize their state from the download instead of replaying the entire (months/years-long) history of the chain.

Chains which keep their entire state in the Cosmos-SDK IAVL tree will get this for free, because IAVL is a multi-version Merklized key-value DB. The root hash of every version is included as the block's AppHash, providing validation of the entire KV store. The SDK picks one block height out of every N (perhaps 10k, roughly once per day) to retain as a possible snapshot point. After this block has passed (and in the background), old validators write a serialized copy of that version into a file (actually a bunch of 10MB compressed segments), named after the root hash. New validators ask around for what snapshots are available, learn about the segments, download them, populate a new IAVL tree at that height, compare its root hash against the most recent block, and then launch from that point. They must replay all chain transactions (i.e. all blocks) since that point, but the frequency of snapshots means they won't have to replay more than a day's worth of blocks, rather than every block since the genesis. This makes validator startup take time O(N) in the size of the current state vector (more or less dominated by the segment download time, plus some relatively-fixed one-day replay time), rather than O(N) in the age of the chain. This is analogous to what we do for vats with XS heap snapshots and transcript replay, but hashed and validated so it's safe to fetch the data from untrusted sources.

Our chain holds considerable state outside the IAVL tree. To add our vat worker state into this system, we need several features:

(kernel state must be included too, but we'll defer that to later: there's only one kernel, so the problem should be easier to solve)

First, we define the overall state to include only the deliveries that have been collected by the kernel within that particular block. The worker may have seen newer "head start" deliveries, which may or may not have been executed yet, but the state-sync snapshot will ignore these. The new validator's kernel will be responsible for re-submitting them when it starts the new worker.

Now, to rebuild a worker at that state-sync point, we need:

Of these, old vatstore state is the most complicated. The following design isn't particularly elegant, but might work:

[figure: Parallel execution - Frame 17]

The multi-version vatstore DB (with the added/removed columns) might also enable extraction of deltas. The need to hash these deltas quickly (performed on every block, probably during the commit() call) is a hassle, because it would be nice to not touch the extra data at all until someone asks for a snapshot. But we can imagine an extra table which records changes to the primary one (including things like setting removed on an existing row), which behaves like a forward-delta of the table state, and is enough to reconstruct the current version from an earlier snapshot. We would serialize the new rows of this table to compute the validation hash that goes into each block.

The new validator's bringup work would be minimized if we could somehow perform an xsnap JS heap snapshot on all vats at the same time, as that would truncate the transcripts that need to be replayed. However we don't really learn about state-sync snapshot points ahead of time, plus there could be hundreds of vats which are offline (so bringing them online just to write out a fresher snapshot would be pretty expensive). So we should probably just make sure we can handle an XS snapshot made at one point, and a vatstore snapshot made at an unrelated point, and be able to rebuild the overall worker state for any point within the window.

A third approach might be to record a vatstore snapshot at the same time as we write out the XS snapshot. Then, to rebuild a later state, we perform a special kind of replay that draws from the deliveries, not the transcript entries, and uses the real vatstore DB. syscall.vatstoreGet() really reads from the DB, vatstoreSet really mutates the DB, but syscall.send() is still ignored.

This would increase the periodic cost of snapshot writes (dumping the whole vatstore is even more expensive than writing an XS snapshot), but would be simpler, and wouldn't require forward deltas. The new validator would have to replay more expensive work (delivery execution, rather than merely applying DB deltas), but perhaps there would be less of it.

Kernel-state snapshots

The kernel state includes run-queues, c-lists, object/promise tables, and meta-information about each vat. It does not include vat heap snapshots, vatstore contents, completed deliveries, or transcript entries.

The "head-start" deliveries (where the kernel has committed to the order of their delivery, and has provided the worker with a copy so it can get ahead, but has not yet collected the results) live in the kernel's queues (in particular the per-vat input queue), and will be (re-)submitted to workers in a brand new validator.

We never need to revert the kernel state (we only commit on block boundaries, and our chain has finality, so we never unwind a block), making this task simpler than with vats.

I think we can follow a similar approach: periodically write out the entire kernel DB into a single snapshot, write out a forward delta after every block, hash the most recent snapshot plus the hashes of the deltas to compute a "kernel state hash". We include both the kernel state hash and the vat worker hashes into a single swingset hash, and submit that to the cosmos-sdk side (i.e. write it into an IAVL slot) at the end of every block.
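A hedged sketch of folding those hashes together (the hash construction here is illustrative, not a spec):

```js
import { createHash } from 'node:crypto';

// Sketch: combine the kernel state hash with per-vat worker hashes into a
// single swingset hash, written into an IAVL slot at the end of each block.
const computeSwingsetHash = (kernelStateHash, vatWorkerHashes) => {
  const h = createHash('sha256');
  h.update(kernelStateHash);
  const sorted = [...vatWorkerHashes.entries()].sort(([a], [b]) =>
    a.localeCompare(b),
  );
  for (const [vatID, workerHash] of sorted) {
    h.update(vatID);
    h.update(workerHash);
  }
  return h.digest('hex');
};
```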

The vat worker hashes will only change for vats that have been touched during a block: delivery execution does not change their state hash, but collecting results does.

If/when someone asks for a state-sync snapshot, the kernel must spend some time (in the "background") pulling deltas from the DB and writing them to retrievable chunks. The previously-published hash can be used to validate the contents of these chunks.

warner commented 1 year ago

The multi-version DB is related to the concept of "soft deletes". I saw an article (and some discussion) that described some common practices for this, and one thing that jumped out at me was the idea of using a separate VIEW to hold the "current version" table, to reduce the chances of a mistake (if accessing code fails to pay attention to the deleted_at column, it would get corrupted data). You'd build a view that knows about a specific version, all code that isn't trying to do something special goes through that view, and the raw table is only used by special-purpose code (e.g. pruning the old versions).
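A hedged sketch of that view-based layout (assuming a better-sqlite3 binding; table and column names are illustrative):

```js
import Database from 'better-sqlite3';

// Sketch: the raw table keeps every version; a VIEW exposes only live rows,
// so ordinary code cannot accidentally read deleted/stale versions.
const db = new Database(':memory:');
db.exec(`
  CREATE TABLE vatstore_all (
    key TEXT,
    value TEXT,
    added_at INTEGER,    -- deliveryNum that wrote this version
    deleted_at INTEGER,  -- NULL while this version is current
    PRIMARY KEY (key, added_at)
  );
  CREATE VIEW vatstore AS
    SELECT key, value FROM vatstore_all WHERE deleted_at IS NULL;
`);
// Normal reads go through the view.
const getValue = db.prepare('SELECT value FROM vatstore WHERE key = ?');
// Pruning old versions is special-purpose code that touches the raw table.
const pruneBefore = db.prepare(
  'DELETE FROM vatstore_all WHERE deleted_at IS NOT NULL AND deleted_at <= ?',
);
```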

mhofman commented 1 year ago

Swingset Kernel as a CDP

One constraint is that the outputs/chainSend calls are not allowed to have return values. The most likely example would be a chain-storage read, however I don't think we currently do any of those (there might be a method to return the externally-queryable chain-storage IAVL/RPC path: we'd need to implement that in different way).

The chainSend logic is split in cosmic-swingset, implemented in chain-main.js, but used through 5 objects passed to launch in launch-chain.js:

The first 3 use the results from chain sends, however only actionQueue actually uses the result synchronously. https://github.com/Agoric/agoric-sdk/pull/6741 looks at making chain sends buffered to enable execution of swingset while cosmos is busy and in a non-deterministic state.

warner commented 1 year ago

@FUDCo and I were talking about how to get vatstore operations out of the transcript, and how the main reason they're in there is so that we can replay a vat from the snapshot point back up to the present without consulting the vatstore data (which is too new to be referenced by the vatstoreGet calls). He had a really clever idea: instead of the local vatstore DB always holding the latest state (plus deltas to rewind it if necessary), what if we just don't commit until the snapshot point? If the worker is restarted, we have to replay from snapshot anyways, and we have to do all the work of regenerating the vatstore operations anyways, so maybe we just don't remember them so persistently.

That would look like:

We'd wind up with a bunch of outstanding state in the WAL file, which degrades performance when it gets too big, so we'd need to measure that and see how bad it gets. It would be limited by the computrons-before-heap-snapshot limit, which might be sufficient to keep the perf degradation low enough.

A tricky part would be that all other state (especially whatever transcript parts need to be stored) must go into a different DB, which we do commit after each delivery. And that opens up the question of hangovers when we crash between the commit points of the two DBs.

But this might be much simpler than trying to build a versioned DB out of SQLite, either by using a schema with generations and deletion tombstones, or by accumulating a bunch of forward/reverse deltas.

mhofman commented 1 year ago

I love the idea.

My main concern is ensuring consistency on replay and whether we want to abandon that. Right now we quickly detect XS or liveslots bugs because they result in vatstore anachrophobias. Without vatstore in transcript, the execution will still diverge but we'll only notice later when something more drastic happens (e.g. different promise resolution order). This may be mitigated by putting delivery metering under consensus check.

I'm wondering however if we shouldn't keep a separate non-exported/local vatstore transcript which could be used to enforce consistency on replay. This could be optional and enabled only on nodes we run? It would also allow us to generate "full transcripts" so that we can replay in tools without implementing an actual vatstore.

mhofman commented 1 year ago

that opens up the question of hangovers when we crash between the commit points of the two DBs.

I gave this some thought, and along with the need to replicate the vatstore in vstorage, this actually makes things a bit more complicated than I hoped.

Since I believe we should look at moving vatstore in its own DB separately / before parallel execution, I've written up the details in the existing issue for this: https://github.com/Agoric/agoric-sdk/issues/6254#issuecomment-1467051566

FUDCo commented 1 year ago

Notes on parallel execution of vats

This is my take on doing this, largely written (at Brian's suggestion) without close study of the discussions above so as to avoid unconsciously prejudicing my thoughts and possibly missing problems or solutions.

Overview

Vats interact with the kernel at three points: dispatch, syscalls, and crank completion.

Dispatch and crank completion are paired:

(1) The kernel transmits a delivery to the vat. The delivery typically, though not necessarily, represents a message from another vat, but it can also represent requests or notifications from the kernel directly. All deliveries originate with entries on the kernel's run queue, which are transmitted to vats in FIFO order (modulo having multiple run queues with different priorities, which, while affecting the order in which messages get delivered, does not affect the fundamental logic of the kernel-vat interaction -- plus, of course, multiple prioritized kernel run queues are not yet actually implemented).

(2) The vat executes, while the kernel simply waits (albeit processing syscalls, about which more shortly).

(3) The vat transmits a completion to the kernel. The completion indicates the status of the dispatch (i.e., success or failure) along with a precis of resources (computrons) consumed.

At any given time, either the kernel is executing or one particular vat is executing, even though the kernel and the vat (in fact, all the various vats) may each be in separate processes (or, potentially, even on separate machines). This costs us in two ways: First, only one thing at a time gets to execute, even though we have the resources to execute N things at a time. Second, each exchange of control from kernel to vat or from vat to kernel introduces significant, unavoidable latency (this latency applies to syscalls as well as to the dispatch/completion handshake).

The opportunities for parallelism here should be obvious, both between the kernel and the vat(s) and among the various vats themselves. The highly synchronous nature of the current vat-kernel interaction exists mostly to simplify implementation, principally in service of preserving determinism in the order of execution of the work specified by the run queue and in the management of vat state that is held in the kernel's database. However, it should be possible to achieve the same ordering through cleverness rather than brute force, and thereby enable things to run considerably faster.

The sequence of deliveries transmitted into a vat can be treated as a FIFO stream. The sequence of completions transmitted out of a vat can also be treated as a FIFO stream. As long as the FIFO ordering is maintained*, there is no reason in principle that vats cannot be made free running and no reason the streams themselves cannot be pipelined. The things in the current implementation that would prevent this from just working out of the box are few and enumerable: (1) syscalls, (2) platform failures (either of processes or of the communications channels between them), and (3) cross-vat event ordering.

(* There is an additional possible complication in ensuring that the vat itself maintains the proper FIFO ordering without engaging in "causal lookahead"; this will be addressed in a separate section below.)

Syscalls

Syscalls represent the other important point of kernel-vat contact aside from delivery and completion. However, the syscalls that a vat makes do not, for the most part, actually require synchronous coordination, since they do not typically have a return value. They can, instead, be viewed simply as outputs from the vat that could, in principle, be bundled with the completion (though potentially taking advantage of pipelining opportunities, i.e., start transmitting the "completion" before it is actually complete). The exceptions to this are (a) the various vatstore requests, which currently require synchronous access to the kernel database, and (b) the 'invoke' syscall, which currently entails a synchronous interaction with the device being invoked.

We can address (a) by giving each vat its own vatstore database. Since operations on one vat's vatstore do not interact in any way with another vat's interactions with its vatstore (nor do vatstore operations interact in any way with the kernel's own access to its own durable state) this ought to be straightforward. It also provides a further opportunity to speed up computation since the logically independent database operations of unrelated entities can then be allowed to overlap arbitrarily without consequence.

Addressing (b) is more complicated. I can see two general strategies for dealing with this: (i) make all device interactions fully asynchronous, thereby allowing them to be made an ordinary part of the regular dispatch/completion flow, or (ii) bind a device exclusively to a given vat for the duration of their interaction, so that even though their interaction is nominally synchronous, it does not flow through the kernel (or other vats) while it is happening -- even if this entailed an interprocess handshake between the vat and the device's pseudo-vat (much like the current vat-kernel interaction), it would still be a net win because most vats don't interact with devices at all and even those that do don't do it much compared to the volume of other things that they do.

Platform failures

I believe issues of process or communications failure can be dealt with by adopting a Waterken-like protocol between kernel and vat. Although the specific contents and meaning of the communications between vat and kernel are quite asymmetric, the communications relationship itself can be largely symmetric: each side retains a record of what it has transmitted, which it retains until it receives acknowledgement that its transmissions have been successfully received and durably recorded by its counterpart (we can leave open for now the design question of whether the acknowledgements themselves are specific acts or whether they piggyback on top of the other normal acts of communication that flow between the two parties). If one side or the other of the relationship crashes, upon restart and reconnection the first thing each party does is communicate to the other the identity (e.g., sequence number) of the last thing it remembers successfully receiving, whereupon its counterpart resumes transmission from the point in the stream corresponding to the first thing that was lost.

As long as vats are engineered to atomically record their end-of-crank state together with their crank completion communications, I think the issues of database commit synchronization between vat and kernel are handled correctly without any additional special logic.

If the communications channel between kernel and vat is broken and then reconnected (which I don't think is currently a thing that can happen but that could change) all the same logic should apply -- from the kernel's perspective, a vat restarting and a vat reconnecting in many ways look the same. An unanswered design question is whether we would ever even want to have such a "reconnection" operation and if so which side of the kernel/vat relationship would be responsible for doing it. In particular, if a vat loses contact with the kernel, I think the best course is probably for it to unilaterally kill its own process (abandoning whatever crank is running) and wait for the kernel to restart it, rather than trying to do some kind of special communications resumption dance.

A related open question is what the appropriate policy for dealing with actual kernel failure (as opposed to mere connection loss) should be. If the kernel process dies unexpectedly, what should the vats do? My intuition is that this should be treated by them as a form of loss-of-communications and so follow the corresponding exit-on-connection-loss strategy (i.e., ALL the vats kill themselves). When the kernel restarts, it can assume there are no vats running and that part of its job will be to restart all the vat processes. This seems like the cleanest approach, since in this world we don't have to make the kernel responsible for surveying its environment on startup to determine what other processes are running. However, it does raise one dangling issue that I'm not entirely sure how to deal with: what do we do if the kernel dies, restarts, and attempts to restart the vats, all before some vat processes have noticed and are therefore still running? I think some kind of incarnation number scheme may be able to deal with this; the main problem that I see is that the (old) running vat process still has possession of the vat's database. This may be the biggest unsolved problem in all of this.

Another piece of making vat processes more autonomous is that alongside moving the vatstore into the vat's own database, I think we should also make the vat itself responsible for its snapshot and transcript stores. The jobs of loading from a snapshot and replaying from a transcript should be given to Liveslots rather than the kernel, with the kernel's role in vat restart being to initiate the operation rather than to orchestrate it.

Another related open question is how aggressively a vat should record the communications that are sent to it. Since, in the simplest version of the scheme I'm proposing, deliveries are immediately streamed to vats, the vats could, in principle, durably record those messages as soon (more or less) as they are received, thus minimizing retransmission on restart (I believe, possibly mistakenly, that this is the approach Waterken follows). However, since we expect vat process failures, and therefore restarts, to be very rare compared to normal, successful cranks, it's probably not the ideal strategy to optimize for the restart case. Further, since the communications pathway is likely an IPC connection rather than, say, a TCP session running over the internet, the communications channel is probably highly reliable as well. Thus it might make the most sense for a vat to record an incoming delivery at the same time it records the associated crank completion. I believe this would maximize speed. To my sensibilities the primary virtue of more aggressive message capture is to allow a vat's message backlog to pile up on disk rather than consuming RAM, which might minimize the need for more complicated backpressure schemes.

Cross-vat event ordering

In the world described above there is no longer a component that corresponds to the kernel run queue per se, at least as it has been hitherto realized. Instead, each vat would have its own local input queue. However, the kernel does need to track not only the order of deliveries on a per-vat basis but the total order of deliveries across all vats, so as to maintain the illusion of synchronous execution of the (now entirely virtual) run queue. For each delivery that is issued to some vat, the kernel will expect to eventually receive a corresponding completion, and these completions must be made to appear to happen in the same order as the original deliveries were issued, even if they in fact arrive back at the kernel in an entirely different order.

For example, if the kernel sends message #1 to vat A, then message #2 to vat B, then message #3 to vat A, but then vat A reports completions for messages #1 and #3 prior to when vat B reports completion for message #2, the kernel must nevertheless act on these completions in order 1, 2, 3. This is because acting on these completions may entail issuing additional deliveries. Message order must be preserved so that even if kernels run by different validators see the completions of the same messages arrive in different orders, the chain consensus will still reflect a common ordering. I'm sure we will need some kind of interesting kernel data structure for this, which I look forward to programming.
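
One minimal shape for that data structure might be a reorder buffer that holds out-of-order completions and releases them strictly in delivery order; the sketch below is hypothetical, just to illustrate the idea.

```js
// Hypothetical reorder buffer: completions may arrive in any order, but are
// acted upon strictly in the order their deliveries were issued.
class CompletionSequencer {
  constructor(onCompletion) {
    this.onCompletion = onCompletion; // kernel action to run, in order
    this.nextToRetire = 1;            // delivery number we must act on next
    this.pending = new Map();         // deliveryNum -> completion, arrived early
  }

  deliveryCompleted(deliveryNum, completion) {
    this.pending.set(deliveryNum, completion);
    // Retire as many consecutive completions as are now available.
    while (this.pending.has(this.nextToRetire)) {
      const c = this.pending.get(this.nextToRetire);
      this.pending.delete(this.nextToRetire);
      this.onCompletion(this.nextToRetire, c);
      this.nextToRetire += 1;
    }
  }
}

// The example from the text: deliveries 1 and 3 (vat A) complete before 2 (vat B).
const seq = new CompletionSequencer((n) => console.log(`retire #${n}`));
seq.deliveryCompleted(1, {}); // retires #1
seq.deliveryCompleted(3, {}); // held back: #2 not done yet
seq.deliveryCompleted(2, {}); // retires #2, then #3
```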

Preventing causal lookahead

If in some cases deliveries can get pipelined to a vat more quickly than it can process them, the vat has the opportunity to look into the future of its message stream prior to handling any particular message. This is almost certainly a thing we don't want to allow. However, we already trust Liveslots to represent the kernel's interests even though it runs on the vat side of the kernel/vat process boundary. One of the things that Liveslots will have to do is ensure that a given crank execution releases agency prior to the next delivery being given to user code. This is probably the path of least resistance for the Liveslots implementation anyway (i.e., it's how things would likely end up working even if we weren't paying attention), but we should nevertheless remain aware of the need to maintain this invariant, especially if we start getting very clever in our efforts to speed things up.
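
As a sketch of what that invariant could look like in the vat-side supervisor code (hypothetical, not the current liveslots implementation): deliveries may be streamed in and buffered early, but user code is only handed the next one after the previous crank has fully settled.

```js
// Hypothetical vat-side gate: deliveries may be streamed in early, but user
// code only ever sees the next one after the previous crank has released agency.
function makeDeliveryGate(dispatchToUserCode) {
  const backlog = [];
  let busy = false;

  async function pump() {
    if (busy) return;
    busy = true;
    while (backlog.length > 0) {
      const { delivery, resolve } = backlog.shift();
      // Await full quiescence of this crank before exposing the next delivery.
      const completion = await dispatchToUserCode(delivery);
      resolve(completion);
    }
    busy = false;
  }

  // Returns a promise for the crank completion of this delivery.
  return function enqueue(delivery) {
    return new Promise((resolve) => {
      backlog.push({ delivery, resolve });
      pump();
    });
  };
}
```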

mhofman commented 1 year ago

That analysis seems mostly in line with my own understanding.

(ii) bind a device exclusively to a given vat for the duration of their interaction

That is a very interesting suggestion. We had already considered two types of devices: read-only data (e.g. bundles) and write-only (async). This would effectively give us back read-write devices under certain conditions. We could start with devices that are only ever allowed to be in the clist of at most one vat, before moving on to a more complicated dance of acquiring a "lock" (which could be layered on top of single-vat ownership).
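
A possible shape for that simpler starting point, sketched with hypothetical names (not an existing kernel API): a registry that allows a device to be bound to at most one vat at a time, with synchronous invocation permitted only while the binding holds.

```js
// Hypothetical sketch: a device may appear in the clist of at most one vat at
// a time, with an explicit acquire/release step around any synchronous use.
class ExclusiveDeviceRegistry {
  constructor() {
    this.owners = new Map(); // deviceID -> vatID currently bound, if any
  }

  acquire(deviceID, vatID) {
    const owner = this.owners.get(deviceID);
    if (owner !== undefined && owner !== vatID) {
      throw Error(`device ${deviceID} is already bound to ${owner}`);
    }
    this.owners.set(deviceID, vatID);
  }

  release(deviceID, vatID) {
    if (this.owners.get(deviceID) !== vatID) {
      throw Error(`device ${deviceID} is not bound to ${vatID}`);
    }
    this.owners.delete(deviceID);
  }

  // Synchronous invocations are only permitted while the binding holds.
  checkInvoke(deviceID, vatID) {
    if (this.owners.get(deviceID) !== vatID) {
      throw Error(`vat ${vatID} may not invoke ${deviceID} synchronously`);
    }
  }
}
```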

Another piece of making vat processes more autonomous is that alongside moving the vatstore into the vat's own database, I think we should also make the vat itself responsible for its snapshot and transcript stores. The jobs of loading from a snapshot and replaying from a transcript should be given to Liveslots rather than the kernel, with the kernel's role in vat restart being to initiate the operation rather than to orchestrate it.

Strongly agreed, except for the detail that I see "liveslots" as the piece of the vat logic that runs inside the JS engine executing the vat's user code, and as such it would not be the party responsible for handling that part of the vat state. I do see the need for some vat logic that runs independently of the kernel and handles communication with it. Where this code executes is TBD. It would be great to have it run in the same process as the JS engine running the vat code, to avoid further transmission overheads (this is one reason I'm somewhat looking at running XS as a wasm module).

As long as vats are engineered to atomically record their end-of-crank state together with their crank completion communications, I think the issues of database commit synchronization between vat and kernel are handled correctly without any additional special logic.

I think it may not be that simple. The main problem is that the vat and kernel state would live in two separate processes, each with its own independent DB connection (and likely its own DB entirely). As such there is no single atomic commit point; we basically have to deal with hangover inconsistency between the kernel and vat commit points.
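
One possible discipline for coping with that (a hedged sketch with hypothetical APIs, not a settled design): the vat commits its crank state and completion record before reporting the completion, and the kernel records having acted on a completion before its own commit, so a crash on either side results in a retransmission or a re-sent completion rather than divergent state.

```js
// Hypothetical ordering of the two commit points, one per process: the vat
// never reports a completion it has not durably recorded, and the kernel
// never commits without recording which completions it has acted on.
async function runCrank(vatDB, delivery, execute, reportToKernel) {
  const completion = await execute(delivery);
  vatDB.recordDelivery(delivery);    // same transaction as the crank state
  vatDB.recordCompletion(completion);
  vatDB.commit();                    // vat commit point
  reportToKernel(completion);        // only after the vat commit
}

function onCompletion(kernelDB, deliveryNum, completion, applyToKernelState) {
  applyToKernelState(completion);
  kernelDB.recordRetired(deliveryNum);
  kernelDB.commit();                 // kernel commit point, strictly later
}
```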

but the total order of deliveries across all vats, so as to maintain the illusion of synchronous execution of the (now entirely virtual) run queue

I think this is mostly the concern of #5025. In that world I don't believe there should remain a presumptive total ordering of messages. There must be a deterministic effective ordering, but it does not have to be dependent on the relative order of deliveries between vats.

For example, if the kernel sends message #1 to vat A, then message #2 to vat B, then message #3 to vat A, but then vat A reports completions for messages #1 and #3 prior to when vat B reports completion for message #2, the kernel must nevertheless act on these completions in order 1, 2, 3.

I would argue that order "1, 3, 2" or "2, 1, 3" are just as valid. The important part is that this order must be deterministic between validators. In the way I see it, the host embedding swingset will have some input in deciding the effective order between vats.

FUDCo commented 1 year ago

I would argue that order "1, 3, 2" or "2, 1, 3" are just as valid. The important part is that this order must be deterministic between validators. In the way I see it, the host embedding swingset will have some input in deciding the effective order between vats.

Indeed, both orders are arguably valid, but in order to achieve determinism we need an objective rule for all validators to follow that everybody can understand, and matching delivery order seems like a good, clean one. But my main thinking was that this scheme makes the overt behavior of the parallelized swingset match the current implementation, so that the apparent nature of causality (as exhibited in slogs and other historical records) does not change at the point when we switch over to a parallelized implementation. It also allows us to swap in a non-parallelized implementation for debugging purposes without disruption.

mhofman commented 1 year ago

I expect the switch away from a global run queue to per-vat queues to happen before this parallel-execution change.

That said, a mode in which the actual vat executions are globally sequential and match the deterministically chosen kernel/host order for processing the vats' input/output queues is indeed valuable for debugging, and is likely a matter of local configuration (given that we take this as a requirement in our implementation).

Edit: To summarize my point of view, I expect that any observable changes to the processing order of messages will be a consequence of moving to per-vat queues, and that subsequently switching to parallelized execution will not introduce further changes to the order of message processing.

warner commented 1 year ago

Note to self: the VBANK_GRAB device invocation returns a balance of some sort (maybe of the account being modified?). I don't know if we use the result, but if so, that might be a device read which interferes with this scheme, and we might need to find a way to do without the result.

e.g.

{"type":"syscall","crankNum":458161,"vatID":"v10","deliveryNum":39106,"syscallNum":2,"replay":false,"ksc":["invoke","kd30","callOutbound",{"body":"#[\"bank\",{\"amount\":\"6300000\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\",\"sender\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"type\":\"VBANK_GRAB\"}]","slots":[]}],"vsc":["callNow","d-70","callOutbound",{"body":"#[\"bank\",{\"amount\":\"6300000\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\",\"sender\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"type\":\"VBANK_GRAB\"}]","slots":[]}],"time":1687470826.455111,"monotime":272605.152155123}
{"type":"syscall-result","crankNum":458161,"vatID":"v10","deliveryNum":39106,"syscallNum":2,"replay":false,"ksr":["ok",{"body":"#{\"nonce\":569157,\"type\":\"VBANK_BALANCE_UPDATE\",\"updated\":[{\"address\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"amount\":\"0\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\"}]}","slots":[]}],"vsr":["ok",{"body":"#{\"nonce\":569157,\"type\":\"VBANK_BALANCE_UPDATE\",\"updated\":[{\"address\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"amount\":\"0\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\"}]}","slots":[]}],"time":1687470826.45542,"monotime":272605.15246458}

mhofman commented 1 year ago

@warner all these go through the bridge device, and the chain storage vat, which means they're already async and in a different crank. In #6741 I make the bridge handling fully async so that the bridge outbound can return a result in a separate block.

TL;DR: all bridge device usages are compatible with background execution, even the ones that return a result.

warner commented 1 year ago

TIL about SQLite support for concurrent transactions, documented at https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md

It behaves as you'd expect: the second txn fails at COMMIT time if it overlaps with an earlier one, and "overlap" happens when both touch the same table or index at "nearby" keys (so in a large table, two modifications of random keys will probably not overlap, but modifying two sequential keys will almost certainly overlap).
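
For reference, a sketch of what the retry loop might look like from JavaScript, assuming a SQLite built from the begin-concurrent branch and a binding (better-sqlite3 here) that passes the statements through unchanged; stock SQLite, and therefore stock better-sqlite3, does not support BEGIN CONCURRENT, so treat this purely as an illustration.

```js
// Illustration only: requires a SQLite built from the begin-concurrent branch;
// stock builds will reject BEGIN CONCURRENT.
import Database from 'better-sqlite3';

const db = new Database('worker.db');

function commitConcurrently(applyWrites, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    db.exec('BEGIN CONCURRENT');
    try {
      applyWrites(db);
      db.exec('COMMIT'); // overlap with another concurrent txn surfaces here
      return;
    } catch (err) {
      db.exec('ROLLBACK');
      // The begin-concurrent docs describe commit-time conflicts as
      // SQLITE_BUSY_SNAPSHOT; anything else is a real error.
      if (err.code !== 'SQLITE_BUSY_SNAPSHOT' || attempt === maxAttempts) {
        throw err;
      }
    }
  }
}
```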

I'm not sure how we could use this, but it's worth remembering. If every vat's vatStore were in a separate table, then the workers could conceivably commit to a kernel-side shared table.

Or the queues with which the kernel and worker talk to each other could be placed in a shared DB file, with separate tables for the kernel-to-worker and the worker-to-kernel directions. In that world, the netstring pipe might only be used to trigger the other side to read from the incoming table (OTOH updating the "tail" pointer, to delete messages, would require more coordination, not unlike what the real comms vat mailbox protocol does: kernel tells worker how far it's read, worker deletes the stale entries when it gets a chance).

The utility of that probably depends on how exactly the atomicity domain should be shaped. If the kernel has its own DB, in which it remembers both "I've told the worker about delivery 4" and all the other changes the kernel is tracking, then something it has sent to the worker may or may not have been received and committed by the worker. Or, if the kernel tracks this memory in the worker's DB, then the kernel might not have committed some related state by that point. Each extra DB means an extra set of interrupt-between-commits cases to analyze.

mhofman commented 1 year ago

I am skeptical of any parallelization scheme that involves going back to the kernel process for every vatStore operation. I would really like the interface between kernel and vat to be akin to a machine-to-machine channel, similar to CapTP.

warner commented 3 months ago

One open question is how the kernel-side scheduler should work. There will be an in-consensus "kernel order", a sequence of events like "process kernel IO input event 4" (e.g. a timer or bridge update), "deliver (retire) vat 5 input-queue event 6", or "process vat 5 output-queue event 7". The vat input events have hopefully been executed earlier, but this is the point at which we demand their results, and any replica which didn't manage to execute that input by now will be forced to block until it is complete.
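
Roughly, the consensus-side loop might have the following shape (hypothetical names, just to pin down what "demand their results" means): the block handler walks the agreed kernel order and, for each retire event, awaits a result that the background worker may or may not have produced yet.

```js
// Hypothetical shape of the in-consensus loop: background workers race ahead
// non-deterministically, but results are demanded (and applied) strictly in
// the agreed kernel order.
async function executeKernelOrder(kernelOrder, workers, kernel) {
  for (const event of kernelOrder) {
    switch (event.type) {
      case 'io-input':
        // e.g. a timer or bridge update entering the kernel
        kernel.processInput(event.payload);
        break;
      case 'retire-delivery': {
        // Demand the completion; blocks only if background execution is behind.
        const completion = await workers.get(event.vatID).resultOf(event.deliveryNum);
        kernel.applyCompletion(event.vatID, event.deliveryNum, completion);
        break;
      }
      case 'process-output':
        kernel.processOutputQueueEvent(event.vatID, event.outputNum);
        break;
      default:
        throw Error(`unknown kernel-order event type ${event.type}`);
    }
  }
  kernel.commit(); // single consensus commit for the block
}
```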

@erights and I talked today about whether we'd benefit from sampled non-determinism instead of an algorithmically in-consensus scheme. The latter would mean all replicas would share some policy, by which they would decide (independently) which input/output events must be processed/retired when.

The former would mean that the block proposer would decide, perhaps by a special kind of transaction. There would be exactly one txn of this type in each block, and it would dictate which events must be processed, and in which order. One possible benefit is that the order might be more realistic: the block proposer could sample their own workers, find out which deliveries had been processed already, and then declare that set to be the official one. If other replicas had similar hardware, and their local (non-deterministic) worker schedulers had made similar progress, then they would also have the same deliveries done by "background" execution by the time the block needed to be executed, and they could vote immediately, instead of stalling for "foreground" execution to catch up.

Of course, this means the block proposer is in a favored position. Also, we must guard against the proposer asking for the impossible. Replicas should not vote for a block that demands they retire invalid deliveries. But I think this introduces some lock-step-ness into the schedule: if block-A executes delivery-1, and delivery-1 creates delivery-2, then the proposer of block-A might demand the retirement of both delivery-1 and delivery-2. But a replica only knows about delivery-1 so far (it won't discover -2 until it executes -1, and it hasn't pre-executed that far yet). So the replica cannot vote for the proposal until it has executed delivery-1 and become convinced that delivery-2 could be executed, otherwise it might be approving a chain-halting action. And it cannot vote against the proposal until it knows for sure that delivery-2 can never exist (perhaps because some alternative delivery got enqueued with the same event number). So I'm not sure how to safely take advantage of this.