scheme for parallel/background vat execution #6447

Open warner opened 1 year ago

warner commented 1 year ago

This ticket documents a design that should allow swingset to execute deliveries on multiple vats in parallel, continuously (even while the kernel is not inside a controller.run(), such as during the 5 seconds that Tendermint/Cosmos-SDK spends doing voting/consensus work), while still maintaining a consensus order of deliveries. We'd like this to improve performance over our current scheme, which only executes one vat delivery at a time, single-threaded, and only within controller.run.

The first section defines the "CDP" model, upon which the rest is based. The second section describes how to rewrite the vat worker to fit this model, and how DB commits are arranged to avoid "hangover inconsistency".

CDPs: Communicating Deterministic Processes

This section defines a "CDP" (Communicating Deterministic Process). This is a refinement of CSP (where the S means "Sequential") and the Waterken model, designed to tolerate processes being terminated at unpredictable times, but also to make more opportunities for parallelism.

A CDP is an abstract process which evolves through a sequence of "steps", each of which accepts an input and produces an output. At its core, the CDP is defined by a completely deterministic transition function (of input and previous state) which produces an output and the new state.

[figure: CDP-1]
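As a rough illustration (the names here are invented for this sketch, not the real swingset API), the core of a CDP boils down to a pure function of previous state and input:

```js
// Illustrative sketch only: a CDP's core is a completely deterministic
// transition function from (previousState, input) to (newState, output).
// No clocks, no randomness, no I/O.
const transition = (previousState, input) => {
  const newState = { ...previousState, lastStepInput: input };
  const output = { echo: input };
  return { newState, output };
};
```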

The CDP might exist as a standalone process, or it might share a process with another CDP. Each CDP has its own commit points, and CDPs interact in such a way that one CDP can crash without undermining the assumptions and reliances that other CDPs might have on it.

All blockchains with finalization (e.g. Cosmos-SDK -based chains) behave like a CDP, in which each step is called a "block" or "blockHeight", and the inputs are the contents of the finalized block (mostly the set of transactions to be executed). In the Agoric system, the host application (cosmic-swingset / agd) acts as one CDP, the swingset kernel acts as a second (because of the separate kerneldb commit point), and there are additional CDPs for each of the N active vats.

State vs Runtime Incarnation

We split the CDP into "state" and "runtime". The state is long-lived, created when the CDP is initialized, and surviving until we really don't need it anymore. The state is recorded in a database, and each commit records one or more steps (never a partial step). The state can only be changed by an active runtime.

The runtime is a live OS process (either standalone or embedded in some other process, possibly shared with other CDPs). At any moment, each CDP has either exactly one active runtime (the CDP is currently "online"), or zero ("offline"). CDPs may be kept offline if they are not currently in use (because keeping one around costs some RAM even if nobody talks to it), or they may be preemptively brought online (because starting one costs some CPU time and latency, slowing down response time). All the API calls are made on a runtime, so offline CDPs cannot be manipulated.

Each time a runtime is created, we get a new "runtime incarnation". A CDP's current incarnation may remember slightly more than its persistent state, e.g. steps that it has performed but which have not yet been committed. When an incarnation is shut down or killed, the next incarnation may not remember those steps.

Inputs and Outputs

The submitInput(stepNum, input) method is used to submit a step input, and collectOutput(stepNum) is used to collect a step output. The type/shape of the inputs and outputs are specific to the CDP and its state transformation function.

submitInput() does not block, returns nothing, and is not obligated to execute the given step right away. It can be called multiple times (without intervening collectOutput calls) to fill the execution pipeline, allowing the CDP to perform work in the background or in parallel with other processes. Within any single incarnation (and barring a revert, described below), the sequence of submitInput() calls must use sequential stepNum values: no gaps, no repeats.

collectOutput(stepNum) is an async function whose return Promise does not resolve until the given step has been executed. If the CDP managed to process the requested step in the background, collectOutput may resolve immediately, otherwise the caller must wait until enough steps have been executed to produce the requested answer. It is an error to collect an output of a stepNum for which the CDP has never been given an input (although the input may have been given to an earlier incarnation). As with submitInput, within a single incarnation, the outputs must be collected from sequential stepNum values, with no gaps or repeats.

[figure: CDP-3]

In general, the caller should provide inputs as soon as they are known (and well before the outputs are needed), to keep the pipeline full and maximize the opportunities for efficient parallel execution. If each submitInput() is followed by an immediate await collectOutput() for the same stepNum, there will be few opportunities for parallelism.
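For illustration, a caller that wants parallelism might fill the pipeline like the sketch below (the `cdp` handle and helper names are hypothetical; only submitInput/collectOutput come from this design):

```js
// Sketch: submit several step inputs up front (submitInput never blocks),
// then collect the outputs in order. Each await only blocks if the CDP has
// not executed that far yet in the background.
const runPipelined = async (cdp, firstStepNum, inputs) => {
  inputs.forEach((input, i) => cdp.submitInput(firstStepNum + i, input));
  const outputs = [];
  for (let i = 0; i < inputs.length; i += 1) {
    outputs.push(await cdp.collectOutput(firstStepNum + i));
  }
  return outputs;
};
```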

In the Agoric system, the kernel CDP only performs one step (block) at a time, but vats may be able to execute many steps (deliveries) in parallel.

Replayed Steps

When a CDP runtime is brought online, its DB will remember some number of inputs, and some number of outputs, based upon which steps had been completed before the last commit() performed by its predecessor.

The caller is using a different database, with different commit points. So the caller may have submitted an input and collected an output for e.g. step 5, but then crashed before it was able to perform its own commit. When the caller starts back up again, it will re-submit the input for step 5. This is called a "replayed input", and CDPs must tolerate them.

There are two cases. In the first, the CDP received submitInput(5, input5) but crashed before a commit(). In this case, the second incarnation can trivially tolerate the replay, because it does not remember the original. We rely upon the caller submitting exactly the same input, to avoid confusion.

In the second, the CDP did commit() after receiving input5, and perhaps later inputs as well. We call the highest such committed input "highestSubmittedInputStep". We track another number named "highestRetiredStep" (retirement is described below).

When a CDP runtime is started, the new incarnation's first submitInput() call will accept any stepNum in a range from "highestRetiredStep" (exclusive) to "highestSubmittedInputStep" (inclusive). The second submitInput() call must use a stepNum exactly one higher than the first one, etc.

If the CDP detects that a submitInput() call is replaying an input, it refrains from actually executing the step. The "next expected stepNum" counter is incremented (or initialized), but no other state changes are made. The caller cannot tell the difference; the CDP merely "pretends" to perform the step.

A CDP should ideally have a way to compare the replayed inputs it receives against their originals, so it can throw an error if they diverge. This might help detect errors made by the caller. A single hash of the input would be sufficient to detect divergence and throw an error; to actually diagnose the problem, it would be better to remember the complete input (so the error message can display both the original and the faulty replay). However, given that a replay of an uncommitted original cannot be detected by the CDP, it's not worth putting too much energy or code into remembering the originals.

Likewise, the caller may ask for collectOutput(7) from one incarnation, then crash before persisting the results. When the caller restarts, it creates a new CDP runtime incarnation, and asks it (again) for collectOutput(7). The CDP must return the same outputs that its predecessor did. As above, the first collectOutput() call that the new incarnation receives is allowed to use a stepNum that meets highestRetiredStep < stepNum <= highestSubmittedInputStep, and all successive calls within that incarnation must use the next highest sequential stepNum. The CDP also tracks highestCollectedOutputStep.

This means that each time a step is executed, the CDP must store the full output into its DB, so it can report a consistent+stable value to the caller (now, soon, or from some future incarnation).
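A hedged sketch of the bookkeeping this implies, with illustrative names (a real implementation would execute steps in the background rather than eagerly, and would persist these counters in its DB):

```js
import assert from 'node:assert';

// Sketch: replay-tolerant submitInput/collectOutput. `store` persists
// highestRetiredStep, highestSubmittedInputStep, and the saved outputs;
// `executeStep` is the deterministic transition function.
const makeCdpRuntime = (store, executeStep) => {
  let nextExpectedInput; // undefined until the incarnation's first submitInput
  return {
    submitInput(stepNum, input) {
      if (nextExpectedInput === undefined) {
        // First call of this incarnation: anything in the window
        // (highestRetiredStep, highestSubmittedInputStep + 1] is accepted.
        assert(stepNum > store.highestRetiredStep);
        assert(stepNum <= store.highestSubmittedInputStep + 1);
      } else {
        assert.equal(stepNum, nextExpectedInput); // no gaps, no repeats
      }
      nextExpectedInput = stepNum + 1;
      if (stepNum <= store.highestSubmittedInputStep) {
        return; // replayed input: "pretend", make no other state changes
      }
      store.highestSubmittedInputStep = stepNum;
      store.saveOutput(stepNum, executeStep(stepNum, input)); // eager, for brevity
    },
    async collectOutput(stepNum) {
      assert(stepNum > store.highestRetiredStep);
      assert(stepNum <= store.highestSubmittedInputStep);
      store.highestCollectedOutputStep = stepNum;
      return store.loadOutput(stepNum); // same output, now or in a later incarnation
    },
  };
};
```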

Commit

To improve performance, the CDP is not obligated to commit its DB after every single step. It must never commit the partial states that occur in the middle of a step, but it is allowed to commit at any point between two complete steps.

But to enable the caller to rely upon the stability of the outputs, the caller can insist upon a commit point, by calling the CDP's commit() API. This async method will not resolve until the state and outputs of every step up to and including highestCollectedOutputStep have been committed. The CDP may have performed more work (whose output hasn't been collected yet): if so, these extra steps are committed too. But the caller is only allowed to rely upon the retention of the steps that it has collected.

In general, the CDP should use DB transactions or nested transactions to ensure that only complete steps might be committed, but refrain from doing an actual commit until requested by commit(). If the RAM requirements of the uncommitted base transaction become an issue, it can perform an unsolicited commit without causing problems, but minimizing commits (and their slow fsync() calls) will generally yield better performance.

Note: this is specifically intended to take advantage of SQLite's fast "WAL" mode, with PRAGMA synchronous = NORMAL. In this mode, the DB only records changes when a transaction is closed (preventing partial-step commits), but the DB is vulnerable to power failures and host OS crashes until a "checkpoint" occurs. These checkpoints happen spontaneously when the WAL grows large enough (so we must tolerate them), but can be forced with a PRAGMA wal_checkpoint(FULL). The CDP commit() API should force a checkpoint before returning.
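A sketch of what that looks like in practice, assuming a better-sqlite3 binding (the binding choice and helper names are assumptions, not part of this design):

```js
import Database from 'better-sqlite3';

// Sketch: WAL mode with relaxed syncing. Closing each per-step transaction
// records it in the WAL (never a partial step), but durability against power
// failure only comes from the checkpoint forced by commit().
const openCdpStore = path => {
  const db = new Database(path);
  db.pragma('journal_mode = WAL');
  db.pragma('synchronous = NORMAL');
  return {
    beginStep: () => db.prepare('BEGIN IMMEDIATE').run(),
    endStep: () => db.prepare('COMMIT').run(), // complete step, not yet durable
    commit: () => db.pragma('wal_checkpoint(FULL)'), // now durable
  };
};
```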

Retire

To keep the CDP's "old-outputs storage" obligation bounded, the caller tells the CDP when it is safe to "retire" a step.

Once the caller collects the outputs of e.g. step 7, it will perform some other work, and then eventually commit its own state. Once the caller commits, it can use retire(7) to inform the CDP that it will never ask for the output of step 7 again (nor will it attempt to submit the inputs for step 7). The CDP can then delete the saved output for that step, as well as whatever diagnostic information it retained about the inputs (perhaps a hash, perhaps the full input data). The caller cannot safely retire a step until it has itself committed, otherwise a crash (after the collectOutput(7) but before the caller's commit) would violate the caller's promise. The CDP won't necessarily commit its memory of the retirement promptly (it is not obligated to do so until the commit() API is called), but since it is allowed to commit earlier, the caller must not reveal its intention to retire() until it is really ready for the old data to go away.

The caller might collect the outputs of multiple steps (7, 8, 9) before it gets to a convenient commit point. In that case, it only has to call retire(9), and that will implicitly retire all earlier steps.

A call to retire(9) will also set highestRetiredStep = 9. This value lives in the DB, but is not deserving of its own commit. The value will be committed as a side effect of the next step-execution or commit() -triggered DB commit.

In general, a non-pipelining caller will execute in a loop like the following (suppose the recorded stepNum in the first pass is 6, so the first input submitted is for step 7):
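A hedged sketch of the shape of that loop (the `cdp` and `caller` handles are illustrative):

```js
// Sketch of a non-pipelining caller loop (illustrative names).
const runCallerLoop = async (cdp, caller, lastCommittedStepNum) => {
  let stepNum = lastCommittedStepNum; // e.g. 6 on the first pass
  for (;;) {
    stepNum += 1;                 // first input submitted is for step 7
    cdp.retire(stepNum - 1);      // safe: the caller committed it last pass
    cdp.submitInput(stepNum, await caller.gatherInput(stepNum));
    const output = await cdp.collectOutput(stepNum);
    await cdp.commit();           // CDP's state and output are now durable
    await caller.applyAndCommit(stepNum, output); // caller's own commit point
  }
};
```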

The caller/CDP might crash at various points in this loop:

By performing the retire() at the beginning of the loop, the caller will start each incarnation by clearing out the unnecessary data from its predecessor.

Revert

Some CDPs are one-way, but others can support the revert(backToStepNum) operation. If supported, the caller can instruct the CDP to revert its internal state back to that of some previous step, at which point the caller is allowed to submit different inputs (and can observe different outputs) than before.

revert is generally expensive, and should only be used for exceptional cases. In the Agoric system, the primary example is an important vat whose latest delivery causes an internal fault, which we think we can repair through some sort of vat upgrade process. By reverting the vat to the state just before the fatal delivery, we can arrange for the next delivery to be something which changes/upgrades the internal state (to fix the bug). Another possibility is a metering fault (too much computation), where we don't want the vat to get away with using more computrons than it paid for, so we unwind the expensive delivery and require the vat to pay their compute bills before allowing that delivery to be repeated (note there are some questionable economics here, but the example is still useful). An Agoric vat worker would implement revert by discarding the xsnap worker and building a new one from the most recent heap snapshot (which must have been made before the target step), and omitting a few deliveries from the end of the transcript it replays (everything past the target step). It would also delete the omitted transcript entries, so they don't accidentally reappear in some future incarnation.

There is a limit to how many steps can be reverted: backToStepNum > highestRetiredStep, where highestRetiredStep is updated by each retire() call. A revert-capable CDP must remember (or be capable of regenerating) all old states back to highestRetiredStep + 1. For the Agoric vat worker, that means we must retain at least one heap snapshot with stepNum <= highestRetiredStep + 1 (which might cause us to remember multiple heap snapshots for a single vat).

API Summary

The library that provides a CDP with a given transition function will provide the following API:
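A hedged JSDoc-style summary of the methods described above (shapes are illustrative):

```js
/**
 * Sketch of the CDP runtime API described above (illustrative shapes;
 * inputs/outputs are specific to the CDP's transition function).
 *
 * @typedef {object} CdpRuntime
 * @property {(stepNum: number, input: unknown) => void} submitInput
 *   Non-blocking; stepNums are sequential within an incarnation.
 * @property {(stepNum: number) => Promise<unknown>} collectOutput
 *   Resolves once the step has executed; stable across incarnations.
 * @property {() => Promise<void>} commit
 *   Resolves once everything up to highestCollectedOutputStep is durable.
 * @property {(stepNum: number) => void} retire
 *   Caller promises never to resubmit/recollect this step or earlier ones.
 * @property {(backToStepNum: number) => Promise<void>} [revert]
 *   Optional; only back to highestRetiredStep + 1.
 */
```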

Swingset Kernel as a CDP

Now that we have the CDP framework, how are the various Agoric/Swingset layers expressed?

The cosmic-swingset layer is already pretty well suited. As a block is processed, our cosmos-sdk Handlers push "actions" like bridgeInbound and deliverInbound (defined in https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/action-types.js) into a queue on the Golang side. When we get to the END_BLOCK event (https://github.com/Agoric/agoric-sdk/blob/master/packages/cosmic-swingset/src/launch-chain.js), the JS code submits these actions into the kernel, and then runs the kernel until it reaches the runPolicy limit. As the kernel runs, it emits responses (chainSend) like storage writes. Between chain-main.js and launch-chain.js, the cosmic-swingset package has code that tolerates one block's worth of replay, to handle the case where the kernel commits its DB but the process crashes before cosmos-sdk can commit its own.

We can thus define cosmic-swingset's kernel as a non-reverting CDP whose "steps" are blocks, whose inputs are the "actions", and whose outputs are the chainSend messages. The host application performs steps without any sort of pipelining: each submitInput is immediately followed by a matching await collectOutput(), then an await commit(). The retire method is omitted, and the kernel is implicitly allowed to forget everything beyond one step in the past.
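A hedged sketch of that per-block flow (function and parameter names are illustrative, not the actual launch-chain.js code):

```js
// Sketch: the host drives the kernel CDP one block at a time, no pipelining.
const doBlock = async (kernelCDP, blockHeight, actions, chainSend) => {
  kernelCDP.submitInput(blockHeight, actions);            // the block's actions
  const outputs = await kernelCDP.collectOutput(blockHeight);
  outputs.forEach(msg => chainSend(msg));                 // storage writes, etc.
  await kernelCDP.commit();                               // before cosmos-sdk commits
};
```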

One constraint is that the outputs/chainSend calls are not allowed to have return values. The most likely example would be a chain-storage read, however I don't think we currently do any of those (there might be a method to return the externally-queryable chain-storage IAVL/RPC path: we'd need to implement that in a different way).

Swingset Vat as a CDP

To take advantage of the parallelism offered by CDPs, we need to make some changes to the kernel/worker boundary:

With vatstoreGet/vatstoreGetNext/etc out of the way, syscall.callNow (i.e. device-node invocation) is the only remaining syscall with a non-trivial return value. In practice, we have two kinds of device-node interaction:

The chain-storage writes do not have a return value, so we really only need an answer for getBundle. I think we can manage this by following through on "blobcaps" (#46), which would be a new form of kref/vref (e.g. bNN and vb-NN), that refers to some sequence of immutable bytes. The kernel needs to pay attention to the vrefs being translated in a VatDeliveryObject and notice when a new one has been added to the c-list. That delivery should include a copy of the data, so the worker can stash it in its DB for later access. We'll probably need refcounts on these, or (given how infrequently we use them) we might just decide to declare that vat workers remember the blob forever.

Once that's done, we can express each vat as a CDP where:

To satisfy other compatibility/stability requirements, I think we'll wind up with all of the vat worker's components (endo/lockdown, supervisor, liveslots) living in a single swingset-vat-worker-xsnap-1 package. The kernel will no longer send the lockdown/supervisor bundles into the worker; instead they'll be bundled by the worker library itself and fed to xsnap as needed. The worker's DB will include any heap snapshots and transcripts that it needs to satisfy the CDP API, as well as additional calls to perform a "sleeper-agent" full-transcript-replay kind of upgrade (#1691).

The -1 suffix is a version identifier that combines everything which could influence the deterministic behavior. Once you've deployed a vat with the -1 worker, its behavior will remain stable despite any changes on the kernel side. No new versions of swingset-vat-worker-xsnap-1 will change that behavior: at most they will fix/improve things that do not threaten determinism. To make any significant changes (upgrade liveslots, new version of XS, etc), we must release a new package named swingset-vat-worker-xsnap-2, and (baggage-style) upgrade vats to use it. The first swingset kernel package will depend upon the -1 package, the second can depend upon both -1 and -2, the third can depend solely upon -2 (and all vat upgrades must be completed before you can upgrade a deployment from the second kernel package to the third).

Kernel Scheduler

Our improved scheduling plans call for each vat to have an input queue (of deliveries), and an output queue (of delivery results, including message-like syscalls like syscall.send and syscall.resolve). The kernel will also have a single inbound queue for device input events like bridge-inbound and deliver-inbound and something for the timer device. The kernel scheduler will decide which queue to service next, based on criteria like priority, how many messages a vat has emitted recently, how many messages have been delivered into the vat, etc. Hopefully we can find an algorithm that allows short multi-vat operations to complete fairly promptly, but larger ones to be scheduled fairly.

[figure: kernel-scheduler-1]

The scheduler will be invoked multiple times during the processing of a block. The input will be the size of all the queues (2*numVats+1) and the computron budget remaining. The output will be which queue to service, and how many items to process. If the output is "none", the block ends and control returns to the host.
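A hedged sketch of that decision function (the `policy` helper is hypothetical):

```js
// Sketch: one scheduler decision. `queueSizes` has 2*numVats+1 entries
// (per-vat input and output queues, plus the kernel's single inbound queue).
const schedule = (queueSizes, computronsRemaining, policy) => {
  if (computronsRemaining <= 0) {
    return { queue: 'none' };          // block ends, host regains control
  }
  const queue = policy.pickQueue(queueSizes); // priority, recent usage, etc.
  if (queue === undefined) {
    return { queue: 'none' };
  }
  return { queue, items: policy.itemBudget(queue, computronsRemaining) };
};
```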

When a vat's input queue is serviced, the kernel performs a "delivery crank", whose nominal behavior is to pull the VDO from the head of the input queue and deliver it to the vat, which produces zero or more output events, all of which are pushed onto the tail of the output queue. When the output queue is serviced, the kernel performs a "routing crank", which examines the item, routes any sends to some other vat's input queue (or enqueues them on a promise queue in the kernelDB), and processes any resolves by updating the kernel promise table and maybe enqueueing notify events onto input queues (as well as transferring any queued messages to the new target vat). When the kernel's singular inbound queue is serviced, it also performs a routing crank, to route the message to some device or handling vat.

[figure: kernel-scheduler-2]

The scheduler's decisions form a total ordering of all input/output events. This ordering is in-consensus, and each event should produce a hash (like the current crankHash) which can be folded into the shared state, so divergence is detected quickly. The total order is also broken up into blocks, and the kernel does not run (nor can it change state) between the blocks, which is when the consensus algorithm does its voting work.

[figure: kernel-scheduler-3]

That ordering is too constrained to allow much room for performance improvements, but fortunately it is merely the kernel's required order. We can achieve parallel execution between vats, and also perform execution during the voting time, by allowing each vat to run independently, and only observing a partial order that fits within the kernel-wide total order. The vat charges ahead and executes as many deliveries as it can (out of the ones it has been given so far), and the kernel merely refrains from collecting the results until the appointed moment.

To take advantage of this, as soon as the routing crank pushes a VDO onto the tail of vat-A's input queue, it should immediately call workerA.submitInput() with the VDO, instead of waiting for it to reach the head of that queue. That allows the worker to execute the crank in the background, unsynchronized with anything else, well in advance of when the scheduler actually decides to process that item on the input queue. We might fiddle with the nice settings on each worker to influence the OS kernel's execution priority, but in general we just want to fill each worker with work.

[figure: kernel-scheduler-4]

Suppose VDOs 1/2/3 are pushed onto the input queue. All three have been submitInput()ed to the worker, but none have been collected. At some point, the scheduler will process the input queue and "deliver" VDO-1 to the vat. What it really does at that point is to call await collectOutput(1), which tells the worker "ok now I really do need that execution", and blocks if the worker didn't manage to get that far yet. The output is pushed onto the output queue for later processing, however if we can process that output quickly (and submitInput() the results to some other vat quickly), that will keep the pipelines as full as possible.

There is a tradeoff between fairness (which wants to limit the servicing of a deep output-queue), and using parallelism to improve performance (by feeding more work to other vats). There is also a tradeoff between head-of-line latency (which is minimized by only doing one large-scale operation at a time) and performance (having lots of tasks available to keep the pipelines full).

At some point the kernel scheduler decides that we've done enough work for the block. The workers continue on in the background, but the kernel won't call collectOutput() again until the next block. Instead, the kernel calls commit() (in parallel) on all workers with which it interacted during this block. That ensures that all the previous submitInput and collectOutput data will not be forgotten by a worker which crashes in the meantime. Once all workers have completed their commit, the kernel can commit its own DB (which means it will never again submit or collect those steps). Then the kernel can provide results (chain-storage events, etc) back to the host application.
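A hedged sketch of that ordering (names are illustrative):

```js
// Sketch: end-of-block sequence. Worker commits happen first (in parallel),
// then the kernel commit, then results go back to the host.
const finishBlock = async (touchedWorkers, kernelStore, host) => {
  await Promise.all([...touchedWorkers].map(worker => worker.commit()));
  await kernelStore.commit();      // kernel will never re-submit/re-collect those steps
  host.deliverResults(kernelStore.takeBlockResults()); // chain-storage events, etc.
};
```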

When the kernel starts the next block (actually when it sends the first message of the block to any given vat), it can send a retire() to enable the worker to free up the deliveries that are now safely committed by the kernel. (The kernel could do this as soon as it finishes its own commit, but it saves time on the critical path to defer this until after the host app has regained control.)

Rewinding Deliveries

Sometimes, the kernel must be able to rewind a delivery. The main use case we have so far is the dispatch.stopVat just before a vat upgrade: the old vat is stopped, the new vat (with new code) is started, but if the new vat fails to launch, we want to revert the old vat to the point just before stopVat, as if the upgrade were never requested. We can also imagine some forms of metering faults or invariant checks that would justify unwinding a delivery.

The kernel won't know whether or not the delivery should be unwound until sometime after the delivery outputs are collected. But the kernel can refrain from calling retire() until it knows we no longer need to roll back that far.

The vat worker will support revert(backToStepNum) by retaining at least one heap snapshot with stepNum <= highestRetiredStep + 1. When asked to revert, the worker will kill off any current xsnap process, delete all inputs and outputs beyond backToStepNum from its DB, commit the changes, then start up normally. This will load the most recent heap snapshot into a fresh xsnap and resume execution of the saved inputs. Because this re-execution stops "early" (before the now-deleted inputs), we wind up with a worker state equivalent to the old step, ready to proceed in a different direction.
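A hedged sketch of that revert procedure (method names are illustrative):

```js
// Sketch: worker-side revert(backToStepNum).
const revertWorker = async (worker, backToStepNum) => {
  if (backToStepNum <= worker.store.highestRetiredStep) {
    throw Error('cannot revert past highestRetiredStep');
  }
  await worker.killXsnap();                                   // drop in-RAM state
  worker.store.deleteInputsAndOutputsAfter(backToStepNum);
  worker.store.deleteTranscriptEntriesAfter(backToStepNum);   // so they never reappear
  await worker.store.commit();
  // Normal startup: load the latest surviving heap snapshot, then replay the
  // remaining transcript, stopping "early" at backToStepNum.
  await worker.startFromLatestSnapshot();
};
```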

(Another option is to not restart the worker right away, and instead wait until the kernel scheduler decides to submit a new input or collect a previous output. However that might induce latency in a later block that could have been paid for during earlier idle time.)

Rewinding a worker is not cheap: it takes about as much work as evicting the worker (i.e. to limit concurrent memory consumption) and immediately restarting it. That expense must be considered when deciding how to implement higher-level fault management procedures.

Vat Upgrade, Worker Replacement

Our "baggage-style" vat upgrade process works by shutting down the old worker with a stopVat() delivery, to let it flush any remaining data to its vatstore DB, then launching a new worker with new code, and sending it a startVat() to spin everything back up. The two workers have independent transcripts and run different code: any of SES/xsnap/supervisor/liveslots/vat-bundle/contract-bundle can be changed across the upgrade. Only the vat's identity and the vat store survive (plus all the kernel-side data like c-lists and responsibility for exported kernel objects/promises).

To complicate matters, if either the stopVat or startVat fail, we want to abandon the upgrade, and rewind everything to the point just before the stopVat. So the upgrade process must retain enough data to support the rewind.

Within the CDP model, the CDP state is retained (with modifications), but the CDP runtime worker is replaced.

I think that means our worker must be aware of upgrade, and must be prepared to cooperate with a new worker version (swingset-vat-worker-xsnap-2). The two workers will share a SQLite DB, and the -2 version must not make any changes/migrations that would prevent the -1 from taking over, at least not until the new version is known to be successful.

This suggests:

We may also want an API to report back the transcripts of each upgraded version, concatenated together, to facilitate a sleeper-agent -style "deep replay" upgrade.

warner commented 1 year ago

Some issues raised in today's kernel meeting:

warner commented 1 year ago

Messages pipelined to a promise are queued into the deciding vat, but that vat might resolve the promise (to an object on some other vat) before the delivery finally executes. Currently we detect this just before the moment of delivery and requeue the VDO onto the target-object owning vat's queue. We've talked about proactively yanking all such VDOs out of the deciding vat's queue at the moment of resolution instead.

For this scheme, yanking it from the worker's queue is more difficult. One possibility is for the deciding vat to react to VDOs that target an unknown promise to just reflect them right back out into a syscall.send. The vat previously knew the vref (vpid), but forgot it when it did the resolve: we may have to revisit that once-synchronous logic in the face of async deliveries (just like we'll need to make the GC calls async-safe).

warner commented 1 year ago

vref/kref translation happens on the kernel side, independent of the time of delivery/syscall, which changes our synchronous execution model:

mhofman commented 1 year ago

In practice, we have two kinds of device-node interaction:

  • chain-storage writes
  • bundlecap getBundle reads

We actually have some other ones, like timer polls, which do seem to rely on a synchronous result.

I also see some devices like command, loopbox and plugin, but I'm having a hard time figuring out if/where those are used. Maybe @michaelfig has more context?

warner commented 1 year ago

Yeah this approach would require changes to our syscall/device model.

Bundlecaps would need to be rewritten as read-only immutable "blobcaps", described above, instead of using callNow().

The timer device is no longer intended for general access: only the timer vat should get a reference, and all userspace code should interact with the timer vat instead.

These other devices are pretty specialized and would need new approaches.

I think I wrote the first version of command at a hackathon and then @michaelfig took it over as he built out the ag-solo and REPL systems. loopbox is probably still my responsibility. @michaelfig wrote plugin and still owns it.

warner commented 1 year ago

Worker State

Let's look at what state each worker needs to record, and what features that state is supporting.

Delivery Execution

Vats live inside orthogonally-persistent JavaScript runtimes. The primary state is contained within the object graph (the JavaScript "heap") of a running XS engine (although see the "Vatstore" section below). To process a delivery like dispatch.deliver() or dispatch.notify(), we need an XS engine running which contains the vat bundle, the "liveslots" support code, a SES implementation, and a variety of connecting layers (the "supervisor"). This all lives in a separate program named xsnap. We have one running instance of xsnap per online vat worker, and its state evolves one delivery at a time, as it executes each one to completion.

[figure: deliveries and in-RAM xsnap worker state (Parallel execution - Frame 12)]

If this program could remain running forever, we might not need anything else. But sooner or later the host computer will get rebooted, and/or there might be too many simultaneous workers and we need to shut some down to make room for more active/important ones. So we need a way to preserve the xsnap internal state. Fortunately, xsnap can be instructed to write its internal state (the JavaScript heap / object-graph) to a file, and later we can launch a new xsnap from this "XS heap snapshot" instead of starting from zero.

Writing a heap snapshot is expensive, so we don't do it after every delivery. Instead, we record a "transcript" of what happens during execution, allowing us to snapshot only once every few hundred deliveries. When we need to reload the worker, we start from the most recent snapshot, then re-execute the transcript to bring the worker back up-to-date, relying on the strictly deterministic execution model of our vats. Each transcript entry allows a unidirectional transition from one xsnap state to the next (like P-frames from video compression), and the heap snapshot provides the starting point (like an I-frame).

The transcript entries include both the delivery and a record of all the syscalls made by the worker during its execution, plus their results. Syscalls (especially syscall.vatstoreGet(key)) can read data from sources that might change later, so it is important for the transcript to remember the original answers. When we replay a transcript entry, the syscalls are simulated, not executed. For syscall.vatstoreGet(key), the replay manager fetches the old answer from the transcript and returns it to the worker. The worker is unaware of the difference between a replay and the original.
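A hedged sketch of that replay path (the transcript-entry shape is illustrative):

```js
// Sketch: replay one transcript entry. Each entry records the delivery plus
// the syscalls it made and their original results.
const replayEntry = async (xsnapWorker, entry) => {
  let i = 0;
  const simulateSyscall = request => {
    const expected = entry.syscalls[i];
    i += 1;
    if (JSON.stringify(request) !== JSON.stringify(expected.request)) {
      // divergence from the original execution ("anachrophobia")
      throw Error('replayed syscall diverged from transcript');
    }
    return expected.response; // e.g. the original vatstoreGet answer
  };
  // The worker cannot tell this is a replay: syscalls are answered from the
  // transcript, never executed against live state.
  await xsnapWorker.deliver(entry.delivery, simulateSyscall);
};
```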

The heap snapshot interval drives a tradeoff between normal-runtime performance (more frequent snapshots slow down delivery execution) and restart performance (more transcript entries to re-execute).

[figure: deliveries, RAM worker, transcripts, heap snapshots (Parallel execution - Frame 13)]

We want to do as little replay work as possible, so we always start from the most recent snapshot available. That means each time we write out (and commit) a new heap snapshot, we can delete the older transcript entries, as well as any older snapshots.

"Sleeper Agent" replay-based upgrade

We plan to support two forms of vat upgrade. We have implemented the first, named "baggage-style", which uses the adminFacet~.upgrade(newBundleCap) method, and discards all non-durable state. But in case that doesn't work, we're trying to retain enough information to perform a "sleeper-agent" / "replay-style" upgrade (#1691), in which the entire history of the vat (across all baggage-style upgrades) is replayed. To support this, we need to retain the deliveries (VatDeliveryObjects) forever.

This is independent of retaining transcript entries for worker-reload/replay purposes. We do not retain syscalls or syscall results forever, because a future sleeper-agent -style replay can make different syscalls.

revert()

Sometimes, we need to roll back a vat to some recent state. There are two circumstances that might require this.

The first is a failed baggage-style upgrade. We currently implement these upgrades with a dispatch.stopVat() to the old version, then a dispatch.startVat() to a new worker running the new version. If either one fails somehow, we rewind everything back to the state of the old version, just before it received the stopVat. This is currently managed by the kernel.

The second would be if an arbitrary delivery causes some sort of invariant or metering failure, and we decide the best course of action is to pretend that delivery never happened. We don't have any way to express this right now without also terminating the vat: both metering failures and vats requesting self-termination (vatPowers.exitVatWithFailure(), causing syscall.exit()) will unwind the syscalls and then delete the vat entirely. But we can imagine a vatPowers.abortDelivery() that would revert to the previous state and stick around for later deliveries, rather than terminating the whole vat.

If we support this, we'll need to retain heap snapshots and transcript entries for longer than before. We wouldn't keep them around forever (some new retire() API would indicate when we don't need a given state anymore, allowing the worker to delete the old supporting data). But we'd need the heap snapshots back to the one just before the oldest revertable state, plus all transcript entries forward from that point.

We don't anticipate needing to revert a vat back more than a single delivery: the execution model becomes much more difficult to reason about otherwise. We might choose to disallow revert() altogether, except for the vat-upgrade case (and even then it wouldn't be a general-purpose facility: we might use a single shared DB transaction for both the stopVat and startVat, so unwinding the upgrade simply aborts that txn).

[figure: transcripts, snapshots, retirement schedule (Parallel execution - Frame 14)]

Vatstore

To reduce RAM requirements for vat workers, we try to keep all the inactive high-cardinality state on disk. Vats use "virtual objects" via APIs like defineKind(), and "virtual collections" (e.g. makeScalarBigMapStore()), as well as durable variants that survive a baggage-style vat upgrade. Programmers get a model in which these look like in-RAM objects, but behind the scenes we page them in when referenced, and out when released.

These live in a key/value backing store named the "vatstore", accessed with APIs syscall.vatstoreGet(key) and syscall.vatstoreSet(key, value). The vatstore also holds refcounts and import/export status of otherwise ephemeral references (e.g. "is there any virtual data which retains a reference to imported Presence o-4?"). Each vatstoreSet()/vatstoreDelete() causes an UPDATE to the vatstore DB during delivery execution, and these updates are gathered into a single DB transaction that can be committed at the end of a delivery. The vatstore DB thus always contains the state after the most recent delivery.

Worker replay (e.g. when bringing a worker back online) does not reference the vatstore: it uses the recorded vatstoreGet responses from the transcript instead (since the real DB contents may be too new to be accurate).

Reverting the vat to a previous state requires that the vatstore also be reverted to a previous state.

One approach would be a multi-version vatstore:

Another approach would be to record an undo buffer (thanks to @Tartuffo for the insight):

The multi-version vatstore adds more overhead to queries and modifications, and would take some experimentation to see how much that affects performance. The undo/redo buffer might be more efficient if the buffer remains short (we retire deliveries quickly, giving up the ability to revert() very far), and may provide better support for worker-state snapshots (see below).

[figures: vatstore, reverse deltas, retirement window (Parallel execution - Frames 15 and 16)]

Worker-state snapshots

The final constraint comes from the need for Tendermint/Cosmos "state-sync" snapshots. These enable a brand new validator to download recent data from other validators, check hashes to ensure its correctness, then initialize their state from the download instead of replaying the entire (months/years-long) history of the chain.

Chains which keep their entire state in the Cosmos-SDK IAVL tree will get this for free, because IAVL is a multi-version Merklized key-value DB. The root hash of every version is included as the block's AppHash, providing validation of the entire KV store. The SDK picks one block height out of every N (perhaps 10k, roughly once per day) to retain as a possible snapshot point. After this block has passed (and in the background), old validators write a serialized copy of that version into a file (actually a bunch of 10MB compressed segments), named after the root hash. New validators ask around for what snapshots are available, learn about the segments, download them, populate a new IAVL tree at that height, compare its root hash against the most recent block, and then launch from that point. They must replay all chain transactions (i.e. all blocks) since that point, but the frequency of snapshots means they won't have to replay more than a day's worth of blocks, rather than every block since the genesis. This makes validator startup take time O(N) in the size of the current state vector (more or less dominated by the segment download time, plus some relatively-fixed one-day replay time), rather than O(N) in the age of the chain. This is analogous to what we do for vats with XS heap snapshots and transcript replay, but hashed and validated so it's safe to fetch the data from untrusted sources.

Our chain holds considerable state outside the IAVL tree. To add our vat worker state into this system, we need several features:

(kernel state must be included too, but we'll defer that to later: there's only one kernel, so the problem should be easier to solve)

First, we define the overall state to include only the deliveries that have been collected by the kernel within that particular block. The worker may have seen newer "head start" deliveries, which may or may not have been executed yet, but the state-sync snapshot will ignore these. The new validator's kernel will be responsible for re-submitting them when it starts the new worker.

Now, to rebuild a worker at that state-sync point, we need:

Of these, old vatstore state is the most complicated. The following design isn't particularly elegant, but might work:

[figure: Parallel execution - Frame 17]

The multi-version vatstore DB (with the added/removed columns) might also enable extraction of deltas. The need to hash these deltas quickly (performed on every block, probably during the commit() call) is a hassle, because it would be nice to not touch the extra data at all until someone asks for a snapshot. But we can imagine an extra table which records changes to the primary one (including things like setting removed on an existing row), which behaves like a forward-delta of the table state, and is enough to reconstruct the current version from an earlier snapshot. We would serialize the new rows of this table to compute the validation hash that goes into each block.

The new validator's bringup work would be minimized if we could somehow perform an xsnap JS heap snapshot on all vats at the same time, as that would truncate the transcripts that need to be replayed. However we don't really learn about state-sync snapshot points ahead of time, plus there could be hundreds of vats which are offline (so bringing them online just to write out a fresher snapshot would be pretty expensive). So we should probably just make sure we can handle an XS snapshot made at one point, and a vatstore snapshot made at an unrelated point, and be able to rebuild the overall worker state for any point within the window.

A third approach might be to record a vatstore snapshot at the same time as we write out the XS snapshot. Then, to rebuild a later state, we perform a special kind of replay that draws from the deliveries, not the transcript entries, and uses the real vatstore DB. syscall.vatstoreGet() really reads from the DB, vatstoreSet really mutates the DB, but syscall.send() is still ignored.

This would increase the periodic cost of snapshot writes (dumping the whole vatstore is even more expensive than writing an XS snapshot), but would be simpler, and wouldn't require forward deltas. The new validator would have to replay more expensive work (delivery execution, rather than merely applying DB deltas), but perhaps there would be less of it.

Kernel-state snapshots

The kernel state includes run-queues, c-lists, object/promise tables, and meta-information about each vat. It does not include vat heap snapshots, vatstore contents, completed deliveries, or transcript entries.

The "head-start" deliveries (where the kernel has committed to the order of their delivery, and has provided the worker with a copy so it can get ahead, but has not yet collected the results) live in the kernel's queues (in particular the per-vat input queue), and will be (re-)submitted to workers in a brand new validator.

We never need to revert the kernel state (we only commit on block boundaries, and our chain has finality, so we never unwind a block), making this task simpler than with vats.

I think we can follow a similar approach: periodically write out the entire kernel DB into a single snapshot, write out a forward delta after every block, hash the most recent snapshot plus the hashes of the deltas to compute a "kernel state hash". We include both the kernel state hash and the vat worker hashes into a single swingset hash, and submit that to the cosmos-sdk side (i.e. write it into an IAVL slot) at the end of every block.
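A hedged sketch of folding those hashes together (the hash construction here is illustrative, not a spec):

```js
import { createHash } from 'node:crypto';

// Sketch: combine the kernel state hash with per-vat worker hashes into a
// single swingset hash, written into an IAVL slot at the end of each block.
const computeSwingsetHash = (kernelStateHash, vatWorkerHashes) => {
  const h = createHash('sha256');
  h.update(kernelStateHash);
  const sorted = [...vatWorkerHashes.entries()].sort(([a], [b]) =>
    a.localeCompare(b),
  );
  for (const [vatID, workerHash] of sorted) {
    h.update(vatID);
    h.update(workerHash);
  }
  return h.digest('hex');
};
```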

The vat worker hashes will only change for vats that have been touched during a block: delivery execution does not change their state hash, but collecting results does.

If/when someone asks for a state-sync snapshot, the kernel must spend some time (in the "background") pulling deltas from the DB and writing them to retrievable chunks. The previously-published hash can be used to validate the contents of these chunks.

warner commented 1 year ago

The multi-version DB is related to the concept of "soft deletes". I saw an article (and some discussion) that described some common practices for this, and one thing that jumped out at me was the idea of using a separate VIEW to hold the "current version" table, to reduce the chances of a mistake (if accessing code fails to pay attention to the deleted_at column, it would get corrupted data). You'd build a view that knows about a specific version, all code that isn't trying to do something special goes through that view, and the raw table is only used by special-purpose code (e.g. pruning the old versions).
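A hedged sketch of that view-based layout (assuming a better-sqlite3 binding; table and column names are illustrative):

```js
import Database from 'better-sqlite3';

// Sketch: the raw table keeps every version; a VIEW exposes only live rows,
// so ordinary code cannot accidentally read deleted/stale versions.
const db = new Database(':memory:');
db.exec(`
  CREATE TABLE vatstore_all (
    key TEXT,
    value TEXT,
    added_at INTEGER,    -- deliveryNum that wrote this version
    deleted_at INTEGER,  -- NULL while this version is current
    PRIMARY KEY (key, added_at)
  );
  CREATE VIEW vatstore AS
    SELECT key, value FROM vatstore_all WHERE deleted_at IS NULL;
`);
// Normal reads go through the view.
const getValue = db.prepare('SELECT value FROM vatstore WHERE key = ?');
// Pruning old versions is special-purpose code that touches the raw table.
const pruneBefore = db.prepare(
  'DELETE FROM vatstore_all WHERE deleted_at IS NOT NULL AND deleted_at <= ?',
);
```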

mhofman commented 1 year ago

Swingset Kernel as a CDP

One constraint is that the outputs/chainSend calls are not allowed to have return values. The most likely example would be a chain-storage read, however I don't think we currently do any of those (there might be a method to return the externally-queryable chain-storage IAVL/RPC path: we'd need to implement that in different way).

The chainSend logic is split in cosmic-swingset, implemented in chain-main.js, but used through 5 objects passed to launch in launch-chain.js:

The first 3 use the results from chain sends, however only actionQueue actually uses the result synchronously. https://github.com/Agoric/agoric-sdk/pull/6741 looks at making chain sends buffered to enable execution of swingset while cosmos is busy and in a non-deterministic state.

warner commented 1 year ago

@FUDCo and I were talking about how to get vatstore operations out of the transcript, and how the main reason they're in there is so that we can replay a vat from the snapshot point back up to the present without consulting the vatstore data (which is too new to be referenced by the vatstoreGet calls). He had a really clever idea: instead of the local vatstore DB always holding the latest state (plus deltas to rewind it if necessary), what if we just don't commit until the snapshot point? If the worker is restarted, we have to replay from snapshot anyways, and we have to do all the work of regenerating the vatstore operations anyways, so maybe we just don't remember them so persistently.

That would look like:

We'd wind up with a bunch of outstanding state in the WAL file, which degrades performance when it gets too big, so we'd need to measure that and see how bad it gets. It would be limited by the computrons-before-heap-snapshot limit, which might be sufficient to keep the perf degradation low enough.

A tricky part would be that all other state (especially whatever transcript parts need to be stored) must go into a different DB, which we do commit after each delivery. And that opens up the question of hangovers when we crash between the commit points of the two DBs.

But this might be much simpler than trying to build a versioned DB out of SQLite, either by using a schema with generations and deletion tombstones, or by accumulating a bunch of forward/reverse deltas.

mhofman commented 1 year ago

I love the idea.

My main concern is ensuring consistency on replay and whether we want to abandon that. Right now we quickly detect XS or liveslots bugs because they result in vatstore anachrophobias. Without vatstore in transcript, the execution will still diverge but we'll only notice later when something more drastic happens (e.g. different promise resolution order). This may be mitigated by putting delivery metering under consensus check.

I'm wondering however if we shouldn't keep a separate non-exported/local vatstore transcript which could be used to enforce consistency on replay. This could be optional and enabled only on nodes we run? It would also allow us to generate "full transcripts" so that we can replay in tools without implementing an actual vatstore.

mhofman commented 1 year ago

that opens up the question of hangovers when we crash between the commit points of the two DBs.

I gave this some thought, and along with the need to replicate the vatstore in vstorage, this actually makes things a bit more complicated than I hoped.

Since I believe we should look at moving vatstore in its own DB separately / before parallel execution, I've written up the details in the existing issue for this: https://github.com/Agoric/agoric-sdk/issues/6254#issuecomment-1467051566

FUDCo commented 1 year ago

Notes on parallel execution of vats

This is my take on doing this, largely written (at Brian's suggestion) without close study of the discussions above so as to avoid unconsciously prejudicing my thoughts and possibly missing problems or solutions.

Overview

Vats interact with the kernel at three points: dispatch, syscalls, and crank completion.

Dispatch and crank completion are paired:

(1) The kernel transmits a delivery to the vat. The delivery typically, though not necessarily, represents a message from another vat, but it can also represent requests or notifications from the kernel directly. All deliveries originate with entries on the kernel's run queue, which are transmitted to vats in FIFO order (modulo having multiple run queues with different priorities, which, while affecting the order in which messages get delivered, does not affect the fundamental logic of the kernel-vat interaction -- plus, of course, multiple prioritized kernel run queues are not yet actually implemented).

(2) The vat executes, while the kernel simply waits (albeit processing syscalls, about which more shortly).

(3) The vat transmits a completion to the kernel. The completion indicates the status of the dispatch (i.e., success or failure) along with a precis of resources (computrons) consumed.

At any given time, either the kernel is executing or one particular vat is executing, even though the kernel and the vat (in fact, all the various vats) may each be in separate processes (or, potentially, even on separate machines). This costs us in two ways: First, only one thing at a time gets to execute, even though we have the resources to execute N things at a time. Second, each exchange of control from kernel to vat or from vat to kernel introduces significant, unavoidable latency (this latency applies to syscalls as well as to the dispatch/completion handshake).

The opportunities for parallelism here should be obvious, both between the kernel and the vat(s) and among the various vats themselves. The highly synchronous nature of the current vat-kernel interaction exists mostly to simplify implementation, principally in service of preserving determinism in the order of execution of the work specified by the run queue and in the management of vat state that is held in the kernel's database. However, it should be possible to achieve the same ordering through cleverness rather than brute force, and thereby enable things to run considerably faster.

The sequence of deliveries transmitted into a vat can be treated as a FIFO stream. The sequence of completions transmitted out of a vat can also be treated as a FIFO stream. As long as the FIFO ordering is maintained*, there is no reason in principle that vats cannot be made free running and no reason the streams themselves cannot be pipelined. The things in the current implementation that would prevent this from just working out of the box are few and enumerable: (1) syscalls, (2) platform failures (either of processes or of the communications channels between them), and (3) cross-vat event ordering.

(* There is an additional possible complication in ensuring that the vat itself maintains the proper FIFO ordering without engaging in "causal lookahead"; this will be addressed in a separate section below.)

Syscalls

Syscalls represent the other important point of kernel-vat contact aside from delivery and completion. However, the syscalls that a vat makes do not, for the most part, actually require synchronous coordination, since they do not typically have a return value. They can, instead, be viewed simply as outputs from the vat that could, in principle, be bundled with the completion (though potentially taking advantage of pipelining opportunities, i.e., start transmitting the "completion" before it is actually complete). The exceptions to this are (a) the various vatstore requests, which currently require synchronous access to the kernel database, and (b) the 'invoke' syscall, which currently entails a synchronous interaction with the device being invoked.

We can address (a) by giving each vat its own vatstore database. Since operations on one vat's vatstore do not interact in any way with another vat's interactions with its vatstore (nor do vatstore operations interact in any way with the kernel's own access to its own durable state) this ought to be straightforward. It also provides a further opportunity to speed up computation since the logically independent database operations of unrelated entities can then be allowed to overlap arbitrarily without consequence.

Addressing (b) is more complicated. I can see two general strategies for dealing with this: (i) make all device interactions fully asynchronous, thereby allowing them to be made an ordinary part of the regular dispatch/completion flow, or (ii) bind a device exclusively to a given vat for the duration of their interaction, so that even though their interaction is nominally synchronous, it does not flow through the kernel (or other vats) while it is happening -- even if this entailed an interprocess handshake between the vat and the device's pseudo-vat (much like the current vat-kernel interaction), it would still be a net win because most vats don't interact with devices at all and even those that do don't do it much compared to the volume of other things that they do.

Platform failures

I believe issues of process or communications failure can be dealt with by adopting a Waterken-like protocol between kernel and vat. Although the specific contents and meaning of the communications between vat and kernel are quite asymmetric, the communications relationship itself can be largely symmetric: each side retains a record of what it has transmitted, which it retains until it receives acknowledgement that its transmissions have been successfully received and durably recorded by its counterpart (we can leave open for now the design question of whether the acknowledgements themselves are specific acts or whether they piggyback on top of the other normal acts of communication that flow between the two parties). If one side or the other of the relationship crashes, upon restart and reconnection the first thing each party does is communicate to the other the identity (e.g., sequence number) of the last thing it remembers successfully receiving, whereupon its counterpart resumes transmission from the point in the stream corresponding to the first thing that was lost.

As long as vats are engineered to atomically record their end-of-crank state together with their crank completion communications, I think the issues of database commit synchronization between vat and kernel are handled correctly without any additional special logic.

If the communications channel between kernel and vat is broken and then reconnected (which I don't think is currently a thing that can happen but that could change) all the same logic should apply -- from the kernel's perspective, a vat restarting and a vat reconnecting in many ways look the same. An unanswered design question is whether we would ever even want to have such a "reconnection" operation and if so which side of the kernel/vat relationship would be responsible for doing it. In particular, if a vat loses contact with the kernel, I think the best course is probably for it to unilaterally kill its own process (abandoning whatever crank is running) and wait for the kernel to restart it, rather than trying to do some kind of special communications resumption dance.

A related open question is what the appropriate policy for dealing with actual kernel failure (as opposed to mere connection loss) should be. If the kernel process dies unexpectedly, what should the vats do? My intuition is that this should be treated by them as a form of loss-of-communications and so follow the corresponding exit-on-connection-loss strategy (i.e., ALL the vats kill themselves). When the kernel restarts, it can assume there are no vats running and that part of its job will be to restart all the vat processes. This seems like the cleanest approach, since in this world we don't have to make the kernel responsible for surveying its environment on startup to determine what other processes are running. However, it does raise one dangling issue that I'm not entirely sure how to deal with: what do we do if the kernel dies, restarts, and attempts to restart the vats, all before some vat processes have noticed and are therefore still running? I think some kind of incarnation number scheme may be able to deal with this; the main problem that I see is that the (old) running vat process still has possession of the vat's database. This may be the biggest unsolved problem in all of this.

Another piece of making vat processes more autonomous is that alongside moving the vatstore into the vat's own database, I think we should also make the vat itself responsible for its snapshot and transcript stores. The jobs of loading from a snapshot and replaying from a transcript should be given to Liveslots rather than the kernel, with the kernel's role in vat restart being to initiate the operation rather than to orchestrate it.

Another related open question is how aggressively a vat should record the communications that are sent to it. Since, in the simplest version of the scheme I'm proposing, deliveries are immediately streamed to vats, the vats could, in principle, durably record those messages as soon (more or less) as they are received, thus minimizing retransmission on restart (I believe, possibly mistakenly, that this is the approach Waterken follows). However, since we expect vat process failures, and therefore restarts, to be very rare compared to normal, successful cranks, it's probably not the ideal strategy to optimize for the restart case. Further, since the communications pathway is likely an IPC connection rather than, say, a TCP session running over the internet, the communications channel is probably highly reliable as well. Thus it might make the most sense for a vat to record an incoming delivery at the same time it records the associated crank completion. I believe this would maximize speed. To my sensibilities the primary virtue of more aggressive message capture is to allow a vat's message backlog to pile up on disk rather than consuming RAM, which might minimize the need for more complicated backpressure schemes.

Cross-vat event ordering

In the world described above there is no longer a component that corresponds to the kernel run queue per se, at least as it has been hitherto realized. Instead, each vat would have its own local input queue. However, the kernel does need to track not only the order of deliveries on a per-vat basis but the total order of deliveries across all vats, so as to maintain the illusion of synchronous execution of the (now entirely virtual) run queue. For each delivery that is issued to some vat, the kernel will expect to eventually receive a corresponding completion, and these completions must be made to appear to happen in the same order as the original deliveries were issued, even if they in fact arrive back at the kernel in an entirely different order.

For example, if the kernel sends message #1 to vat A, then message #2 to vat B, then message #3 to vat A, but then vat A reports completions for messages #1 and #3 prior to when vat B reports completion for message #2, the kernel must nevertheless act on these completions in order 1, 2, 3. This is because acting on these completions may entail issuing additional deliveries. Message order must be preserved so that even if kernels run by different validators see the completions of the same messages arrive in different orders, the chain consensus will still reflect a common ordering. I'm sure we will need some kind of interesting kernel data structure for this, which I look forward to programming.
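
One minimal shape for that data structure might be a reorder buffer that holds out-of-order completions and releases them strictly in delivery order; the sketch below is hypothetical, just to illustrate the idea.

```js
// Hypothetical reorder buffer: completions may arrive in any order, but are
// acted upon strictly in the order their deliveries were issued.
class CompletionSequencer {
  constructor(onCompletion) {
    this.onCompletion = onCompletion; // kernel action to run, in order
    this.nextToRetire = 1;            // delivery number we must act on next
    this.pending = new Map();         // deliveryNum -> completion, arrived early
  }

  deliveryCompleted(deliveryNum, completion) {
    this.pending.set(deliveryNum, completion);
    // Retire as many consecutive completions as are now available.
    while (this.pending.has(this.nextToRetire)) {
      const c = this.pending.get(this.nextToRetire);
      this.pending.delete(this.nextToRetire);
      this.onCompletion(this.nextToRetire, c);
      this.nextToRetire += 1;
    }
  }
}

// The example from the text: deliveries 1 and 3 (vat A) complete before 2 (vat B).
const seq = new CompletionSequencer((n) => console.log(`retire #${n}`));
seq.deliveryCompleted(1, {}); // retires #1
seq.deliveryCompleted(3, {}); // held back: #2 not done yet
seq.deliveryCompleted(2, {}); // retires #2, then #3
```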

Preventing causal lookahead

If in some cases deliveries can get pipelined to a vat more quickly than it can process them, the vat has the opportunity to look into the future of its message stream prior to handling any particular message. This is almost certainly a thing we don't want to allow. However, we already trust Liveslots to represent the kernel's interests even though it runs on the vat side of the kernel/vat process boundary. One of the things that Liveslots will have to do is ensure that a given crank execution releases agency prior to the next delivery being given to user code. This is probably the path of least resistance for the Liveslots implementation anyway (i.e., it's how things would likely end up working even if we weren't paying attention), but we should nevertheless remain aware of the need to maintain this invariant, especially if we start getting very clever in our efforts to speed things up.
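
As a sketch of what that invariant could look like in the vat-side supervisor code (hypothetical, not the current liveslots implementation): deliveries may be streamed in and buffered early, but user code is only handed the next one after the previous crank has fully settled.

```js
// Hypothetical vat-side gate: deliveries may be streamed in early, but user
// code only ever sees the next one after the previous crank has released agency.
function makeDeliveryGate(dispatchToUserCode) {
  const backlog = [];
  let busy = false;

  async function pump() {
    if (busy) return;
    busy = true;
    while (backlog.length > 0) {
      const { delivery, resolve } = backlog.shift();
      // Await full quiescence of this crank before exposing the next delivery.
      const completion = await dispatchToUserCode(delivery);
      resolve(completion);
    }
    busy = false;
  }

  // Returns a promise for the crank completion of this delivery.
  return function enqueue(delivery) {
    return new Promise((resolve) => {
      backlog.push({ delivery, resolve });
      pump();
    });
  };
}
```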

mhofman commented 1 year ago

That analysis seems mostly in line with my own understanding.

(ii) bind a device exclusively to a given vat for the duration of their interaction

That is a very interesting suggestion. We had already considered two types of devices: read-only data (e.g. bundles) and write-only (async). This would effectively give us back read-write devices under certain conditions. We could start with devices that are only ever allowed to be in the clist of at most one vat, before moving on to a more complicated dance of acquiring a "lock" (which could be layered on top of single-vat ownership).
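
A possible shape for that simpler starting point, sketched with hypothetical names (not an existing kernel API): a registry that allows a device to be bound to at most one vat at a time, with synchronous invocation permitted only while the binding holds.

```js
// Hypothetical sketch: a device may appear in the clist of at most one vat at
// a time, with an explicit acquire/release step around any synchronous use.
class ExclusiveDeviceRegistry {
  constructor() {
    this.owners = new Map(); // deviceID -> vatID currently bound, if any
  }

  acquire(deviceID, vatID) {
    const owner = this.owners.get(deviceID);
    if (owner !== undefined && owner !== vatID) {
      throw Error(`device ${deviceID} is already bound to ${owner}`);
    }
    this.owners.set(deviceID, vatID);
  }

  release(deviceID, vatID) {
    if (this.owners.get(deviceID) !== vatID) {
      throw Error(`device ${deviceID} is not bound to ${vatID}`);
    }
    this.owners.delete(deviceID);
  }

  // Synchronous invocations are only permitted while the binding holds.
  checkInvoke(deviceID, vatID) {
    if (this.owners.get(deviceID) !== vatID) {
      throw Error(`vat ${vatID} may not invoke ${deviceID} synchronously`);
    }
  }
}
```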

Another piece of making vat processes more autonomous is that alongside moving the vatstore into the vat's own database, I think we should also make the vat itself responsible for its snapshot and transcript stores. The jobs of loading from a snapshot and replaying from a transcript should be given to Liveslots rather than the kernel, with the kernel's role in vat restart being to initiate the operation rather than to orchestrate it.

Strongly agreed, except for the detail that I see "liveslots" as the piece of the vat logic that runs inside the JS engine executing the vat's user code, and as such it would not be the party responsible for handling that part of the vat state. I do see the need for some vat logic that runs independently of the kernel and handles communication with it. Where this code executes is TBD. It would be great to have it run in the same process as the JS engine running the vat code, to avoid further transmission overheads (this is one reason I'm somewhat looking at running XS as a wasm module).

As long as vats are engineered to atomically record their end-of-crank state together with their crank completion communications, I think the issues of database commit synchronization between vat and kernel are handled correctly without any additional special logic.

I think it may not be that simple. The main problem is that the vat and kernel state would live in two separate processes, each with its own independent DB connection (and likely its own DB entirely). As such there is no single atomic commit point; we basically have to deal with hangover inconsistency between the kernel and vat commit points.
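
One possible discipline for coping with that (a hedged sketch with hypothetical APIs, not a settled design): the vat commits its crank state and completion record before reporting the completion, and the kernel records having acted on a completion before its own commit, so a crash on either side results in a retransmission or a re-sent completion rather than divergent state.

```js
// Hypothetical ordering of the two commit points, one per process: the vat
// never reports a completion it has not durably recorded, and the kernel
// never commits without recording which completions it has acted on.
async function runCrank(vatDB, delivery, execute, reportToKernel) {
  const completion = await execute(delivery);
  vatDB.recordDelivery(delivery);    // same transaction as the crank state
  vatDB.recordCompletion(completion);
  vatDB.commit();                    // vat commit point
  reportToKernel(completion);        // only after the vat commit
}

function onCompletion(kernelDB, deliveryNum, completion, applyToKernelState) {
  applyToKernelState(completion);
  kernelDB.recordRetired(deliveryNum);
  kernelDB.commit();                 // kernel commit point, strictly later
}
```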

but the total order of deliveries across all vats, so as to maintain the illusion of synchronous execution of the (now entirely virtual) run queue

I think this is mostly the concern of #5025. In that world I don't believe there should remain a presumptive total ordering of messages. There must be a deterministic effective ordering, but it does not have to be dependent on the relative order of deliveries between vats.

For example, if the kernel sends message #1 to vat A, then message #2 to vat B, then message #3 to vat A, but then vat A reports completions for messages #1 and #3 prior to when vat B reports completion for message #2, the kernel must nevertheless act on these completions in order 1, 2, 3.

I would argue that order "1, 3, 2" or "2, 1, 3" are just as valid. The important part is that this order must be deterministic between validators. In the way I see it, the host embedding swingset will have some input in deciding the effective order between vats.

FUDCo commented 1 year ago

I would argue that order "1, 3, 2" or "2, 1, 3" are just as valid. The important part is that this order must be deterministic between validators. In the way I see it, the host embedding swingset will have some input in deciding the effective order between vats.

Indeed, both orders are arguably valid, but in order to achieve determinism we need an objective rule for all validators to follow that everybody can understand, and matching delivery order seems like a good, clean one. But my main thinking was that this scheme makes the overt behavior of the parallelized swingset match the current implementation, so that the apparent nature of causality (as exhibited in slogs and other historical records) does not change at the point when we switch over to a parallelized implementation. It also allows us to swap in a non-parallelized implementation for debugging purposes without disruption.

mhofman commented 1 year ago

I expect the switch away from a global run queue to per-vat queues to happen before this parallel-execution change.

That said, a mode in which the actual vat executions are globally sequential and match the deterministically chosen kernel/host order for processing the vats' input/output queues is indeed valuable for debugging, and is likely a matter of local configuration (given that we take this as a requirement in our implementation).

Edit: To summarize my point of view, I expect that any observable changes to the processing order of messages will be a consequence of moving to per-vat queues, and that subsequently switching to parallelized execution will not introduce further changes to the order of message processing.

warner commented 1 year ago

Note to self: the VBANK_GRAB device invocation returns a balance of some sort (maybe of the account being modified?). I don't know if we use the result, but if so, that might be a device read which interferes with this scheme, and we might need to find a way to do without the result.

e.g.

{"type":"syscall","crankNum":458161,"vatID":"v10","deliveryNum":39106,"syscallNum":2,"replay":false,"ksc":["invoke","kd30","callOutbound",{"body":"#[\"bank\",{\"amount\":\"6300000\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\",\"sender\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"type\":\"VBANK_GRAB\"}]","slots":[]}],"vsc":["callNow","d-70","callOutbound",{"body":"#[\"bank\",{\"amount\":\"6300000\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\",\"sender\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"type\":\"VBANK_GRAB\"}]","slots":[]}],"time":1687470826.455111,"monotime":272605.152155123}
{"type":"syscall-result","crankNum":458161,"vatID":"v10","deliveryNum":39106,"syscallNum":2,"replay":false,"ksr":["ok",{"body":"#{\"nonce\":569157,\"type\":\"VBANK_BALANCE_UPDATE\",\"updated\":[{\"address\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"amount\":\"0\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\"}]}","slots":[]}],"vsr":["ok",{"body":"#{\"nonce\":569157,\"type\":\"VBANK_BALANCE_UPDATE\",\"updated\":[{\"address\":\"agoric1hz048cqyvk2lvrkr8g2uchvpjfkkza3sr36em9\",\"amount\":\"0\",\"denom\":\"ibc/BA313C4A19DFBF943586C0387E6B11286F9E416B4DD27574E6909CABE0E342FA\"}]}","slots":[]}],"time":1687470826.45542,"monotime":272605.15246458}

mhofman commented 1 year ago

@warner all these go through the bridge device, and the chain storage vat, which means they're already async and in a different crank. In #6741 I make the bridge handling fully async so that the bridge outbound can return a result in a separate block.

TL;DR: all bridge device usages are compatible with background execution, even the ones that return a result.

warner commented 1 year ago

TIL about SQLite support for concurrent transactions, documented at https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begin_concurrent.md

It behaves as you'd expect: the second txn fails at COMMIT time if it overlaps with an earlier one, and "overlap" happens when both touch the same table or index at "nearby" keys (so in a large table, two modifications of random keys will probably not overlap, but modifying two sequential keys will almost certainly overlap).
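
For reference, a sketch of what the retry loop might look like from JavaScript, assuming a SQLite built from the begin-concurrent branch and a binding (better-sqlite3 here) that passes the statements through unchanged; stock SQLite, and therefore stock better-sqlite3, does not support BEGIN CONCURRENT, so treat this purely as an illustration.

```js
// Illustration only: requires a SQLite built from the begin-concurrent branch;
// stock builds will reject BEGIN CONCURRENT.
import Database from 'better-sqlite3';

const db = new Database('worker.db');

function commitConcurrently(applyWrites, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    db.exec('BEGIN CONCURRENT');
    try {
      applyWrites(db);
      db.exec('COMMIT'); // overlap with another concurrent txn surfaces here
      return;
    } catch (err) {
      db.exec('ROLLBACK');
      // The begin-concurrent docs describe commit-time conflicts as
      // SQLITE_BUSY_SNAPSHOT; anything else is a real error.
      if (err.code !== 'SQLITE_BUSY_SNAPSHOT' || attempt === maxAttempts) {
        throw err;
      }
    }
  }
}
```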

I'm not sure how we could use this, but it's worth remembering. If every vat's vatStore were in a separate table, then the workers could conceivably commit to a kernel-side shared table.

Or the queues with which the kernel and worker talk to each other could be placed in a shared DB file, with separate tables for the kernel-to-worker and the worker-to-kernel directions. In that world, the netstring pipe might only be used to trigger the other side to read from the incoming table (OTOH updating the "tail" pointer, to delete messages, would require more coordination, not unlike what the real comms vat mailbox protocol does: kernel tells worker how far it's read, worker deletes the stale entries when it gets a chance).

The utility of that probably depends on how exactly the atomicity domain should be shaped. If the kernel has its own DB, in which it remembers both "I've told the worker about delivery 4" and all the other changes the kernel is tracking, then something it has sent to the worker may or may not have been received and committed by the worker. Or, if the kernel tracks this memory in the worker's DB, then the kernel might not have committed some related state by that point. Each extra DB means an extra set of interrupt-between-commits cases to analyze.

mhofman commented 1 year ago

I am skeptical of any parallelization scheme that involves going back to the kernel process for every vatStore operation. I would really like the interface between kernel and vat to be akin to a machine-to-machine channel, similar to CapTP.

warner commented 3 months ago

One open question is how the kernel-side scheduler should work. There will be an in-consensus "kernel order", a sequence of events like "process kernel IO input event 4" (e.g. a timer or bridge update), "deliver (retire) vat 5 input-queue event 6", or "process vat 5 output-queue event 7". The vat input events have hopefully been executed earlier, but this is the point at which we demand their results, and any replica which didn't manage to execute that input by now will be forced to block until it is complete.
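
Roughly, the consensus-side loop might have the following shape (hypothetical names, just to pin down what "demand their results" means): the block handler walks the agreed kernel order and, for each retire event, awaits a result that the background worker may or may not have produced yet.

```js
// Hypothetical shape of the in-consensus loop: background workers race ahead
// non-deterministically, but results are demanded (and applied) strictly in
// the agreed kernel order.
async function executeKernelOrder(kernelOrder, workers, kernel) {
  for (const event of kernelOrder) {
    switch (event.type) {
      case 'io-input':
        // e.g. a timer or bridge update entering the kernel
        kernel.processInput(event.payload);
        break;
      case 'retire-delivery': {
        // Demand the completion; blocks only if background execution is behind.
        const completion = await workers.get(event.vatID).resultOf(event.deliveryNum);
        kernel.applyCompletion(event.vatID, event.deliveryNum, completion);
        break;
      }
      case 'process-output':
        kernel.processOutputQueueEvent(event.vatID, event.outputNum);
        break;
      default:
        throw Error(`unknown kernel-order event type ${event.type}`);
    }
  }
  kernel.commit(); // single consensus commit for the block
}
```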

@erights and I talked today about whether we'd benefit from sampled non-determinism instead of an algorithmically in-consensus scheme. The latter would mean all replicas would share some policy, by which they would decide (independently) which input/output events must be processed/retired when.

The former would mean that the block proposer would decide, perhaps by a special kind of transaction. There would be exactly one txn of this type in each block, and it would dictate which events must be processed, and in which order. One possible benefit is that the order might be more realistic: the block proposer could sample their own workers, find out which deliveries had been processed already, and then declare that set to be the official one. If other replicas had similar hardware, and their local (non-deterministic) worker schedulers had made similar progress, then they would also have the same deliveries done by "background" execution by the time the block needed to be executed, and they could vote immediately, instead of stalling for "foreground" execution to catch up.

Of course, this means the block proposer is in a favored position. Also, we must guard against the proposer asking for the impossible. Replicas should not vote for a block that demands they retire invalid deliveries. But I think this introduces some lock-step-ness into the schedule: if block-A executes delivery-1, and delivery-1 creates delivery-2, then the proposer of block-A might demand the retirement of both delivery-1 and delivery-2. But a replica only knows about delivery-1 so far (it won't discover -2 until it executes -1, and it hasn't pre-executed that far yet). So the replica cannot vote for the proposal until it has executed delivery-1 and become convinced that delivery-2 could be executed, otherwise it might be approving a chain-halting action. And it cannot vote against the proposal until it knows for sure that delivery-2 can never exist (perhaps because some alternative delivery got enqueued with the same event number). So I'm not sure how to safely take advantage of this.