
how to parallelize new vat deliveries to improve throughput performance #5747

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

We could speed up chain throughput by a factor of maybe 4 or 8 by executing multiple new vat deliveries in parallel, ideally one delivery per CPU core.

(note: this is different from parallelizing replay, which is much easier because all syscalls are simulated)

Our chain-side execution loop currently processes deliveries one at a time, in a strictly serial order.
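As a rough illustration (a minimal sketch of the one-delivery-at-a-time behavior; `kernel` and `runPolicy` here are hypothetical stand-ins, not the real kernel API):

```js
// Illustrative serial loop: exactly one delivery is in flight at any moment.
async function runBlock(kernel, runPolicy) {
  while (kernel.hasWork() && runPolicy.shouldContinue()) {
    const delivery = kernel.nextDelivery(); // consensus order
    kernel.startCrank(); // open a buffer for this crank's state changes
    const { computrons } = await kernel.deliverToWorker(delivery);
    kernel.commitCrank(); // flush the buffered changes into the kernel DB
    runPolicy.crankComplete(computrons); // the policy decides when the block is full
  }
}
```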

With some extra work, we should be able to perform multiple deliveries in parallel. It's made a lot easier by the fact that each vat runs in its own worker process, so we don't even need threads: the host operating system already knows how to schedule multiple processes onto multiple CPU cores at the same time.

The main tricky part is isolating any state that's referenced by multiple vats. Our vats are pretty isolated, but they can all make syscalls, those syscalls reference (or mutate) state, and some of that state is shared between multiple vats. Our vat syscalls break down into three categories: queue operations like syscall.send and syscall.resolve (which append to the shared run-queue), vatstore operations (string-keyed KV reads and writes scoped to the vat), and synchronous device invocation via syscall.callNow.

Our plans for backpressure, prioritization, and pausing individual vats call for breaking the single kernel-wide run-queue into a separate input queue and output queue for each vat, plus probably an extra kernel-wide queue for actions like "create new vat" and "terminate vat". Many of a vat's syscalls would then mutate only that vat's own output queue, giving us a lot more flexibility around parallelism.
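To make the queue ownership concrete, here's a hedged sketch of what per-vat queues might look like (the shapes and names are mine, not a committed design):

```js
// Each vat owns an input queue and an output queue; a delivery only appends
// to its own vat's output queue, so parallel deliveries never contend.
function makeVatQueues(vatIDs) {
  const queues = new Map(vatIDs.map(vatID => [vatID, { input: [], output: [] }]));
  const kernelActions = []; // kernel-wide queue: create-vat, terminate-vat, ...
  return {
    // safe to call from within a (possibly parallel) delivery
    enqueueOutbound: (vatID, message) => queues.get(vatID).output.push(message),
    enqueueKernelAction: action => kernelActions.push(action),
    // run serially by the scheduler between cranks: route outputs to inputs
    route: targetVatOf => {
      for (const { output } of queues.values()) {
        while (output.length > 0) {
          const message = output.shift();
          queues.get(targetVatOf(message)).input.push(message);
        }
      }
    },
    nextInputFor: vatID => queues.get(vatID).input.shift(),
  };
}
```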

The tricky part is access to devices, because callNow is synchronous. So two vats which each have access to a device node might both invoke callNow, and the results they return might depend upon the order in which they were invoked. In a single-threaded one-delivery-at-a-time system, that order is deterministic. But if we allow two deliveries to run in parallel, the invocation order is no longer deterministic.

Most devices don't do this, but unless we can rule out the possibility, we must either exclude vats with device nodes in their c-lists from parallelism, or have some way to abort speculative deliveries that wind up making a device call. We might also want to mark certain device nodes as non-mutating or insensitive to invocation order, so that we could continue to allow parallelism for vats which hold only those nodes, and deny it just to vats which hold the more sensitive device nodes.

We don't use a lot of device nodes in our system: we obviously cannot avoid them entirely (if we want vat code to influence the outside world at all), but most user-provided vats don't interact with them. Off the top of my head, the ones I can think of are:

A number of built-in or utility vats hold onto device nodes, so that userspace/contract vats don't have to. The contract vat references an object within the utility vat, and that object wraps the device node:
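The pattern looks roughly like this (a hedged sketch; the real vat-timer is more involved, and getLastPolled serves here only as an example device method):

```js
// A utility vat wrapping a device node. D() is the synchronous device-call
// helper that vats with device access receive via vatPowers; Far() marks the
// wrapper as a remotable that contract vats can hold instead of the device.
import { Far } from '@endo/far';

export function buildRootObject(vatPowers) {
  const { D } = vatPowers;
  let timerDevice;
  return Far('root', {
    connectToDevice(deviceNode) {
      timerDevice = deviceNode; // only this vat ever holds the device node
    },
    // contract vats reach this via an eventual send, e.g.
    // E(timerWrapper).getCurrentTimestamp(), so the synchronous device
    // invocation only ever happens from inside this utility vat
    getCurrentTimestamp() {
      return D(timerDevice).getLastPolled();
    },
  });
}
```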

So a lot of contract operations that want to interact with timers, or send messages through the bridge device, will cause vats to be scheduled like [contractVat, vat-timer (uses device-timer), contractVat] or [contractVat, vat-bridge (uses device-bridge), contractVat]. We can parallelize multiple contract-vat deliveries together, but we'd need to serialize the resulting calls to vat-timer or vat-bridge.

The utility vats never do very much work per delivery. That might suggest a scheduler which first performs a large batch of userspace/contract-vat deliveries (parallelizing heavily), then performs a large number of short serialized deliveries to the utility vats. Alternatively, the scheduler could group the potential work by the shared resources it wants to access: put all vats with timer/bridge device-node access in a single group, and serialize all deliveries within that group, while allowing parallelism between that group and the non-device-using contract vats.
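A hedged sketch of that grouping idea (the predicate and shapes are illustrative, not existing code):

```js
// Partition pending deliveries: vats that can reach an order-sensitive device
// form a serialized group; everything else is fair game for parallel cranks.
function planCranks(pendingDeliveries, vatHoldsDeviceNode) {
  const parallel = [];   // device-free vats: run up to PF at a time
  const serialized = []; // device-holding vats: run strictly one at a time
  for (const delivery of pendingDeliveries) {
    (vatHoldsDeviceNode(delivery.vatID) ? serialized : parallel).push(delivery);
  }
  // one plausible policy: run the big parallel batch first, then drain the
  // short serialized utility-vat deliveries
  return { parallel, serialized };
}
```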

Description of the Design

The basic approach would be:

The number of deliveries made in any given block will thus be dictated not just by the runPolicy's computron limit but also by the parallelism factor PF: we'll do up to PF deliveries after the limit is reached. We already behave this way (the computrons consumed by a block always exceed the runPolicy threshold, because we only learn about exceeding the limit after the fact), but currently we act as if PF = 1.
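Something like this (a sketch under the assumptions above: deliveries start in consensus order, buffer their effects in RAM, and retire in that same start order; the names are hypothetical):

```js
// PF-wide dispatch loop. A fast-finishing delivery still waits for every
// earlier-started delivery to retire first, keeping commits deterministic.
async function runBlockParallel(kernel, runPolicy, PF) {
  while (kernel.hasWork() && runPolicy.shouldContinue()) {
    const inFlight = [];
    // launch up to PF deliveries; each buffers its syscalls in RAM
    while (inFlight.length < PF && kernel.hasWork()) {
      const delivery = kernel.nextDelivery(); // consensus order
      inFlight.push(kernel.deliverToWorker(delivery));
    }
    // retire strictly in start order
    for (const done of inFlight) {
      const { computrons, bufferedSyscalls } = await done;
      kernel.commitBufferedSyscalls(bufferedSyscalls); // translate + execute now
      runPolicy.crankComplete(computrons); // may overshoot by up to PF cranks
    }
  }
}
```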

We'll need to modify the syscall implementations to store their pending state changes in RAM until the delivery is retired, so that we aren't making DB changes in a non-consensus order. We currently use vatTranslator.js to convert vat-format syscalls into kernel-format syscalls, and this translation can change shared kernel state (allocation of newly-exported kernel objects, refcount increments). A shared kernelSyscall.js then executes the kernel-format syscalls; this is where e.g. syscall.send appends new items to the shared run-queue. We'd need to rewrite this to enqueue vat-format syscall objects (VatSyscallObjects) until the delivery can be retired, and to perform both translation and execution only at that later point.
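In sketch form (a hypothetical shape, ignoring syscalls that need an immediate result, which the next paragraph addresses):

```js
// Buffer vat-format syscalls (VatSyscallObjects) during the delivery; defer
// both translation (vatTranslator.js) and execution (kernelSyscall.js) until
// the delivery retires, so shared kernel state only changes in consensus order.
function makeBufferingSyscallHandler() {
  const buffered = []; // VSOs, in the order the vat issued them
  return {
    // may run while other vats' deliveries are in flight
    vatSyscall: vso => {
      buffered.push(vso);
      return ['ok', null]; // fine for send/resolve; reads need more (see below)
    },
    // runs serially at retirement, with exclusive access to kernel state
    retire: (translate, execute) => {
      for (const vso of buffered) {
        execute(translate(vso)); // translation may allocate kernel objects, bump refcounts
      }
      buffered.length = 0;
    },
  };
}
```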

The vatstore syscalls are all string-to-string KV store operations, so their translation does not modify refcounts or introduce new objects. Our enqueue-VSO code could therefore execute vatstoreGet/vatstoreGetAfter calls immediately (reading from the vat's portion of the kernel DB), while the mutating vatstoreSet/vatstoreDelete calls would have their changes enqueued. (A later vatstoreGet within the same delivery would also need to see those buffered writes.)

One approach might be to create a separate crankBuffer for each vat. But the crankBuffer only knows about kvStore writes, so a different approach would be to introduce a new kind of buffer (vatBuffer?) that knows more about syscalls than about kvStore writes.
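A hedged sketch of such a vatBuffer (the `${vatID}.vs.` key layout is an assumption about how the vat's portion of the kernel DB is addressed):

```js
// Per-vat buffer for vatstore operations: reads are answered immediately but
// consult this delivery's buffered writes first; the DB is untouched until
// commit() runs serially at retirement time.
function makeVatBuffer(vatID, kvStore) {
  const pending = new Map(); // key -> value, or undefined for a buffered delete
  const prefix = `${vatID}.vs.`; // assumed layout of the vat's DB keyspace
  return {
    get: key => (pending.has(key) ? pending.get(key) : kvStore.get(prefix + key)),
    set: (key, value) => pending.set(key, value), // buffered, not yet in the DB
    delete: key => pending.set(key, undefined),   // buffered deletion
    commit: () => {
      for (const [key, value] of pending) {
        if (value === undefined) {
          kvStore.delete(prefix + key);
        } else {
          kvStore.set(prefix + key, value);
        }
      }
      pending.clear();
    },
  };
}
```

vatstoreGetAfter is harder: an iteration would need to merge the buffered writes with the DB's key ordering rather than reading from either side alone.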

Security Considerations

One of the biggest benefits of CSP (and the Actor model in general) is the complete elimination of shared-state concurrency hazards, so we must be careful not to re-introduce them. Our scheduler must not allow device-node access or kernel DB changes to become dependent upon execution order.

The parallelism factor we choose will put validators with fewer cores at a disadvantage, so the number of cores will become part of our minimum validator requirements.

Test Plan

Not sure. Obviously we need some unit tests on the scheduler and on the code that merges parallel state changes back into a consensus order for application to the DB. But we also need some sort of stress test to make sure we get a consistent order even when some deliveries take more wallclock time than others: probably a randomized test harness that is given a set of parallel deliveries, executes each to completion, and then reports a randomized finishing order. This test should assert that the applied state changes (the activityHash) remain consistent across multiple runs, giving the randomizer a chance to explore a significant portion of the ordering space.
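A hedged sketch of the harness shape (AVA-style; makeTestKernel and the executeWithFinishOrder control knob are imagined fixtures, not existing code):

```js
import test from 'ava';

// shuffle the wallclock finishing order of a fixed batch of parallel
// deliveries and require the committed state to be identical every time
function shuffled(items) {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i -= 1) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

test('parallel deliveries produce a consistent activityHash', async t => {
  let expectedHash;
  for (let run = 0; run < 100; run += 1) {
    const kernel = await makeTestKernel(); // imagined fixture
    const deliveries = kernel.pendingDeliveries();
    // force this run's workers to "finish" in a random wallclock order
    await kernel.executeWithFinishOrder(deliveries, shuffled(deliveries));
    const hash = kernel.getActivityHash();
    expectedHash = expectedHash ?? hash;
    t.is(hash, expectedHash); // same state changes regardless of finishing order
  }
});
```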

mhofman commented 2 years ago

> Most devices don't do this, but unless we can rule out the possibility, then we must either exclude vats with device nodes in their c-lists from parallelism, or have some way to abort speculative deliveries that wind up making a device call.

One way to make this deterministic is to build a kind of mutex lock around devices:
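Something along these lines (my sketch; to keep the grant order deterministic, a delivery would acquire the lock at a deterministic point, e.g. when its crank starts in consensus order, rather than whenever it happens to reach callNow):

```js
// A promise-chain mutex: acquire() resolves with a release function once all
// earlier acquirers have released, so device access is strictly serialized.
function makeDeviceMutex() {
  let tail = Promise.resolve();
  return {
    acquire: () => {
      let release;
      const released = new Promise(resolve => { release = resolve; });
      const granted = tail.then(() => release); // wait for all predecessors
      tail = tail.then(() => released); // successors wait for our release()
      return granted;
    },
  };
}

// usage inside a delivery that may invoke a device:
//   const release = await mutex.acquire();
//   try { result = invokeDevice(deviceNode, args); } finally { release(); }
```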

Obviously this is still non-optimal for high-usage devices, but from what I can tell from the description, these device nodes are mostly wrapped by a single vat.

So, IMO the problem is not really parallelizing devices, but committing parallel execution to the DB before proceeding to the potential next crank. I don't see a good way to do this that isn't in the same "start first" deterministic order, which means that if a lower-precedence delivery completes before a higher-precedence one, we are blocked and can't start a new crank until that higher-precedence delivery completes.

Regarding the test plan: while not entirely related, I've been contemplating whether it'd be possible to build something like CHESS to exhaust the scheduling permutations once we introduce multiple queues. A bit of a hammer solution, but it'd make me sleep better. In theory this could all be driven by the run policy (if enough information is available about the queues).

mhofman commented 1 year ago

@warner is this superseded by #6447?