
how to parallelize new vat deliveries to improve throughput performance #5747

Open warner opened 2 years ago

warner commented 2 years ago

What is the Problem Being Solved?

We could speed up chain throughput by a factor of maybe 4 or 8 by executing multiple new vat deliveries in parallel, ideally one delivery per CPU core.

(note: this is different from parallelizing replay, which is much easier because all syscalls are simulated)

Our chain-side execution loop currently processes deliveries one at a time, in a strictly serial order.
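As a rough illustration (a minimal sketch of the one-delivery-at-a-time behavior; `kernel` and `runPolicy` here are hypothetical stand-ins, not the real kernel API):

```js
// Illustrative serial loop: exactly one delivery is in flight at any moment.
async function runBlock(kernel, runPolicy) {
  while (kernel.hasWork() && runPolicy.shouldContinue()) {
    const delivery = kernel.nextDelivery(); // consensus order
    kernel.startCrank(); // open a buffer for this crank's state changes
    const { computrons } = await kernel.deliverToWorker(delivery);
    kernel.commitCrank(); // flush the buffered changes into the kernel DB
    runPolicy.crankComplete(computrons); // the policy decides when the block is full
  }
}
```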

With some extra work, we should be able to perform multiple deliveries in parallel. It's made a lot easier by the fact that each vat runs in its own worker process, so we don't even need threads: the host operating system already knows how to schedule multiple processes onto multiple CPU cores at the same time.

The main tricky part is isolating any state that's referenced by multiple vats. Our vats are pretty isolated, but they can all make syscalls, those syscalls reference (or mutate) state, and some of that state is shared between multiple vats. Our vat syscalls break down into three categories: queue operations like syscall.send and syscall.resolve (which append to the shared run-queue), vatstore operations (string-keyed KV reads and writes scoped to the vat), and synchronous device invocation via syscall.callNow.

Our plans for backpressure, prioritization, and pausing individual vats call for breaking the single kernel-wide run-queue into a separate input queue and output queue for each vat, plus probably an extra kernel-wide queue for actions like "create new vat" and "terminate vat". Many of a vat's syscalls would then mutate only that vat's own output queue, giving us a lot more flexibility around parallelism.
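To make the queue ownership concrete, here's a hedged sketch of what per-vat queues might look like (the shapes and names are mine, not a committed design):

```js
// Each vat owns an input queue and an output queue; a delivery only appends
// to its own vat's output queue, so parallel deliveries never contend.
function makeVatQueues(vatIDs) {
  const queues = new Map(vatIDs.map(vatID => [vatID, { input: [], output: [] }]));
  const kernelActions = []; // kernel-wide queue: create-vat, terminate-vat, ...
  return {
    // safe to call from within a (possibly parallel) delivery
    enqueueOutbound: (vatID, message) => queues.get(vatID).output.push(message),
    enqueueKernelAction: action => kernelActions.push(action),
    // run serially by the scheduler between cranks: route outputs to inputs
    route: targetVatOf => {
      for (const { output } of queues.values()) {
        while (output.length > 0) {
          const message = output.shift();
          queues.get(targetVatOf(message)).input.push(message);
        }
      }
    },
    nextInputFor: vatID => queues.get(vatID).input.shift(),
  };
}
```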

The tricky part is access to devices, because callNow is synchronous. So two vats which each have access to a device node might both invoke callNow, and the results they return might depend upon the order in which they were invoked. In a single-threaded one-delivery-at-a-time system, that order is deterministic. But if we allow two deliveries to run in parallel, the invocation order is no longer deterministic.

Most devices don't do this, but unless we can rule out the possibility, we must either exclude vats with device nodes in their c-lists from parallelism, or have some way to abort speculative deliveries that wind up making a device call. We might also want to mark certain device nodes as non-mutating or insensitive to invocation order, so that we could continue to allow parallelism for vats which hold only those nodes, and deny it just to vats which hold the more sensitive device nodes.

We don't use a lot of device nodes in our system: we obviously cannot avoid them entirely (if we want vat code to influence the outside world at all), but most user-provided vats don't interact with them. Off the top of my head, the ones I can think of are:

A number of built-in or utility vats hold onto device nodes, so that userspace/contract vats don't have to. The contract vat references an object within the utility vat, and that object wraps the device node:
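The pattern looks roughly like this (a hedged sketch; the real vat-timer is more involved, and getLastPolled serves here only as an example device method):

```js
// A utility vat wrapping a device node. D() is the synchronous device-call
// helper that vats with device access receive via vatPowers; Far() marks the
// wrapper as a remotable that contract vats can hold instead of the device.
import { Far } from '@endo/far';

export function buildRootObject(vatPowers) {
  const { D } = vatPowers;
  let timerDevice;
  return Far('root', {
    connectToDevice(deviceNode) {
      timerDevice = deviceNode; // only this vat ever holds the device node
    },
    // contract vats reach this via an eventual send, e.g.
    // E(timerWrapper).getCurrentTimestamp(), so the synchronous device
    // invocation only ever happens from inside this utility vat
    getCurrentTimestamp() {
      return D(timerDevice).getLastPolled();
    },
  });
}
```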

So a lot of contract operations that want to interact with timers, or send messages through the bridge device, will cause vats to be scheduled like [contractVat, vat-timer (uses device-timer), contractVat] or [contractVat, vat-bridge (uses device-bridge), contractVat]. We can parallelize multiple contract-vat deliveries together, but we'd need to serialize the resulting calls to vat-timer or vat-bridge.

The utility vats never do very much work per delivery. That might suggest a scheduler which first performs a large batch of userspace/contract-vat deliveries (parallelizing heavily), then performs a large number of short serialized deliveries to the utility vats. Alternatively, the scheduler could group the potential work by the shared resources it wants to access: put all vats with timer/bridge device-node access in a single group, and serialize all deliveries within that group, while allowing parallelism between that group and the non-device-using contract vats.
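A hedged sketch of that grouping idea (the predicate and shapes are illustrative, not existing code):

```js
// Partition pending deliveries: vats that can reach an order-sensitive device
// form a serialized group; everything else is fair game for parallel cranks.
function planCranks(pendingDeliveries, vatHoldsDeviceNode) {
  const parallel = [];   // device-free vats: run up to PF at a time
  const serialized = []; // device-holding vats: run strictly one at a time
  for (const delivery of pendingDeliveries) {
    (vatHoldsDeviceNode(delivery.vatID) ? serialized : parallel).push(delivery);
  }
  // one plausible policy: run the big parallel batch first, then drain the
  // short serialized utility-vat deliveries
  return { parallel, serialized };
}
```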

Description of the Design

The basic approach would be:

The number of deliveries made in any given block will thus be dictated not just by the runPolicy's computron limit but also by the parallelism factor PF: we'll do up to PF deliveries after the limit is reached. We already behave this way (the computrons consumed by a block always exceed the runPolicy threshold, because we only learn about exceeding the limit after the fact), but currently we act as if PF = 1.
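Something like this (a sketch under the assumptions above: deliveries start in consensus order, buffer their effects in RAM, and retire in that same start order; the names are hypothetical):

```js
// PF-wide dispatch loop. A fast-finishing delivery still waits for every
// earlier-started delivery to retire first, keeping commits deterministic.
async function runBlockParallel(kernel, runPolicy, PF) {
  while (kernel.hasWork() && runPolicy.shouldContinue()) {
    const inFlight = [];
    // launch up to PF deliveries; each buffers its syscalls in RAM
    while (inFlight.length < PF && kernel.hasWork()) {
      const delivery = kernel.nextDelivery(); // consensus order
      inFlight.push(kernel.deliverToWorker(delivery));
    }
    // retire strictly in start order
    for (const done of inFlight) {
      const { computrons, bufferedSyscalls } = await done;
      kernel.commitBufferedSyscalls(bufferedSyscalls); // translate + execute now
      runPolicy.crankComplete(computrons); // may overshoot by up to PF cranks
    }
  }
}
```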

We'll need to modify the syscall implementations to store their pending state changes in RAM until the delivery is retired, so that we aren't making DB changes in a non-consensus order. We currently use vatTranslator.js to convert vat-format syscalls into kernel-format syscalls, and this translation can change shared kernel state (allocation of newly-exported kernel objects, refcount increments). A shared kernelSyscall.js then executes the kernel-format syscalls; this is where e.g. syscall.send appends new items to the shared run-queue. We'd need to rewrite this to enqueue vat-format syscall objects (VatSyscallObjects) until the delivery can be retired, and to perform both translation and execution only at that later point.
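In sketch form (a hypothetical shape, ignoring syscalls that need an immediate result, which the next paragraph addresses):

```js
// Buffer vat-format syscalls (VatSyscallObjects) during the delivery; defer
// both translation (vatTranslator.js) and execution (kernelSyscall.js) until
// the delivery retires, so shared kernel state only changes in consensus order.
function makeBufferingSyscallHandler() {
  const buffered = []; // VSOs, in the order the vat issued them
  return {
    // may run while other vats' deliveries are in flight
    vatSyscall: vso => {
      buffered.push(vso);
      return ['ok', null]; // fine for send/resolve; reads need more (see below)
    },
    // runs serially at retirement, with exclusive access to kernel state
    retire: (translate, execute) => {
      for (const vso of buffered) {
        execute(translate(vso)); // translation may allocate kernel objects, bump refcounts
      }
      buffered.length = 0;
    },
  };
}
```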

The vatstore syscalls are all string-to-string KV store operations, so their translation does not modify refcounts or introduce new objects. Our enqueue-VSO code could therefore execute vatstoreGet/vatstoreGetAfter calls immediately (reading from the vat's portion of the kernel DB), while the mutating vatstoreSet/vatstoreDelete calls would have their changes enqueued. (A later vatstoreGet within the same delivery would also need to see those buffered writes.)

One approach might be to create a separate crankBuffer for each vat. But the crankBuffer only knows about kvStore writes, so a different approach would be to introduce a new kind of buffer (vatBuffer?) that knows more about syscalls than about kvStore writes.
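A hedged sketch of such a vatBuffer (the `${vatID}.vs.` key layout is an assumption about how the vat's portion of the kernel DB is addressed):

```js
// Per-vat buffer for vatstore operations: reads are answered immediately but
// consult this delivery's buffered writes first; the DB is untouched until
// commit() runs serially at retirement time.
function makeVatBuffer(vatID, kvStore) {
  const pending = new Map(); // key -> value, or undefined for a buffered delete
  const prefix = `${vatID}.vs.`; // assumed layout of the vat's DB keyspace
  return {
    get: key => (pending.has(key) ? pending.get(key) : kvStore.get(prefix + key)),
    set: (key, value) => pending.set(key, value), // buffered, not yet in the DB
    delete: key => pending.set(key, undefined),   // buffered deletion
    commit: () => {
      for (const [key, value] of pending) {
        if (value === undefined) {
          kvStore.delete(prefix + key);
        } else {
          kvStore.set(prefix + key, value);
        }
      }
      pending.clear();
    },
  };
}
```

vatstoreGetAfter is harder: an iteration would need to merge the buffered writes with the DB's key ordering rather than reading from either side alone.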

Security Considerations

One of the biggest benefits of CSP (and the Actor model in general) is the complete elimination of shared-state concurrency hazards, so we must be careful not to re-introduce them. Our scheduler must not allow device-node access or kernel DB changes to become dependent upon execution order.

The parallelism factor we choose will put validators with fewer cores at a disadvantage, so the number of cores will become part of our minimum validator requirements.

Test Plan

Not sure. Obviously we need some unit tests on the scheduler and on the code that merges parallel state changes back into a consensus order for application to the DB. But we also need some sort of stress test to make sure we get a consistent order even when some deliveries take more wallclock time than others: probably a randomized test harness that is given a set of parallel deliveries, executes each to completion, and then reports a randomized finishing order. This test should assert that the applied state changes (the activityHash) remain consistent across multiple runs, giving the randomizer a chance to explore a significant portion of the ordering space.
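A hedged sketch of the harness shape (AVA-style; makeTestKernel and the executeWithFinishOrder control knob are imagined fixtures, not existing code):

```js
import test from 'ava';

// shuffle the wallclock finishing order of a fixed batch of parallel
// deliveries and require the committed state to be identical every time
function shuffled(items) {
  const a = [...items];
  for (let i = a.length - 1; i > 0; i -= 1) {
    const j = Math.floor(Math.random() * (i + 1));
    [a[i], a[j]] = [a[j], a[i]];
  }
  return a;
}

test('parallel deliveries produce a consistent activityHash', async t => {
  let expectedHash;
  for (let run = 0; run < 100; run += 1) {
    const kernel = await makeTestKernel(); // imagined fixture
    const deliveries = kernel.pendingDeliveries();
    // force this run's workers to "finish" in a random wallclock order
    await kernel.executeWithFinishOrder(deliveries, shuffled(deliveries));
    const hash = kernel.getActivityHash();
    expectedHash = expectedHash ?? hash;
    t.is(hash, expectedHash); // same state changes regardless of finishing order
  }
});
```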

mhofman commented 2 years ago

> Most devices don't do this, but unless we can rule out the possibility, then we must either exclude vats with device nodes in their c-lists from parallelism, or have some way to abort speculative deliveries that wind up making a device call.

One way to make this deterministic is to build a kind of mutex lock around devices:
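Something along these lines (my sketch; to keep the grant order deterministic, a delivery would acquire the lock at a deterministic point, e.g. when its crank starts in consensus order, rather than whenever it happens to reach callNow):

```js
// A promise-chain mutex: acquire() resolves with a release function once all
// earlier acquirers have released, so device access is strictly serialized.
function makeDeviceMutex() {
  let tail = Promise.resolve();
  return {
    acquire: () => {
      let release;
      const released = new Promise(resolve => { release = resolve; });
      const granted = tail.then(() => release); // wait for all predecessors
      tail = tail.then(() => released); // successors wait for our release()
      return granted;
    },
  };
}

// usage inside a delivery that may invoke a device:
//   const release = await mutex.acquire();
//   try { result = invokeDevice(deviceNode, args); } finally { release(); }
```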

Obviously this is still non-optimal for high-usage devices, but from what I can tell from the description, these device nodes are mostly wrapped by a single vat.

So, IMO the problem is not really parallelizing devices, but committing parallel execution to the DB before proceeding to the potential next crank. I don't see a good way to do this that isn't in the same "start first" deterministic order, which means that if a lower-precedence delivery completes before a higher-precedence one, we are blocked and can't start a new crank until that higher-precedence delivery completes.

Regarding the test plan: while not entirely related, I've been contemplating whether it'd be possible to build something like CHESS to exhaust the scheduling permutations once we introduce multiple queues. A bit of a hammer solution, but it'd make me sleep better. In theory this could all be driven by the run policy (if enough information is available about the queues).

mhofman commented 1 year ago

@warner is this superseded by #6447?