new device model: read/writeLater

We (me, dean, mark, chris) had a long discussion today about device drivers in SwingSet. Not sure we came to any clear conclusions, but here's what I remember:

We need three properties:

safely manage cross-Realm calls: protect against leaking primal-realm objects into the kernel-realm, by coercing all arguments to primitive types, and catching/rewriting exceptions
maintain transcript-based persistence of vats
provide synchronous (at least) read access to endowments

We have three devices in mind: Mailbox (exists), Timer/Clock (#148), and something to help integrate the cosmos-sdk Bank module's account entries with an Issuer Vat's purses.

Dean's experience suggests that the React model is the best one to follow: handler functions get a snapshot of the current state, plus a function that lets them queue changes to be applied at the end of the operation (iff it doesn't get cancelled or aborted for some reason).

We defined three levels of turns:

a "Turn" takes us from one empty JavaScript stack to the next empty stack
at a "Crank" boundary, both the JavaScript stack and the Vat's runnable-promise queue are empty. There may still be unresolved promises with .then callbacks queued, but there will not be any resolved promises with .then callbacks. This corresponds to a kernel.step(), and is achieved by using setImmediate() to wait on the IO queue, which is strictly lower-priority than the promise queue. SwingSet Vats cannot share JS Promises or resolver functions (they can only share data and reference slots, through the kernel), so nothing in a Vat can get control until the kernel next invokes the dispatch() function, even if other Vats resolve their own promises (i.e. the object graphs of separate Vats are disjoint except for their link through the kernel syscall/dispatch, and those calls don't accept JS promises).
the "Batch" contains as many Cranks as we can do before saving the state to durable storage. In a blockchain environment, each Batch goes into a separate block. We end the Batch when we finish the entire run-queue, or when we've done too much work (e.g. we've run up against the blocksize limit, or gas limit for a single block), or when we want to checkpoint our state so we don't lose too much progress if the host crashes.
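To make the Turn/Crank distinction concrete, here is a minimal Node.js sketch of the setImmediate() trick. runCrank and the log array are made-up names for illustration, not the actual kernel.step() code; the only thing it relies on is that Node runs setImmediate callbacks after the promise queue has drained.

```javascript
// Sketch only: a "crank" ends once the promise queue has drained.
// runCrank is a hypothetical helper, not the real kernel.step().
function runCrank(deliver) {
  return new Promise(resolve => {
    deliver();              // first Turn: the vat's dispatch runs synchronously
    setImmediate(resolve);  // fires only after all ready promise callbacks ran
  });
}

const log = [];
const crankDone = runCrank(() => {
  log.push('turn 1');
  // Each .then callback runs in its own Turn, all inside the same Crank.
  Promise.resolve()
    .then(() => log.push('turn 2'))
    .then(() => log.push('turn 3'));
}).then(() => {
  log.push('crank boundary');
  return log;
});
```

When crankDone resolves, the log holds the three turns followed by the crank boundary, in that order.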

We were converging on a device design that exposes two functions to the calling vat: syscall.read(devnode, args) -> results, and syscall.writeLater(devnode, args) -> error. The first causes the device to be invoked with dispatch.read(devnode, args) and should return synchronous data, but not modify any state. The second would cause two device invocations. The first is a checkWrite that should compare the args against a shadow state object that it builds up, to see if the writes would succeed later (e.g. does a balance transfer underflow, or does an argument refer to a missing timer object). checkWrite is allowed to return an error, but it is not allowed to commit any state changes to its endowments. If it returns success, the kernel pushes a call to dispatch.write to a special queue that does not run until the end of the current Crank (so it will be abandoned if any other Turn causes a vat-killing error). At the end of the Crank, all device writes are delivered, giving the device an opportunity to commit the changes to the endowments.
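A rough sketch of how that read/checkWrite/write flow might fit together, using a bank-like device. Everything here (makeBankDevice, makeKernel, finishCrank, endCrank, the argument shapes) is invented for illustration, and it assumes read() returns the committed endowment state rather than the shadow state, which we did not pin down.

```javascript
// Hypothetical bank-like device; `endowment` (a Map) stands in for host state.
function makeBankDevice(endowment) {
  let shadow = new Map();   // balances as they would look after queued writes
  const balanceOf = acct =>
    shadow.has(acct) ? shadow.get(acct) : (endowment.get(acct) || 0);
  return {
    read: (devnode, acct) => endowment.get(acct) || 0,   // never modifies state
    checkWrite(devnode, { from, to, amount }) {
      if (balanceOf(from) < amount) return 'underflow';
      shadow.set(from, balanceOf(from) - amount);
      shadow.set(to, balanceOf(to) + amount);
      return undefined;     // success; the endowment itself is untouched
    },
    write(devnode, { from, to, amount }) {
      endowment.set(from, (endowment.get(from) || 0) - amount);
      endowment.set(to, (endowment.get(to) || 0) + amount);
    },
    endCrank() { shadow = new Map(); },
  };
}

// Hypothetical kernel fragment: queues writes until the Crank ends.
function makeKernel(device) {
  let pendingWrites = [];
  const syscall = {
    read: (devnode, args) => device.read(devnode, args),
    writeLater(devnode, args) {
      const error = device.checkWrite(devnode, args);
      if (error === undefined) pendingWrites.push([devnode, args]);
      return error;
    },
  };
  function finishCrank(aborted) {
    if (!aborted) {
      for (const [devnode, args] of pendingWrites) device.write(devnode, args);
    }
    pendingWrites = [];     // an aborted Crank simply drops its writes
    device.endCrank();
  }
  return { syscall, finishCrank };
}

const bankTable = new Map([['alice', 10], ['bob', 0]]);
const kernel = makeKernel(makeBankDevice(bankTable));
const err1 = kernel.syscall.writeLater('bank', { from: 'alice', to: 'bob', amount: 7 });
const err2 = kernel.syscall.writeLater('bank', { from: 'alice', to: 'bob', amount: 7 });
const duringCrank = kernel.syscall.read('bank', 'alice');
kernel.finishCrank(false);
const afterCrank = kernel.syscall.read('bank', 'alice');
```

The second transfer is rejected by checkWrite because the shadow state already reflects the first one, even though the Bank table itself does not change until finishCrank.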

Rather than dispatch-style calls, another possible device API could be:

read(devnode, args) -> data
write(devnode, args, S) -> error or function commit(S2) -> newS2

Where both close over the device's endowments. write() gets an S object that is an empty Map for the first call during the Crank, and a copy of the previous one for subsequent calls. Each syscall.writeLater causes an immediate device write() call, and S lets it accumulate both changes that need to be committed at the end of the crank, and a "shadow table" of data that helps figure out if the arguments are legal.

At the end of the Crank, the kernel invokes the commit function with a separate S2 state object. commit should pull the queued writes from S and apply them to the endowment. It can also return a modified S2 object to retain non-endowment state from one Crank to the next. E.g. the Timer device would use S2 to record the set of callback handler objects for each timer, and the Mailbox would use it to record the inbound handler object.
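Here is one way the closure-style API could look for a Timer-ish device. This is a guess at the details: it assumes the kernel hands write() a copy of S and keeps that copy only if write() returns a commit function, and that the kernel calls the most recently returned commit at the end of the Crank. makeTimerDevice, makeCrankRunner and the args shape are all hypothetical.

```javascript
// Hypothetical timer device; `timers` (Map of id -> deadline) is the endowment.
function makeTimerDevice(timers) {
  return {
    read: (devnode, id) => timers.get(id),
    write(devnode, { op, id, when, handler }, S) {
      // Shadow lookup: a change queued in S overrides the endowment.
      const exists = S.has(id) ? S.get(id) !== null : timers.has(id);
      if (op === 'cancel') {
        if (!exists) return new Error(`no such timer: ${id}`);
        S.set(id, null);
      } else {
        S.set(id, { when, handler });
      }
      return function commit(S2) {
        const newS2 = new Map(S2);   // non-endowment state kept across Cranks
        for (const [timerId, change] of S) {
          if (change === null) {
            timers.delete(timerId);
            newS2.delete(timerId);
          } else {
            timers.set(timerId, change.when);
            newS2.set(timerId, change.handler);
          }
        }
        return newS2;
      };
    },
  };
}

// Hypothetical kernel-side bookkeeping for S, S2 and the commit function.
function makeCrankRunner(device) {
  let S = new Map();
  let S2 = new Map();
  let commit;
  return {
    writeLater(devnode, args) {
      const copy = new Map(S);
      const result = device.write(devnode, args, copy);
      if (typeof result !== 'function') return result;  // error: drop the copy
      S = copy;
      commit = result;
      return undefined;
    },
    endCrank() {
      if (commit) S2 = commit(S2);
      S = new Map();
      commit = undefined;
      return S2;
    },
  };
}

const hostTimers = new Map();
const runner = makeCrankRunner(makeTimerDevice(hostTimers));
const setErr = runner.writeLater('timer', { op: 'set', id: 't1', when: 100, handler: 'h1' });
const cancelErr = runner.writeLater('timer', { op: 'cancel', id: 't2' });
const sizeBeforeCommit = hostTimers.size;
const retained = runner.endCrank();
```

The endowment stays empty until endCrank(), and the returned S2 carries the handler for t1 into the next Crank.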

One other note: the motivation for synchronous-read on devices is to find a way to connect an ERTP Issuer (in a swingset vat, within a cosmos-sdk/tendermint chain) with a cosmos-sdk Bank module (which has a table of balances, indexed by a public key), probably for a "native" token like the one used for staking/delegation, or for paying gas fees on txns before swingset even has a chance to see the message. We expect to want to manipulate the account balance table with ERTP Purses.

The Issuer's balance table normally maps from a Purse object to a balance (an integer). We might change that to map to both a balance and a public key. From the ocap side, if you have a purse, you can ask for its public key. If you create a new purse (issuer.makeEmptyPurse()), you can supply a public key. And with enough handwaving, if you have a private signing key that matches the purse, you can sign a message that authorizes a transfer to some other purse, somehow (this part will probably resemble whatever caps-as-secret-data scheme we use to bootstrap new vats, which was straightforward in the "swissnum" days but is less clear now that we're all about clists and handoff tables. But we know we need something for bootstrap, and whatever secrets we use there could conceivably be used to wrangle the layering violations to let a signed message access ocap references).

So the thought is that ocap/swingset messages can pretend that the Issuer vat is the sole source of truth, at least for the duration of a single message processing turn. But the "native" token balances must be correct in the cosmos-sdk Bank table, so "native" messages (Bank.transfer, or gas fees) can deduct them. So the Issuer's .getBalance() or .deposit() needs to be able to ask for the Bank table's balance as if it owned the table, which I think means synchronous reads from the device (if it were limited to async reads, another Vat could tell that the Issuer didn't really own the table, with TOCTTOU issues appearing during .deposit).

But the Bank-balance writes can be deferred until the end of the Crank, which is not observably different than being synchronous, without losing the transactionality of the updates (squelching the writes if the turn aborts).

"The Issuer's balance table normally maps from a Purse object to a balance (an integer)."

Right now, it maps a purse or payment object to an amount. It would be interesting to see if we could store only the quantity, which would be a Nat in the default configuration, but could be other things, including data that contains other amounts.

There were a couple of other issues we considered, and I'm going to record some of the outcomes to save us some of the work of thinking them through again.

Devices are a lot like other vats, and (under the rubric that similar things should either be made the same or be clearly distinct) we talked about whether they should continue to be a distinct thing, with a similar interface, or if we would be better off providing access to non-vat functionality by allowing particular vats to receive endowments. Devices get access to endowments, which give them private access to functionality that can't be implemented in a vat. They implement objects which are accessible to vat code as clist entries supported by the kernel, and can similarly access other objects known to the kernel.

Objects provided by devices are accessed using D(), which is like E(), except that its calls can be synchronous. So far, we've made all those calls send-only, so the device code can't tell that the calls are synchronous. We mostly agreed that devices should have exclusive access to some endowment, but they should be closely held by a "wrapper" vat. The wrapper vat handles asynchronous requests from other vats, and implements them in terms of calls on the device objects. The mailbox device follows this model, and the timer device probably will, but we haven't concluded that it's a pattern we should enforce.

If we were to remove the distinction between devices and vats, this would have implications for how endowments are provided, and how we achieve orthogonal persistence. We might support persistence by having devices record incremental changes to their state, which they could then query during checkWrite, or we could have the device always be reading from the frozen state as of the beginning of the turn, while writes are queued up and sent at turn end.

Dean and I talked more today about the synchronous endowments question. I think we settled on not needing synchronous access, which is great because for Agoric/agoric-sdk#54 I want to make all syscalls async: basically the userspace Vat code will issue syscalls as it runs, but the kernel merely queues them up, then after control returns to kernelspace, the kernel can make async DB requests out to the host to collect all the state it needs to execute those syscalls, as lazily as it wants. If any syscall has to return a synchronous value, I can't defer its execution.
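A small sketch of the queue-then-execute idea. None of this is the real kernel: makeQueuedSyscalls, runDelivery, the record shapes and the db object with an async get() are all stand-ins, just to show that nothing the vat calls needs an answer before control returns to the kernel.

```javascript
// Sketch: userspace syscalls only enqueue a record and return undefined.
function makeQueuedSyscalls() {
  const queue = [];
  const syscall = {
    send(target, method, args) {
      queue.push({ type: 'send', key: target, method, args });
    },
    fulfillToData(promiseID, data) {
      queue.push({ type: 'fulfillToData', key: promiseID, data });
    },
  };
  return { syscall, drain: () => queue.splice(0) };
}

// Hypothetical kernel step: run the vat, then execute the queued syscalls,
// fetching whatever state each one needs from an async host DB.
async function runDelivery(dispatch, db) {
  const { syscall, drain } = makeQueuedSyscalls();
  dispatch(syscall);                      // vat code runs to completion first
  const executed = [];
  for (const call of drain()) {
    const state = await db.get(call.key); // lazy, async lookup per syscall
    executed.push({ ...call, state });
  }
  return executed;
}

const fakeDb = { get: async key => `state-of-${key}` };
const delivery = runDelivery(syscall => {
  syscall.send('o+1', 'foo', []);
  syscall.fulfillToData('p-2', 'done');
}, fakeDb);
```

A synchronous-return syscall would not fit this shape, since the kernel would have to fetch state in the middle of dispatch().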

We focused on Meters, and how to manage them with normal ERTP primitives. Specifically we thought about the execution-fee layer, which exists to protect nodes/validators against spam and DoS attacks. The idea is that every message must include a deposit, and if we can bound the amount of time/CPU/etc we spend on the message before having enough information to claim the deposit, then at least there's an economic argument against unbounded junk messages. To keep this early-verification cost low, we can't be doing a lot of work (no kernel invocations, no Vat messages yet): just a simple signature check and deduction of a ledger entry indexed by the public key. If that ledger entry would go negative, the message is rejected. If the message is accepted, and passes subsequent (more expensive) checks, then maybe the deposit is refunded, or maybe it's just transferred to the validator as payment, or something.
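The early check could be as small as the following sketch. admitMessage, the msg fields and the verifySignature callback are placeholders (a real check would use whatever signature scheme the chain uses); the point is only that it touches nothing but the ledger.

```javascript
// Hypothetical pre-kernel admission check: no kernel or Vat work happens here.
function admitMessage(ledger, msg, deposit, verifySignature) {
  if (!verifySignature(msg)) {
    return { accepted: false, reason: 'bad signature' };
  }
  const balance = ledger.get(msg.publicKey) || 0;
  if (balance - deposit < 0) {
    return { accepted: false, reason: 'insufficient deposit' };
  }
  ledger.set(msg.publicKey, balance - deposit);   // claim the deposit up front
  return { accepted: true };
}

const gasLedger = new Map([['pk1', 5]]);
const alwaysValid = () => true;                   // stand-in for a real signature check
const first = admitMessage(gasLedger, { publicKey: 'pk1' }, 3, alwaysValid);
const second = admitMessage(gasLedger, { publicKey: 'pk1' }, 3, alwaysValid);
```

Here the first message is admitted and the second is rejected, because only 2 tokens remain after the first deposit.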

This pay-something-for-execution model looks sufficiently like the Meters and Keepers that we plan to use in the Agoric/agoric-sdk#23 escalator scheduler that we're just calling them (gas) Meters. So each Meter is associated with a specific public key and a matching entry in the ledger (which lives outside the SwingSet kernel). Now the trick is that we want to be able to refresh these Meters using ERTP Payments.

We figure that we'll have an Assay (the new name for Issuer) that manages these gas tokens. This Assay can issue Purses. People who are trading these tokens will have regular Purses with some balance, but there's a special extra Purse that represents all the tokens that are somewhere in the ledger (owned by Meters). There's a Meter Manager (meter maid?) which owns this purse. The MeterManager has an API that lets you create a new Meter by giving it a public key, which it uses to create a new ledger entry. The Meter you get back can accept a Payment (which wraps a Purse) to "feed the meter". The manager deposits those tokens into the special extra Purse, waits for that to finish, then issues an (async) device message to increment the given ledger entry by the given number of tokens. This message goes into a queue (just like Mailbox messages) that is processed by the "host loop", after the kernel has finished the Crank. The tokens will be available in the ledger some time after the Meter was fed.

The Meter also supports a withdrawAll message. This sends another async message to the ledger device that says "reduce the balance to zero, and tell me what the previous balance was". When the ledger device reacts to this (again, outside the Crank), it appends a message to the run queue that says "the old balance was XYZ". When the Meter receives this message (in some future crank), it withdraws XYZ tokens from the shared Purse, and emits a Payment for that balance. We can also provide a withdraw message that takes a desired amount and then just might fail in some way if that amount wasn't really in the ledger (maybe all-or-nothing, maybe withdraw-up-to). But the invariant is that the Meter never gets to directly set the ledger balance to any absolute value (except for zero): only deltas.

Finally the Meter supports a getBalance message, which does the same device interaction except without the reduce-to-zero. Like Heisenberg's uncertainty principle, you can never learn the current balance: you only get to learn a balance some time after you asked, and some time before you got the answer. Inbound messages might deduct tokens from your Meter before or after getBalance samples it (or, if you're a validator, maybe it added tokens).

Likewise, withdrawAll might race against external changes to the ledger balance. This might result in a non-zero final balance, if the deposit effectively arrived after the withdrawal.
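A toy simulation of the Meter flow and the withdrawAll race, with the host loop and the run-queue modeled as plain arrays. All names (makeLedgerWorld, hostLoop, runCranks) are made up, Payments are just numbers, and the gas-fee deduction is done by poking the ledger directly to stand in for the outside world.

```javascript
function makeLedgerWorld() {
  const ledger = new Map();   // lives outside the kernel: public key -> balance
  const deviceQueue = [];     // async device messages, handled by the host loop
  const runQueue = [];        // replies delivered to the Meter in a later crank
  let sharedPurse = 0;        // the special Purse backing every Meter

  function makeMeter(pubkey) {
    ledger.set(pubkey, 0);
    return {
      feed(payment) {
        sharedPurse += payment;                            // deposit first
        deviceQueue.push({ op: 'add', pubkey, amount: payment });
      },
      withdrawAll(onPayment) {
        deviceQueue.push({
          op: 'zero',
          pubkey,
          reply: old => { sharedPurse -= old; onPayment(old); },
        });
      },
      getBalance(onBalance) {
        deviceQueue.push({ op: 'get', pubkey, reply: onBalance });
      },
    };
  }

  function hostLoop() {       // runs after the kernel has finished the Crank
    for (const m of deviceQueue.splice(0)) {
      const old = ledger.get(m.pubkey);
      if (m.op === 'add') ledger.set(m.pubkey, old + m.amount);
      if (m.op === 'zero') ledger.set(m.pubkey, 0);
      if (m.reply) runQueue.push(() => m.reply(old));
    }
  }
  function runCranks() {
    for (const job of runQueue.splice(0)) job();
  }
  return { ledger, makeMeter, hostLoop, runCranks };
}

const world = makeLedgerWorld();
const meter = world.makeMeter('pk1');
meter.feed(50);
world.hostLoop();                                      // ledger: pk1 -> 50
world.ledger.set('pk1', world.ledger.get('pk1') - 5);  // outside world charges gas
const payments = [];
meter.withdrawAll(p => payments.push(p));
meter.feed(20);                                        // races with the withdrawal
world.hostLoop();                                      // zero the entry, then add 20
world.runCranks();                                     // Meter learns the old balance
```

The Meter only ever sends deltas or "zero it": the withdrawal yields 45, and the late feed leaves a non-zero final ledger balance of 20.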

We think this is sufficient to do what we need w.r.t. gas balances, and that probably means it's sufficient for other cosmos-sdk Bank module balances. I think it means you can't use your Meter as an exclusive Purse (because there's always some outside-the-SwingSet means for its balance to be changed, so vat code can never really have exclusive access to it). But you can get Payments from it, and deposit Payments into it, just like Purses.

We thought a lot about gas and Meters/Keepers too. We can afford an async lookup of a Meter state before calling a Vat's dispatch.deliver, as long as we follow a rule that once we give control to the Vat code, we don't need to fetch any further state until it returns control to the kernel.

So I'm planning to go ahead with the async-ification of the SwingSet syscall API.

I added some notes to a new ticket before remembering this one, here are those notes:

The current SwingSet device model defines "devices" as containers very much like vats: the kernel dispatches some messages into them, they can make some syscalls back out through the kernel, and there is a c-list between the two. Where vats export "objects", devices export "device nodes". But unlike vats, which are completely isolated from the outside world and can only communicate through the kernel, devices are given some collection of endowments: arbitrary JavaScript objects that can do whatever the host (who created them) likes.

Vats interact with each other by calling syscall.send(). The kernel translates the arguments through the vat's c-list, and puts the resulting message on the back of the kernel's run-queue. When the message gets to the front, it looks up the owner of the target object, translates the arguments through the target vat's c-list, and calls that vat's dispatch.deliver(). A similar pathway works between vats for Promises: one vat does syscall.fulfillToData (or one of its siblings), and some other vat(s) eventually get a dispatch.notifyFulfillToData().

Vats interact with device nodes through a special syscall named syscall.callNow(). The kernel translates the arguments through the vat's c-list, figures out which device owns the device node, translates the arguments again through the device's c-list, then dispatches into the device's dispatch.invoke(). The other special thing about devices is that vats can interact with them synchronously. Unlike syscall.send(), which returns (nothing) as soon as the message is queued, syscall.callNow() waits for the kernel to invoke the target device, and waits for that device to return a value. Whatever dispatch.invoke() returns is translated just like arguments would be, and is returned to the calling vat as the return value of syscall.callNow(). This is exposed to the upper-level code in a vat as the "D" invocation: retval = D(devnode).methodname(args).
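A stripped-down sketch of that callNow path. The slot names ('d-5', 'kd20', 'd+1'), makeMiniKernel and the clock device are invented for the example, and only the target is translated through the c-lists; slot-bearing arguments and return values would go through the same tables.

```javascript
// Hypothetical mini-kernel: c-lists are Maps from local slot -> kernel slot.
function makeMiniKernel() {
  const owners = new Map();   // kernel slot -> device that owns the device node
  return {
    addDevice(device) {
      for (const kernelSlot of device.clist.values()) {
        owners.set(kernelSlot, device);
      }
    },
    makeSyscall(vatCList) {
      return {
        callNow(vatSlot, method, args) {
          const kernelSlot = vatCList.get(vatSlot);   // vat slot -> kernel slot
          const device = owners.get(kernelSlot);      // which device owns the node
          let deviceSlot;
          for (const [local, k] of device.clist) {    // kernel slot -> device slot
            if (k === kernelSlot) deviceSlot = local;
          }
          // Synchronous: the vat blocks until the device returns a value.
          return device.dispatch.invoke(deviceSlot, method, args);
        },
      };
    },
  };
}

const clockDevice = {
  clist: new Map([['d+1', 'kd20']]),
  dispatch: {
    invoke: (deviceSlot, method, args) =>
      (deviceSlot === 'd+1' && method === 'now' ? 1234 : undefined),
  },
};
const miniKernel = makeMiniKernel();
miniKernel.addDevice(clockDevice);
const vatSyscall = miniKernel.makeSyscall(new Map([['d-5', 'kd20']]));
const now = vatSyscall.callNow('d-5', 'now', []);
```

The value the device returns from dispatch.invoke() comes straight back as the return value of callNow(), which is what D(devnode).methodname(args) would expose to vat code.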

The idea was to provide a general-purpose (but capability-friendly) way for vats to interact with the outside world, mediated by devices. This makes syscall.callNow() roughly similar to a unix ioctl() call on a character device: arbitrary arguments, arbitrary return value, with synchronous/blocking semantics.

read-old/write-later API

Dean has recommended a more structured API. Vats would be limited to invoking devices with either an explicit read() or an explicit write(). Both would accept arguments. The API would incorporate some kind of transaction boundary, perhaps a crank or a block. read() would be defined to return data about the state of the world at the beginning of the current temporal region: two read()s in the same block would always return the same data, no matter what write() calls appeared in between. And write() would be defined to queue up state changes that will be applied at the end of this temporal region. A read/write/read sequence would pretend that the write didn't happen. A write/write sequence, with arguments that cause the writes to overlap, would let the last one win.
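A minimal sketch of those semantics, with the "temporal region" modeled by an explicit endRegion() call. makeRegionDevice and the key/value shape are illustrative assumptions, not a proposed API.

```javascript
// Sketch: reads see the state as of the start of the region; writes are
// queued and applied when the region ends.
function makeRegionDevice(world) {
  let snapshot = new Map(world);   // frozen view for the current region
  let pending = new Map();         // queued writes; a later write to a key wins
  return {
    read: key => snapshot.get(key),
    write: (key, value) => { pending.set(key, value); },
    endRegion() {
      for (const [key, value] of pending) world.set(key, value);
      pending = new Map();
      snapshot = new Map(world);   // the next region starts from the new state
    },
  };
}

const worldState = new Map([['x', 1]]);
const dev = makeRegionDevice(worldState);
const before = dev.read('x');
dev.write('x', 2);
dev.write('x', 3);                 // overlaps the previous write
const during = dev.read('x');      // pretends the writes didn't happen
dev.endRegion();
const after = dev.read('x');
```

Both reads inside the region return 1, and the first read of the next region returns 3, the value from the last overlapping write.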

devices

Devices aren't just passive: the device itself gets to push messages onto the run-queue. We need this for any sort of inbound events:

when the mailbox device receives new inbound messages, it does a syscall.sendOnly to the vat-tp/comms vat with the subset that were not already acked
when the host informs the timer device about time passing, the device might trigger vat messages if an alarm/timeout occurs or a repeater has reached its next firing time
inbound IBC messages get routed to a handler vat, and/or the vat-tp/comms vat

However these events only happen in the spaces between cranks/blocks: they do not (should not / must not) occur while a crank is running. So in a sense the inbound events are updating the "state of the world" before the block, and any device reads happening during the block should sample that updated state.

Agoric / agoric-sdk

new device model: read/writeLater #55