Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform

Large String device #46

Open dtribble opened 5 years ago

dtribble commented 5 years ago

Provide a device that supports synchronous access to large strings, byte arrays, or hierarchies of the same (for module trees), where the data is stored outside the heap of the Vat. These device capabilities allow large data values to be passed between vats by reference.

Use Cases

With this mechanism, dynamic vat creation can refer to the source to be launched as a large-string reference.

Operations

For a string or byte array

These are analogous to attaching shared immutable pages to multiple domains in KeyKOS.

These can be supported in remoting: the Comms vat can check with the other side whether it already has a large string with the associated cryptographic hash. If so, it passes a reference to that. Otherwise, it passes the data and establishes such a reference.
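
A rough sketch of that handshake, in the spirit of the examples elsewhere in this thread (every name here — hashOf, hasBlob, receiveBlobRef, receiveBlob — is hypothetical, not an existing comms-vat API):

```js
// Hypothetical comms-vat dedup handshake; all names are illustrative.
async function sendLargeString(remote, data) {
  const hash = hashOf(data); // e.g. a SHA-256 digest of the contents
  if (await E(remote).hasBlob(hash)) {
    // The far side already holds the bits: pass only the reference.
    return E(remote).receiveBlobRef(hash);
  }
  // Otherwise ship the data once, establishing the reference for next time.
  return E(remote).receiveBlob(hash, data);
}
```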

warner commented 5 years ago

Dean and I came up with this "blob device" last night, but now I have to think about whether it actually meets our goals. The hope here is that we can keep large strings (i.e. the source code being used to instantiate a dynamic Vat) out of the vat transcript, which is our current performance bottleneck (adding new vats isn't likely to happen fast enough to cause problems for a while, but at least in theory we'd like to handle very large chunks of code without incurring an eternal storage obligation for all of it). The idea was that the special static "vat-creating vat" could use access to both the blob device and the vat-making device to do:

```js
const sourceCode = D(blobdev).get();
const newVatRootObject = D(vatdev).create(name, sourceCode);
E(newVatRootObject).goodMorningVat('welcome to the world');
```

But the other (and more critical) goal is to maintain deterministic execution. I'm trying to come up with rules for our devices to maintain this property. So far, none of our devices actually return anything when invoked. The Mailbox device (which is really the only one we use outside of tests) handles outgoing messages by adding them to a non-kernel-visible "outbox" table, and handles externally-provoked incoming messages by queueing a message to a receiver object. So none of the D(mailboxdev).foo() calls return anything. All the non-deterministic behavior of a device is limited to adding things to the kernel's run-queue, which fits our model just fine ("arrival-order non-determinism"). But the "output" of the device then goes into the receiving vat's transcript, which is the performance thing we'd like to avoid.

So the rule might be "device call return values must be a pure function of the inputs", for which undefined certainly qualifies. The device would have to behave as if it were stateless (at least from the point of view of D(devnode) invocations), so we couldn't really treat it as an updatable key-value store. But it could be a hash-value store (i.e. our usual "the key is the hash of the data" self-validating identifier pattern), if some out-of-band process got the data in there in the first place, and if we can figure out the availability problem.
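
As a sketch of that rule, assuming a sha256 helper and an out-of-band backing store (both hypothetical), a hash-keyed device could look something like:

```js
// Sketch of a content-addressed blob device; names are assumptions.
// get() is a pure function of its input (modulo availability), because
// the key is the hash of the data.
function buildBlobDevice(store, sha256) {
  return {
    get(hash) {
      if (!store.has(hash)) {
        // The availability problem: some out-of-band process must have
        // added the data already, or this call cannot succeed.
        throw Error(`blob ${hash} unavailable`);
      }
      return store.get(hash); // the same hash always yields the same bytes
    },
  };
}
```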

But maybe we should just merge the vat-making device with this blob-store, instead of trying to find the right set of restrictions on the blob-store to maintain our deterministic execution properties.

dtribble commented 5 years ago

I proposed including a ByteArray of some kind. Note that the existing JS typed-array types don't allow for an immutable one, so this would not appear as a full ByteArray.

warner commented 4 years ago

I was thinking about this again today. @FUDCo's recent #1331 vat-config changes make it possible to statically provide a bundle to the kernel (e.g. the ZCF bundle), and then vat-zoe can ask the kernel to create a new dynamic vat loaded with that bundle, using only a short reference name rather than including the whole bundle in a message (thus adding it to at least one transcript, maybe two if the dynamic vat creation mechanism involves an internal vat).

However, Zoe then immediately tells the new ZCF dynamic vat to load a specific contract bundle. These bundles aren't interned the way the statically-defined ZCF bundle is, so this doesn't help amortize the cost of multiple instantiations of that particular contract. To improve that, of course, we need to be able to add new items at runtime, and reference them with short identifiers.

I think we have three or four requirements:

If the blobs show up in the transcript, we're spending secondary storage for their contents (every time they're referenced), but at least they aren't held in RAM forever. If we can keep them out of the transcript too, then we reduce the secondary storage requirements to O(blobsize) rather than O(blobsize)*O(usage).

I had been thinking of a blob device that operates with hashes of data (so successfully fetching HASH1 will always return the same blob). But a vat might try to fetch a missing hash, which must either return an error (making it dependent upon someone else having added the data already, which is both nondeterministic and an unregulated communication channel), or kill the vat (which is anthropically deterministic, in that any timeline which survives the fetch will behave the same as all others, but that's no way to run a service).

So now I'm thinking of the hierarchical identifiers (#455) and the way their data is managed, and how we might apply that here. Our plan for #455 is to restrict the mutable data store to just the one vat, so the liveslots layer can safely cache results without worrying about some other vat changing it while it isn't looking. But we could introduce a shared immutable store without that concern, as long as we ensure that vats can never reach for data that isn't there. And we can accomplish that by introducing a new kind of reference, the blobcap.

Blobcap

A "blobcap" is a c-list -managed reference to a chunk of immutable data, which the kernel references by hash, but which vats reference with a b+NN/b-NN identifier, in the same way they reference objects with o+NN/o-NN and promises with p+NN/p-NN.

The kernel tables currently map koid (Kernel Object ID) to the vat which owns that object, kdid (Kernel Device ID) to the device which owns that device node, and kpid (Kernel Promise ID) to the current promise state (resolved/unresolved, resolution, decider vat, queued messages). This would introduce a new table mapping kbid (Kernel Blob ID, probably a SHA256 hash) to a blob of data. The table would be kept in secondary storage, not RAM.

The c-lists for each vat currently map kpid to/from vpid, kdid to/from vdid, and koid to/from void. This would introduce a fourth mapping, kbid to/from vbid, where kbid is a hash but vbid is an index (b+NN/b-NN).
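
To make the bookkeeping concrete, here is an illustrative shape for the new table and c-list slice (not the actual kernel schema):

```js
// Illustrative shapes only, not the real kernel data structures.
const kernelBlobs = new Map(); // kbid (SHA256 hash) -> blob bytes, kept in secondary storage

const clistForVat = {
  kpidToVpid: new Map(), // promises: kpid <-> p+NN / p-NN
  kdidToVdid: new Map(), // device nodes: kdid <-> d+NN / d-NN
  koidToVoid: new Map(), // objects: koid <-> o+NN / o-NN
  kbidToVbid: new Map(), // proposed: blobs, kbid (hash) <-> b+NN / b-NN
};
```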

The vat (as a whole, i.e. liveslots) will somehow get one function that lets it submit data and allocate a new blobcap (b+NN, where the NN is chosen by the vat, and must be unique or the vat is killed). It would get a second function that accepts a blobcap (either b+NN to retrieve previously-stored data, or b-NN to retrieve data stored by someone else) and returns the data. I think this return must be synchronous, but I'd like to explore the question further. Furthermore, I don't think these can be syscalls, which are recorded in the transcript.
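
Continuing the illustrative shapes above, those two vat-facing functions might look roughly like this (sha256, kernelBlobs, and the function names themselves are all assumptions):

```js
// Hypothetical vat-facing blob functions; everything here is illustrative.
const vbidToKbid = new Map(); // this vat's slice of the blob c-list

function storeBlob(vbid, data) {
  // vbid is 'b+NN', chosen by the vat; reuse is a vat-fatal error
  if (vbidToKbid.has(vbid)) throw Error(`duplicate blobcap ${vbid}`);
  const kbid = sha256(data); // the kernel references blobs by hash
  kernelBlobs.set(kbid, data);
  vbidToKbid.set(vbid, kbid);
}

function retrieveBlob(vbid) {
  // accepts 'b+NN' (our own data) or 'b-NN' (data stored by someone else);
  // synchronous, and deliberately not a transcript-recorded syscall
  const kbid = vbidToKbid.get(vbid);
  return kernelBlobs.get(kbid);
}
```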

Liveslots would then translate the blobref into a Blob representative, in the same way it turns o-NN into a Presence and p-NN into a (Handled)Promise. The Blob representative would have methods on it to access the data. The root object, via vatPowers, would get a Blob constructor, which uses the vat's allocate-blob syscall.

The Blob representative could be held within the vat's heap for arbitrarily long periods of time before something decides it wants to use the data. Therefore we want the data to be retrieved as lazily as possible. This is the same question we have for #455: the "slowcap" approach would require the kernel to know, just before each delivery, what auxiliary data the vat will need, and provide it preemptively, but if the vat doesn't actually want to use the data right away, the vat would have to hold it in RAM for a long time, because the kernel would have no way of telling when the right moment would be to pass it back.

Async vs Sync

So the vat code needs a way to ask liveslots (through the Blob representative) for the data at the last moment, and liveslots (i.e. the vat as a whole) needs a way to ask the kernel or other secondary storage for the contents (syscall or vatPowers or something). If this facility allows for synchronous retrieval, the blob.read() function can be synchronous. If it only offers async access, then blob.read() returns a Promise, but we must worry about determinism. Two different replays of the vat transcript must not be able to observe any difference in the ordering of blob.read() promise resolutions. If there could be a difference, we must record the particular ordering in the transcript, costing us more secondary storage space.

In #455 we decided that contract authors cannot safely work with the reentrancy and interleaving hazards presented by async access to large mutable data. But here, where the data is immutable, I think we could handle async access just fine. The real constraint will be determinism. A synchronous blob read can only resolve to one thing, at one time (immediately). An async read could resolve in the next turn, or twelve turns later.

Some vat workers can support blocking syscalls, others only support async ones. We might reveal this to vats through different methods available on the Blob object (e.g. .read() vs .readSync()). But if vats can be migrated or copied from one platform to another, they might wind up on a different kind of worker, violating the determinism of behavior when accessing a method that is not universally available, even if we solve the problem of making the actual reads deterministic.
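
As a sketch, the Blob representative might expose both flavors, with only one usable on a given worker (method names and shape are assumptions, building on the retrieveBlob sketch above):

```js
// Illustrative Blob representative; method names are assumptions.
function makeBlobRepresentative(vbid, retrieveBlob, workerSupportsBlocking) {
  return harden({
    readSync() {
      // only workers that support blocking syscalls can offer this
      if (!workerSupportsBlocking) throw Error('readSync unavailable on this worker');
      return retrieveBlob(vbid);
    },
    read() {
      // async flavor: resolves on some later turn, which is exactly where
      // the which-turn determinism concern comes in
      return Promise.resolve().then(() => retrieveBlob(vbid));
    },
  });
}
```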

warner commented 4 years ago

There might be a way to merge this and #455, maybe. In #455 it's important for the vat to know that nobody else can change the data, but maybe that could be accomplished by having the vat keep the blobcap closely held, whereas here you register the blobcap and then send it to someone. One is mutable, the other really wants to be immutable. #455 is about sending a blobcap to yourself in the future; this is about sending it to someone else.

The underlying access method (syscall, vatPower.something, global endowment??) and the resulting determinism properties also depend upon whether the data being fetched is mutable or immutable.

If we only pay attention to this issue, not #455, then I can think of a few options:

FUDCo commented 4 years ago

Since the blobcap internally incorporates the hash of the data as the data key, I think deterministic replay of asynchronous reads should be reasonably easy. When the read call is issued, it returns a promise for the data. When the fetch of the data resolves the promise (delivering it at the start of whatever crank it happens to be ready for, just like any other promise resolution), the transcript's record of the promise resolution incorporates the blob hash (but not the data itself). On replay, this tells the execution engine doing the replay which (replayed) crank to deliver the data on, and it can pause the replay however long it needs to actually fetch the bits without the vat being any the wiser. Failure to fetch the data during replay is the same as any other kind of failure that happens during replay.
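
A sketch of what such a transcript record might look like (the shape is illustrative, not the actual transcript format):

```js
// Illustrative transcript entry: the resolution is recorded by hash, so
// the transcript stays O(1) in the blob size.
const entry = {
  type: 'notify',
  promiseID: 'p-12',
  resolution: { blobHash: 'sha256:ab12...' }, // hash recorded, data omitted
};
// On replay, the engine sees blobHash, pauses to fetch the actual bytes
// from secondary storage, and delivers the data in place of the hash.
```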

warner commented 4 years ago

Huh, interesting. So it goes into the transcript, but we deliver something slightly different than the transcript (data vs hash). I can see how that would help.

I was thinking about the nondeterminism of reads which resolve without an additional delivery: vat code does await blob.read(), liveslots does some vatPowers thing to get a promise for the data, and the promise resolves... when? On the next turn? Four turns later? Does it even resolve before the crank ends?

Making the read response show up as a future delivery would address that problem, but I think we'd need to introduce a new kind of delivery: message (for message sends), notify (for promise resolutions), and a new... data? Let's see, vats could get a syscall.read(blobcap) which queues an entry onto the run-queue, and then when that makes it to the top, the worker does arbitrarily-slow secondary-storage reads to get the data, then delivers this new type of thing to the vat along with the data. The transcript records the blobcap but not the data. Vat code can fetch multiple blobs with Promise.all and drop the data when they're done.
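
Restating that flow as a hedged sketch (syscall.read and the 'blobData' delivery type don't exist yet; waitForBlobDelivery, secondaryStorage, and deliverToVat are illustrative names):

```js
// Vat side (inside liveslots): request the data, get a promise for it.
function readBlob(blobcap) {
  syscall.read(blobcap); // hypothetical: queues a read entry on the kernel run-queue
  return waitForBlobDelivery(blobcap); // resolved by the delivery below
}

// Kernel/worker side, when the queued read reaches the top of the run-queue:
async function processRead(vatID, blobcap) {
  const data = await secondaryStorage.get(blobcap); // arbitrarily slow is fine here
  // The transcript records { type: 'blobData', blobcap } but not the data.
  deliverToVat(vatID, { type: 'blobData', blobcap, data });
}

// Vat code can then fetch several blobs at once and drop them when done:
// const [a, b] = await Promise.all([blob1.read(), blob2.read()]);
```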

Cool, I'll noodle on that some more. Thanks!

FUDCo commented 4 years ago

I don't think you have to do anything special at all on the read side -- it's just a message send somewhere that returns a promise. The promise resolution happens whenever the data is available. The only magic is how the data delivery is represented in the transcript.

warner commented 4 years ago

Oh, one other idea before I forget: we'll want an exogenous blobstore-add operation, a way for something outside the kernel to add a bundle to the store. The initial static vat loads could be replaced by a blobstore-add followed by an externally-introduced (c.queueToExport) create-dynamic-vat message that references it. Later contract installs might consist of a negotiation between the submitting client and the blobstore (q.v. that bloom-filter+Jetpack install-only-the-new-modules idea I keep pursuing), followed by telling Zoe about the new handle, without the bundle contents ever actually passing through a message.
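
A rough sketch of that host-side flow (blobstore.add is hypothetical, and the controller call is hand-waved rather than an exact queueToExport signature):

```js
// Illustrative host-side flow; the exact controller API is hand-waved.
const blobcap = blobstore.add(bundleBytes); // exogenous add, outside the kernel
// Then an externally-introduced message references the blob by its cap,
// along the lines of the existing c.queueToExport mechanism:
c.queueToExport('bootstrap', 'o+0', 'createVatFromBlob', [blobcap]);
await c.run();
```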