Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

"external" vat upgrade ideas #10028

Open warner opened 2 weeks ago

warner commented 2 weeks ago

What is the Problem Being Solved?

When we designed the vat upgrade mechanism (E(adminNode).upgrade(newBundlecap)), we kinda assumed that the first version of every vat would be prepared for upgrade. Back then, we figured it was just a question of the vat keeping all its state in durable baggage. But, time ran out, and we were unable to implement (and/or test) upgradability for all vats. And since then, we've discovered other impediments to upgrade, such as downstream vats not reacting well to the disconnected promises decided by the upgraded vats, where it is Vat2 that is preventing us from upgrading Vat1.

As a result, while we've successfully upgraded several vats, we have at least a few which cannot be upgraded, or for which our full upgrade process is way more complicated than we want.

One workaround is to launch a replacement vat, then convince all the original vat's clients to talk to the replacement instead. We're in the process of doing that with the price feed vats; however, we had to start by upgrading the client vat to be able to accept this "please talk to a replacement" request.

@erights and @dtribble were brainstorming about ways to accomplish our goals more easily, so we started talking about kernel support for this "please talk to a replacement" feature.

Description of the Design

The starting point is a client VatA, which has an object Alice, who is talking to a service object Bob in VatB:

external vat upgrade - Frame 1

Both VatA and VatB are "upgrade naïve": they're difficult (or impossible) to upgrade. But, we need to upgrade the service provided by VatB anyways. Our plan is to introduce a new VatC, with an object "Carol" which is prepared to take over the duties of Bob. When we're done, we want Alice to be transparently reconnected to Carol instead of Bob:

external vat upgrade - Frame 2

Preliminaries

We start with our standard ground rule. We must maintain ocap security: no ambient authority, and connectivity begets connectivity.

We leverage the sizable authority of the adminNode. This is the object returned by the kernel's vatAdminService when a vat is created. If Vat1 asks the kernel to create Vat2, the caller will receive an object we'll call adminNodeVat2. With this AdminNode, the caller (something in Vat1) can direct the kernel to upgrade Vat2 to new code, retaining access to durable state, including all imports, and retaining the right to define the behavior of all exports. AdminNodes also provide .terminate().

We establish a rule: you can't "fight the future". The first version of Vat2 is entirely vulnerable to subsequent versions: whatever code it gets upgraded to will have full access to all its durable state. We declare that we won't support attempts by earlier versions to limit the power of later versions by e.g. deliberately keeping authorities in non-durable storage. Further, we assume that upgrade is intended to provide as much access as possible, even if the earlier version forgot to retain something that was important for the later version (our #1691 scheme would effectively reacquire such dropped authorities).

That means the acceptable authority of an AdminNode extends to manipulating things inside the corresponding vat. If you can upgrade a vat, you can grab any object the vat has access to (because you could upgrade the vat to code that gives you that object), or you can forward one of its exports to some other object (modulo questions about object identity) (because you could upgrade the vat to code that forwards each message).

Ideally, object identity is monotonic. Some of the approaches we discuss would "merge" two objects into a single one, or manipulate c-lists to change one vat's notion of what its Presence points to. This could lead to presence1 !== presence2 at one point in time, but === at a later point in time, or to situations where a Presence sent out of the vat might round-trip back as a different Presence. That could get messy. It might be unavoidable, but if at all possible, we want all object identity comparisons to remain stable across the upgrade/replacement.

Approach 1: Remap Imports

Given the AdminNode for VatA (the client), we could build a mechanism to remap its import of Bob to point at Carol instead.

The API would be something like E(adminNodeVatA).remapImports([ [bob, carol] ]). Upon receipt, vat-vat-admin would invoke device-vat-admin, which would use its vat-admin-hooks.js API to tell the kernel to edit VatA's c-list entry, replacing the kref side, inserting Carol's kref where Bob's once was.

external vat upgrade - Frame 3

This only affects VatA: if there are multiple clients, they must all be remapped separately.

VatB can continue to run, and VatC can be fully established before the remapping; neither needs to be in any special state.

VatA retains the same Presence object across the remapping, with its pre-existing vref (o-2). If VatA somehow had previous access to Carol, it might have a separate Presence (with e.g. o-6), already mapped to that same kref. This would cause c-list translation problems (breaking the "one to one" rule of c-lists). So the remapImports API should be defined to throw an error if any of the replacement objects are already present in the vat's c-list. This is easier to avoid if the replacement VatC is not yet fully operational when the remapping occurs, and it hasn't spread its replacement objects widely enough to risk them appearing at the client yet. (Note that it must launch at least enough to deliver the replacement objects to the parent vat, which can send them into the remapImports API).
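To make the c-list edit concrete, here is a minimal sketch in plain JavaScript of a one-to-one c-list with a remapImports operation that refuses a replacement kref the vat already holds. This is a toy model, not kernel code: makeCList, its methods, and the krefs ko20 and ko30 are hypothetical names chosen for illustration.

```javascript
// Toy model of one vat's c-list: a one-to-one mapping between the vat's
// vrefs (e.g. 'o-2') and kernel krefs (e.g. 'ko20'). All names are hypothetical.
const makeCList = () => {
  const vrefToKref = new Map();
  const krefToVref = new Map();
  return {
    add(vref, kref) {
      if (vrefToKref.has(vref) || krefToVref.has(kref)) {
        throw Error(`c-list must stay one-to-one: ${vref} <-> ${kref}`);
      }
      vrefToKref.set(vref, kref);
      krefToVref.set(kref, vref);
    },
    krefFor: vref => vrefToKref.get(vref),
    vrefFor: kref => krefToVref.get(kref),
    // Replace the kref side of existing import entries, keeping each vref.
    remapImports(pairs) {
      // Validate every pair first, so a failure leaves the c-list untouched.
      const incoming = new Set();
      for (const [oldKref, newKref] of pairs) {
        if (!krefToVref.has(oldKref)) {
          throw Error(`${oldKref} is not in this vat's c-list`);
        }
        if (krefToVref.has(newKref) || incoming.has(newKref)) {
          throw Error(`${newKref} would appear twice in this vat's c-list`);
        }
        incoming.add(newKref);
      }
      for (const [oldKref, newKref] of pairs) {
        const vref = krefToVref.get(oldKref);
        krefToVref.delete(oldKref);
        krefToVref.set(newKref, vref);
        vrefToKref.set(vref, newKref);
      }
    },
  };
};

const vatA = makeCList();
vatA.add('o-2', 'ko20'); // VatA's import of Bob
vatA.remapImports([['ko20', 'ko30']]); // ko30 stands in for Carol's kref
console.log(vatA.krefFor('o-2')); // the same vref now translates to Carol's kref
```

Checking all pairs before changing anything is a design choice of the sketch, so that a rejected call cannot leave the c-list half-remapped.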

Approach 2: Commandeer Exports

TBD

Promises

TBD

Security Considerations

I believe these APIs are not more powerful than the existing upgradeVat() authority, modulo the possibility of creating identity discontinuities.

Scaling Considerations

Remapping a moderate number of objects should not be a scaling problem, although there might be cases where we need to scan all vat c-lists for a given kref, which would then scale with the number of vats (moderate now, but maybe larger in the future).
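As a rough illustration of that scan, the sketch below assumes each vat's c-list offers a kref-to-vref lookup, so the work is one probe per vat. The clistsByVat table, the findHolders helper, and all vat IDs and krefs are hypothetical.

```javascript
// Hypothetical kernel state: vatID -> (kref -> vref) lookup for that vat.
const clistsByVat = new Map([
  ['vatA', new Map([['ko20', 'o-2']])],
  ['vatB', new Map([['ko20', 'o+5']])],
  ['vatD', new Map([['ko20', 'o-9'], ['ko31', 'o-3']])],
]);

// Visit every vat once. Each per-vat lookup is a single Map probe, so the
// total cost grows with the number of vats rather than the number of objects.
const findHolders = kref => {
  const holders = [];
  for (const [vatID, krefToVref] of clistsByVat) {
    const vref = krefToVref.get(kref);
    if (vref !== undefined) {
      holders.push({ vatID, vref });
    }
  }
  return holders;
};

console.log(findHolders('ko20'));
```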

Test Plan

Upgrade Considerations

To add new functionality to the AdminNodes, we must upgrade the kernel, device-vat-admin, and vat-vat-admin.

The kernel upgrade is straightforward. However, we don't currently have any mechanism to upgrade devices, so we'd need to invent one and implement it in the kernel. For vat-vat-admin, we have controller.upgradeStaticVat(), but we don't currently have a great place to call that from the cosmic-swingset chain code. The final issue is that upgrading vat-vat-admin will cause the existing E(adminNode).done() promises to disconnect, and e.g. Zoe might mistake that for a contract vat dying, and might react by exiting all seats and returning all escrowed assets.

So there are many steps we must figure out before we could deploy these new APIs.

warner commented 2 weeks ago

Approach 2: Remap Exports

We can also approach this from the exporting side. The goal here would be to use adminNodeVatB to cause the Bob export to be forwarded, at the kernel level, to the replacement Carol object.

external vat upgrade - Frame 4

This would update all clients, even ones the upgrading party doesn't know about.

The simplest (but problematic) API would be E(adminNodeVatB).remapExports([ [bob, carol] ]). The kernel would use Bob to look up the kref, and then change the kernel object table to point the .owner for that kref at VatC (as the owner of Carol). It would then modify the VatC c-list for Carol's vref to map it to Bob's old kref.

If VatB is still operating, it would need to remove the VatB c-list entry, to maintain the invariant that objects are never simultaneously exported by multiple vats. It would be easier if VatB were disabled somehow (not terminated, which would trigger deletion, but it should certainly not be receiving deliveries).

The trouble is that Carol's original kref is already floating around: at the very least, both the upgrading (parent) vat and vat-vat-admin have seen it, and have Presences for both it and the original Bob. When we change VatC's "Carol" c-list entry to point at the old Bob kref, we should probably orphan the old Carol kref, and hope that the involved vats will drop it shortly.

As before, this would be easier if VatC is not really up and running yet, like if it's in some state where the replacement objects are ready to go, but it's holding off on talking to any other vat until its parent gives the all-clear.
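The bookkeeping for this export-side remapping can be sketched with a toy kernel object table and per-vat c-lists. Everything here is an assumption made for illustration: the remapExport helper, the table layout, and the specific krefs and vrefs. It takes krefs directly, whereas the proposed remapExports API would start from the Bob and Carol Presences.

```javascript
// Toy kernel state; every name, kref, and vref below is hypothetical.
const kernelObjects = new Map([
  ['ko20', { owner: 'vatB' }], // Bob
  ['ko30', { owner: 'vatC' }], // Carol, as first exported by VatC
]);
// Per-vat c-lists, vref -> kref (the reverse direction is omitted for brevity).
const clists = {
  vatA: new Map([['o-2', 'ko20']]), // Alice's import of Bob
  vatB: new Map([['o+5', 'ko20']]), // Bob's export
  vatC: new Map([['o+1', 'ko30']]), // Carol's export
};

const findVref = (vatID, kref) => {
  for (const [vref, k] of clists[vatID]) {
    if (k === kref) return vref;
  }
  throw Error(`${vatID} has no c-list entry for ${kref}`);
};

const remapExport = (oldOwner, newOwner, oldKref, newKref) => {
  const oldEntry = kernelObjects.get(oldKref);
  const newEntry = kernelObjects.get(newKref);
  if (!oldEntry || oldEntry.owner !== oldOwner) {
    throw Error(`${oldKref} is not exported by ${oldOwner}`);
  }
  if (!newEntry || newEntry.owner !== newOwner) {
    throw Error(`${newKref} is not exported by ${newOwner}`);
  }
  const oldVref = findVref(oldOwner, oldKref);
  const newVref = findVref(newOwner, newKref);
  // Remove the old exporter's entry, so no object has two exporting vats.
  clists[oldOwner].delete(oldVref);
  // The old kref (the one every client holds) now belongs to the new vat.
  oldEntry.owner = newOwner;
  clists[newOwner].set(newVref, oldKref);
  // The replacement's original kref is orphaned, in the hope that the
  // vats which saw it will drop it shortly.
  newEntry.owner = undefined;
  return { orphaned: newKref };
};

const result = remapExport('vatB', 'vatC', 'ko20', 'ko30');
console.log(result.orphaned, kernelObjects.get('ko20').owner);
```

Note that VatA's c-list is never touched, which is why this approach reaches clients the upgrading party doesn't know about.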

Approach 3: Commandeer At First Export

To avoid exposing the initial Carol kref, we could use a scheme that involves a special "Claim" object. The adminNodeVatB is asked for Claims on Bob and other exports (e.g. a ClaimBob for Bob), which are passed into VatC. Then, before Carol is exported for the first time, the code in VatC does something with liveslots to exercise the claim:

const startup = (claimBob, stuff) => {
  const carol = ...;
  vatPowers.claim(claimBob, carol);
};

The Claim might be represented as a special kernel object, where instead of an .owner we have a .isClaim entry, maybe pointing at the kref being claimed. When liveslots sees the call to claim(), it does a special syscall, maybe:

syscall.claim(claimVref, replacementVref);

This would be treated a bit like exporting replacementVref, as if we'd done e.g. syscall.send() with a methargs.slots=[replacementVref], except it wouldn't create a delivery. It would still create a new export c-list entry, but instead of the kernel allocating the next available kref, the kernel would:

This would be easiest if VatB were no longer running, but we could also have the kernel allocate a new kref for Bob's old c-list entry (one which nobody is currently referencing). And/or somebody (maybe the claim() call?) could be given a reference to the old object, which it would be free to use or to drop as it sees fit. (it feels like there's a swap() pattern lurking in here somewhere, that might make things better, but I haven't nailed it down yet).
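One possible reading of the claim mechanics, sketched as a toy model: the kernel reuses the claimed kref for the new export, hands ownership to the claiming vat, and moves the old exporter's c-list entry onto a fresh kref that nobody references. These steps are assumptions of the sketch, not a settled design; the syscallClaim helper, the single-use treatment of the Claim, and all krefs and vrefs are hypothetical.

```javascript
// Hypothetical kernel state for a claim; all krefs, vrefs, and field
// names are assumptions made for this sketch.
let nextKrefNumber = 40;
const kernelObjects = new Map([
  ['ko20', { owner: 'vatB' }], // Bob
  ['ko21', { isClaim: 'ko20' }], // ClaimBob: no .owner, points at the claimed kref
]);
const clists = {
  vatB: new Map([['o+5', 'ko20']]), // Bob's export, vref -> kref
  vatC: new Map([['o-1', 'ko21']]), // VatC's import of the Claim
};

const syscallClaim = (vatID, claimVref, replacementVref) => {
  const clist = clists[vatID];
  const claimKref = clist.get(claimVref);
  const claim = kernelObjects.get(claimKref);
  if (!claim || !claim.isClaim) {
    throw Error(`${claimVref} is not a Claim`);
  }
  if (clist.has(replacementVref)) {
    throw Error(`${replacementVref} was already exported`);
  }
  const claimedKref = claim.isClaim;
  const target = kernelObjects.get(claimedKref);
  const oldOwner = target.owner;
  // Give the old exporter's entry a fresh kref that nobody references.
  for (const [vref, kref] of clists[oldOwner]) {
    if (kref === claimedKref) {
      const fresh = `ko${nextKrefNumber}`;
      nextKrefNumber += 1;
      kernelObjects.set(fresh, { owner: oldOwner });
      clists[oldOwner].set(vref, fresh);
    }
  }
  // Export the replacement under the claimed kref instead of a new one.
  target.owner = vatID;
  clist.set(replacementVref, claimedKref);
  // Assumption: a Claim is single-use, so consume it here.
  kernelObjects.delete(claimKref);
  clist.delete(claimVref);
  return claimedKref;
};

const claimed = syscallClaim('vatC', 'o-1', 'o+1');
console.log(claimed, clists.vatC.get('o+1'));
```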

warner commented 2 weeks ago

Promises

We might stop there, and say that this upgrade/replacement process only works for objects, but not promises. This would certainly be easier. Doing that would make little attempt to hide the upgrade trauma: operations in-process during the replacement would be visible to clients (their outstanding promises would be disconnected, and given what "upgrade-naïve" has meant so far, they would treat the disconnection as a failure).

But we could also find a way to remap any Promises that VatB is currently a decider on, and let VatC take them over.

From an authority point of view, we can pretend that VatB has carefully retained fulfill/reject controls over every Promise it has ever exported, and has kept them in a table indexed by the Promise, so that if we were to do a real upgrade of VatB, the new version would have full access to those resolvers. (We also must pretend that we'd implemented virtual/durable promises such that the resolvers could be held in durable storage). That gives adminNodeVatB the right to give away resolution authority over its Promises, assuming the vat hadn't done so already (by including it as a message result or something).

We can imagine a claimPromiseB = E(adminNodeVatB).makeClaim(promiseB), which we send into VatC somehow. It would probably be easiest to use this if VatC receives a full PromiseKit (with { promise, resolve, reject }), so it can stash or wire up the resolvers as it sees fit.

I'm toying with the idea of a new vref category, strawman is p++NN, which means "the kernel is giving you a promise and its resolution authority at the same time". Liveslots would deserialize that into a PromiseKit. When delivered, the kernel would add a c-list entry like a normal exported vat-decided promise, except it would re-use the claimed kpid (promise kref) from VatB, and change the .decider to point at VatC, and do something to the VatB c-list to guard against it issuing a syscall.resolve() at some point (assuming VatB is still running, probably better if it isn't).
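The decider handoff can be sketched with a toy kernel promise table. The sketch guards against a late resolution from VatB simply by checking the .decider field, which is only one of several ways the guard might work; transferDecider, syscallResolve, and the kpid kp40 are hypothetical.

```javascript
// Toy kernel promise table; the kpid and helper names are hypothetical.
const kernelPromises = new Map([
  ['kp40', { state: 'unresolved', decider: 'vatB' }],
]);

// Only the current decider may resolve an unresolved promise.
const syscallResolve = (vatID, kpid, value) => {
  const p = kernelPromises.get(kpid);
  if (!p || p.state !== 'unresolved') {
    throw Error(`${kpid} is not an unresolved promise`);
  }
  if (p.decider !== vatID) {
    throw Error(`${vatID} is not the decider of ${kpid}`);
  }
  p.state = 'fulfilled';
  p.value = value;
  p.decider = undefined;
};

// Re-use the existing kpid and point its .decider at the replacement vat.
const transferDecider = (kpid, fromVat, toVat) => {
  const p = kernelPromises.get(kpid);
  if (!p || p.state !== 'unresolved' || p.decider !== fromVat) {
    throw Error(`${fromVat} cannot give away ${kpid}`);
  }
  p.decider = toVat;
};

transferDecider('kp40', 'vatB', 'vatC');

let lateResolveRejected = false;
try {
  syscallResolve('vatB', 'kp40', 'stale answer'); // VatB is still running
} catch (e) {
  lateResolveRejected = true;
}
syscallResolve('vatC', 'kp40', 'answer from VatC');
console.log(lateResolveRejected, kernelPromises.get('kp40').value);
```

Because the kpid is unchanged, clients subscribed to the promise would see an ordinary resolution rather than a disconnection.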

warner commented 2 weeks ago

Upgrade Vat Must Know More

All of this points at a general principle: if VatC wants to take over for VatB, it needs to know more, and be prepared to do more, than VatB did. The #1691 approach has the new vat being so clever that it can precisely emulate the old vat up until the big reveal. The normal upgradeVat() approach requires a cooperating old vat, which leaves state in the right places, and can allow mostly-the-same source code to be used in both versions as long as you plan ahead well enough.

These "remapping" approaches don't require the old vat to have planned ahead, but do require the new vat to know enough about the old vat's operations that it knows what to do with each Claim, or knows which replacement objects to give to the parent (so it can pass them through to the AdminNode). In that sense it requires more work of, and coordination between, both the parent vat (requesting the upgrade) and the replacement vat.

It also requires that all the relevant pieces (objects to be replaced, promises to be taken over) are available to the parent vat. Somehow it must be involved enough in the interactions between old service vat and client vat to have grabbed a copy of Bob and other things-to-be-replaced. If the only such things are public facets and ZCF facets (low cardinality, generally created at startup time, rather than per-invocation), this may be easy. But if we want to reduce the upgrade trauma and allow in-progress operations to be unaffected, the parent vat will need access to the objects used by those in-process operations too, and that might be too invasive.

warner commented 1 week ago

@dtribble brought up a more evil level of hack: something like controller.remapImport(vatID, oldKref, newKref). To make this slightly more principled / easier-to-audit, it could be remapImport(vatID, oldName, newName), coupled with a new syscall.registerName(name, vref) which stores the matching krefs in some table, and some cosmic-swingset code that can make specific controller.method() calls in response to governance proposals (without a chain-halting upgrade). Then we do one upgrade/core-eval where we create a new vat, send it both Bob and Carol, and have it register those names in the table, then do a second proposal which uses those names on the controller API to effect the handoff.

That would trade the cost of adding vat-admin APIs for the cost of adding kernel APIs, a new liveslots thing, and writing the short-lived vat whose only job is to record the right objects. Depending upon how we exposed the new syscall to the new vat (vatPowers.registerObject(name, presence)?), it may or may not be safe to keep it around into the MN-3 world, because we'd be introducing a race: a malicious vat could register the same name, in an attempt to commandeer the old kref for itself, before our own proposal could register the right one.
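A minimal sketch of the name table and a name-based remap, to show where the race lives. It assumes the first registration of a name wins and later ones are rejected, which makes a hijack attempt visible but does not prevent it. The registerName and remapImportByName helpers, the vat ID v7, and the krefs are all hypothetical.

```javascript
// Hypothetical kernel-side name table plus a controller-style remap call.
const names = new Map(); // name -> kref
const clists = {
  v7: new Map([['o-2', 'ko20']]), // the client's import of Bob, vref -> kref
};

// Roughly what a registerName syscall might record once the kernel has
// translated the caller's vref into a kref. First registration wins.
const registerName = (name, kref) => {
  if (names.has(name)) {
    throw Error(`name "${name}" is already registered`);
  }
  names.set(name, kref);
};

const remapImportByName = (vatID, oldName, newName) => {
  const oldKref = names.get(oldName);
  const newKref = names.get(newName);
  if (oldKref === undefined || newKref === undefined) {
    throw Error('both names must be registered first');
  }
  const clist = clists[vatID];
  if (!clist) {
    throw Error(`unknown vat ${vatID}`);
  }
  if ([...clist.values()].includes(newKref)) {
    throw Error(`${newKref} is already in the c-list of ${vatID}`);
  }
  for (const [vref, kref] of clist) {
    if (kref === oldKref) {
      clist.set(vref, newKref);
      return vref;
    }
  }
  throw Error(`${vatID} does not import ${oldKref}`);
};

// First proposal: the short-lived vat records both objects by name.
registerName('bob', 'ko20');
registerName('carol', 'ko30');
// Second proposal: the controller performs the handoff using the names.
const remappedVref = remapImportByName('v7', 'bob', 'carol');
console.log(remappedVref, clists.v7.get('o-2'));
```

If a malicious vat had called registerName('carol', ...) first, the honest registration would throw here, so the second proposal would need some way to confirm which vat registered each name.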

We could also imagine a special logging feature, to which vats could send Presences, and the kernel would emit their krefs. Then we could have one upgrade/whatever which logged both Bob and Carol, and then we tell all the validators to look at their logs and write down the krefs, and then compare them against the body of the second proposal (which, when executed, calls controller.remapImport(vatID, oldKref, newKref)).

warner commented 1 week ago

Use cases:

Just today, on mainnet, we launched a new auctioneer vat, to fix some problems, leaving the original auction vat running. The next scheduled update will launch new price-feed vats, and will also launch a new (third) auctioneer vat, not because it contains significantly different code, but because each auctioneer samples the price-feed registry just once, at startup, and we want that last auctioneer to use the new price-feed vats. So the relationship between price-feed vats and the auctioneers who use them will be:

auctioneer   price-feed
1st          1st
2nd          1st
3rd          2nd

If we could remap imports, then instead of this upcoming proposal starting a third auctioneer vat, it could instruct the kernel to remap the imports of the second auctioneer vat to point at the new (2nd) price-feed vats. We wouldn't know those krefs until the price-feed vats had launched, which is why we'd either need to log them (and only then build the remapping proposal), or somehow register them by name.

(In general, I think our existing plan to launch a new auctioneer is the simplest and most robust, as it requires no new code, and we've now done it once already)