vat upgrade: the retroactive time-travelling manchurian candidate sleeper agent protocol

warner commented 4 years ago

What is the Problem Being Solved?

One blocker for mainnet is confidence that we can upgrade today's vats to tomorrow's code. This requirement is made more exciting by:

chains: the code must behave consistently and deterministically on all replicas
schema-less orthogonal persistence: when state is kept in JavaScript Maps/WeakMaps instead of a database, we can't just write a new program and give it the old data

If we can plan ahead far enough to keep all necessary data in a database-like table, with a well-defined schema, then we've got more options (#3062), but we might not pull that off (we might have to upgrade vats which don't do that).

So the question is: given a vat which didn't plan ahead for upgrade, which has several months of history and accumulated state on the heap (intricately entwined with the objects that provide its behavior), how do we replace the code with something that behaves better, while preserving the state?

Cutover

Since our chain's history is fixed, whatever changes we make to upgrade version 1 to version 2 cannot take visible effect until some activating event. This cutover will be triggered by some new message (probably to the vat's root object) which has never been seen before.

The history up until the cutover point is consistent with the behavior of version 1. Any vat running version 1 and receiving the recorded sequence of input messages will produce the recorded sequence of output messages.

The history up to that point is also consistent with the behavior of a version 2 vat that has not yet seen the trigger message. From an outside observer's point of view (seeing only the output messages), the version 2 code behaves just like version 1 did. It might be doing something different internally, like building up data structures that will ease the cutover, but none of the external messages will reveal that fact. Then, when the trigger is received, the vat is free to behave in some new way.

So our upgrade tool is to replace the original vat's code with version 2, which has been carefully designed to mimic version 1 until the trigger message is received. Then we replay the entire transcript to bring the vat back up to the current block. Eventually (we can wait as long as we want) we send in the trigger event, and cutover happens.
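A minimal sketch of that replay-then-trigger flow, using a toy counter in place of a real vat. Everything here (`makeCounterV1`, `makeCounterV2`, `dispatch`, the `activate` method name, the transcript shape) is made up for illustration and is not the swingset API.

```javascript
// Version 1: a toy "vat" whose whole state is one counter on the heap.
function makeCounterV1() {
  let count = 0;
  return {
    dispatch(msg) {
      if (msg.method === 'increment') {
        count += 1;
        return `count=${count}`;
      }
      return 'error: unknown method';
    },
  };
}

// Version 2: runs a model of version 1 and echoes its outputs until the
// (hypothetical) 'activate' trigger arrives, then switches behavior.
function makeCounterV2() {
  const model = makeCounterV1();
  let activated = false;
  let count = 0; // shadow copy of the state, kept for use after cutover
  return {
    dispatch(msg) {
      if (activated) {
        if (msg.method === 'increment') {
          count += 2; // new behavior: count by twos
          return `count=${count}`;
        }
        return 'error: unknown method';
      }
      if (msg.method === 'activate') {
        activated = true;
        return 'activated';
      }
      if (msg.method === 'increment') {
        count += 1; // invisible from outside
      }
      return model.dispatch(msg);
    },
  };
}

// Record a transcript by running a vat over a list of inputs.
function recordTranscript(vat, inputs) {
  return inputs.map(input => ({ input, output: vat.dispatch(input) }));
}

// Replay a transcript into a vat, failing on any visible divergence.
function replay(vat, transcript) {
  for (const { input, output } of transcript) {
    const actual = vat.dispatch(input);
    if (actual !== output) {
      throw new Error(`replay diverged: expected ${output}, got ${actual}`);
    }
  }
  return vat;
}

const transcript = recordTranscript(makeCounterV1(), [
  { method: 'increment' },
  { method: 'increment' },
  { method: 'bogus' },
]);
const v2 = replay(makeCounterV2(), transcript); // no divergence so far
const activateResult = v2.dispatch({ method: 'activate' });
const afterCutover = v2.dispatch({ method: 'increment' });
```

The replay only succeeds because the recorded transcript never contains the trigger message. If it did, version 1's recorded reply would differ from version 2's.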

Shadow State

The version 2 vat code might construct a parallel set of objects to prepare for the cutover. These objects weren't in the version 1 code because they weren't necessary for its main functionality.

For example, a Mint uses a WeakMap to expose Purse objects to external vats (we don't have inter-vat GC yet, but when we do, we definitely want unreferenced Purses to be GCed). However, you can't iterate through a WeakMap. So our version 2 code might also construct a Set of Purses. If our cutover event requires converting each old Purse into some new object (perhaps creating corresponding Purses in some new vat), this Set would give the cutover code something to iterate through. The version 2 code would be a copy of the version 1 code, plus a few lines that add each new Purse to the shadow Set, plus the code that activates at cutover (and reads from the Set).
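A rough sketch of what that might look like, assuming a much-simplified Mint. The names `makeMintV2`, `shadowPurses`, `activate`, and the `convert` callback are hypothetical, and the real Mint code is more involved.

```javascript
function makeMintV2() {
  // Same as version 1: balances live in a WeakMap keyed by Purse.
  const balances = new WeakMap();
  // New in version 2: an iterable record of every Purse ever created.
  // Note that a Set holds its members strongly, so unlike the WeakMap
  // it keeps every Purse alive until it is cleared.
  const shadowPurses = new Set();

  function makePurse(initialBalance = 0) {
    const purse = Object.freeze({
      getBalance: () => balances.get(purse),
      deposit: amount => {
        balances.set(purse, balances.get(purse) + amount);
      },
    });
    balances.set(purse, initialBalance);
    shadowPurses.add(purse); // the one extra line on the version 1 path
    return purse;
  }

  // Cutover code: only reachable through the trigger message.
  function activate(convert) {
    const converted = [];
    for (const purse of shadowPurses) {
      converted.push(convert(purse, balances.get(purse)));
    }
    shadowPurses.clear();
    return converted;
  }

  return { makePurse, activate };
}

const mint = makeMintV2();
const alicePurse = mint.makePurse(10);
const bobPurse = mint.makePurse(5);
alicePurse.deposit(3);
// At cutover, turn each old Purse into some new representation.
const migrated = mint.activate((_purse, balance) => ({ balance }));
```

Before `activate` is called, the Set has no effect on any externally visible behavior, which is what lets this code replay the version 1 transcript.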

This leads to a colorful analogy. Suppose you want a raise, everyone knows you've been asking for years, but your boss (who is a public figure and appears in newscasts all the time) won't give you one. Also suppose that you have a time machine but you can't change recorded history. Finally, suppose you can build an undetectably human-looking cyborg. So you train your cyborg on all those newscasts, go back in time, and swap out your boss (sending them on a nice long multi-year holiday in Tahiti, because while you might be/have a time-travelling cyborg, you're no monster). The cyborg then spends the next N years behaving just like your boss did (including denying you that raise), so all the newscasts retain their accuracy. Then, back in the present, you say the secret trigger phrase (which is "Please", of course), and the cyborg boss has a change of heart, finally giving you that raise.

On this second timeline, the cyborg has been quietly running a model of the original boss the whole time, using the model's actions to drive its external behavior. Meanwhile some secondary code (the sleeper agent) follows everything it does, learning from it, preparing to take over. The trigger message activates the sleeper agent, freeing it from the obligation to mimic the original.

Nobody else can prove any inconsistency: they don't have a time machine, so they can't go back and say your secret trigger phrase earlier, which would have revealed the divergence. The past is done. As far as they can prove, the boss only denied those raises because you had never asked politely enough before.

Ephemeral Shadow Table

The extra state managed by the sleeper agent might be too large to comfortably keep in RAM. We might want to give vats the ability to use secondary storage which is not checkpointed or made durable against chain restarts, for limited periods of time, to support cutover. This might work in conjunction with something like the "hierarchical identifiers" (#455), so that liveslots can help map the objects being managed to the object IDs (o+NN) being exported into the kernel.

In the Mint example above, the Set of purses might be large. Rather than holding them in memory during the second timeline, the vat might just save a table (mapping exported object ID to balance) into this temporary database. Then, when the cutover event happens, the transfer code iterates through the database and upgrades each Purse separately. Finally, the old vat is deleted, releasing the secondary storage.

If this secondary storage were managed through the #455 approach, the syscalls used to store the data would not be consistent with the original timeline's transcript.

This only works if we give up both the durability of this temporary table and the ability to save/reload the vat during the second timeline. Each chain node must be able to run without interruption from the time we restart the chain with the new agent, to the time cutover happens and we stop using the table.
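As a sketch of the table-based variant of the Mint example: here an in-memory `Map` stands in for the non-durable secondary storage, and a local counter stands in for the `o+NN` export IDs that liveslots would really assign. `makeEphemeralTable`, `makeMintWithTable`, and `cutover` are all hypothetical names.

```javascript
// Stand-in for temporary, non-checkpointed secondary storage.
function makeEphemeralTable() {
  const rows = new Map();
  return {
    set: (key, value) => {
      rows.set(key, value);
    },
    entries: () => rows.entries(),
    drop: () => rows.clear(),
    size: () => rows.size,
  };
}

function makeMintWithTable(table) {
  const balances = new WeakMap(); // as in version 1
  const exportIDs = new WeakMap(); // purse -> made-up "o+NN" string
  let nextID = 1;

  // Keep the table row for this purse in sync with its balance.
  function record(purse) {
    table.set(exportIDs.get(purse), balances.get(purse));
  }

  function makePurse(initialBalance = 0) {
    const purse = Object.freeze({
      getBalance: () => balances.get(purse),
      deposit: amount => {
        balances.set(purse, balances.get(purse) + amount);
        record(purse);
      },
    });
    exportIDs.set(purse, `o+${nextID}`);
    nextID += 1;
    balances.set(purse, initialBalance);
    record(purse);
    return purse;
  }

  // Cutover: walk the table, upgrade each row, then release the storage.
  function cutover(upgradeOne) {
    const upgraded = [];
    for (const [id, balance] of table.entries()) {
      upgraded.push(upgradeOne(id, balance));
    }
    table.drop();
    return upgraded;
  }

  return { makePurse, cutover };
}

const table = makeEphemeralTable();
const tableMint = makeMintWithTable(table);
const purseA = tableMint.makePurse(10);
tableMint.makePurse(5);
purseA.deposit(3);
const upgradedRows = tableMint.cutover((id, balance) => `${id}:${balance}`);
```

Unlike the shadow Set, the table holds only IDs and balances, so it does not keep the Purse objects themselves alive.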

Vat Transfer

The cutover code might need to move ownership of objects from the old vat to some new one. See #1692 for the beginnings of a design.

dckc commented 2 years ago

In discussion of upgrading from the Compartment loader shim to a native XS Compartment loader (or even evaluator as in #2480), we discussed upgrade, and @erights pointed out that this replay protocol is promising. I asked about metering. @erights said there are plans to accept / believe the original metering observations during replay.

That seems like it should work.

p.s. @FUDCo notes that even during replay, some metering should be used to prevent runaway computation.
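One way to read those two points together, sketched with made-up names: the replay reports the usage recorded in the transcript rather than what the new code actually consumed, but still aborts any delivery that exceeds a generous ceiling. The `recordedUsage` field, the `runDelivery` callback, and the `hardLimit` policy are assumptions for illustration, not the actual metering design.

```javascript
// runDelivery(input) is a hypothetical callback that executes one delivery
// in the new code and returns { used }, the work it actually consumed.
function replayWithMetering(runDelivery, transcript, hardLimit) {
  const charged = [];
  for (const entry of transcript) {
    const { used } = runDelivery(entry.input);
    if (used > hardLimit) {
      // Safety net only: protects against runaway computation.
      throw new Error(`runaway delivery: used ${used}, limit ${hardLimit}`);
    }
    // Believe the original observation, not the replayed one.
    charged.push(entry.recordedUsage);
  }
  return charged;
}

const meteredTranscript = [
  { input: 'a', recordedUsage: 10 },
  { input: 'b', recordedUsage: 20 },
];
// The new code uses a bit more than the original did on each delivery.
const chargedUsage = replayWithMetering(
  input => ({ used: input === 'a' ? 15 : 25 }),
  meteredTranscript,
  1000,
);
```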

warner commented 2 years ago

Last week we talked about GC determinism of the sleeper agent.

In general we don't assert anything about GC syscalls during transcript replay. We tolerate the replay performing them at different times than the original, but my analysis assumed that the replay would mostly be the same: something might get dropped earlier or later than the original, but userspace still behaved the same way, so REACHABLE vs anything else was still the same.

This imposes a limitation on the sleeper agent. If the original imported some Presence, used it for a while, then dropped it, the kernel will have observed the original vat do a syscall.dropImport/syscall.retireImport, and will have propagated the decref to the upstream vat, which might have caused the original Remotable to be deallocated. In this case, the sleeper agent can't successfully hold on to it any longer than the original did (e.g. to pass it through baggage to the successor, so the successor can do more calls to it).

This could be a big deal, or a minor one, I'm not sure. It might help to have the upgrade process send a bunch of important references along with the activate() message, so the vat will have legitimate/up-to-date access to those objects that it might have kept around since the initial setup phase.
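A toy sketch of that idea, with hypothetical names throughout (`makeSleeperAgent`, `useOnce`, `pingStored`, and the shape of the `activate()` argument): before cutover the agent lets go of references exactly as the original did, and afterwards it relies only on references delivered with the trigger.

```javascript
function makeSleeperAgent() {
  let activated = false;
  let refs = {}; // populated only by activate()

  return {
    // Pre-cutover: use the presence and do not retain it, as the original did.
    useOnce(presence) {
      return presence.ping();
    },
    // Cutover: the upgrade process hands over fresh references.
    activate(freshRefs) {
      refs = { ...freshRefs };
      activated = true;
      return Object.keys(refs).sort();
    },
    // Post-cutover: only references received at activation are available.
    pingStored(name) {
      if (!activated) {
        throw new Error('not activated yet');
      }
      return refs[name].ping();
    },
  };
}

const fakeTimer = { ping: () => 'pong' }; // stands in for a Presence
const agent = makeSleeperAgent();
const earlyResult = agent.useOnce(fakeTimer);
const receivedNames = agent.activate({ timer: fakeTimer });
const lateResult = agent.pingStored('timer');
```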
