Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

add vat "consensus mode": ignore console logs, halt on worker failure #2519

Open warner opened 3 years ago

warner commented 3 years ago

What is the Problem Being Solved?

Vats running in a chain must behave the same no matter which validator they run on. Validators might be configured in slightly different ways, and those differences must not affect vat-observable behavior. Anything a vat can sense that distinguishes one configuration from another is a source of non-determinism, which could lead to a chain fault (validators getting slashed through no fault of their own or, in the worst case, a chain halt).

From an ocap perspective, debug logs are generally treated as write-only, so they can safely be made ambiently available without granting additional authority to the confined code. The authority we usually worry about most is an inadvertent communication channel between unrelated objects: if the debug log had any sort of observable state (e.g. a mutable severity threshold), two unrelated objects could use it to communicate.

But from a determinism perspective, write-only is not enough. A logging implementation that sometimes serializes the arguments (to record to disk), and sometimes does not (because a severity threshold was lowered to investigate a problem), will interact with the arguments in different ways. The confined code can log a Proxy (or any object with a getter) to sense when the argument is read, and thus distinguish between these two configurations. The debug configuration is not part of the deterministic input to the vat, therefore this sensory input qualifies as a source of non-determinism.
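As a rough illustration of the sensing described above, here is a minimal sketch. The `makeLogger` function and its `recordToDisk` flag are invented for this example and are not part of any real console or swingset API; they only stand in for "a logger that sometimes serializes its arguments".

```javascript
// Hypothetical logger with two configurations (illustration only).
function makeLogger({ recordToDisk }) {
  const lines = [];
  return {
    log(...args) {
      if (recordToDisk) {
        // Serializing reads properties on each argument.
        lines.push(JSON.stringify(args));
      }
      // Otherwise the arguments are dropped without being examined.
    },
  };
}

// Confined code logs a Proxy that counts property reads, then checks
// whether the logger looked at it.
function probe(logger) {
  let reads = 0;
  const sensor = new Proxy(
    { key: 'value' },
    {
      get(target, prop, receiver) {
        reads += 1;
        return Reflect.get(target, prop, receiver);
      },
    },
  );
  logger.log(sensor);
  return reads > 0;
}

const sawSerializing = probe(makeLogger({ recordToDisk: true })); // true
const sawDiscarding = probe(makeLogger({ recordToDisk: false })); // false
```

The two results differ, so the confined code has learned something about the host's logging configuration even though the log is write-only.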

We (I think mostly @erights) concluded that vats running on-chain need to have their console configured to strictly discard all arguments without examining them at all: a complete black hole.

We know we'll be recording enough information to reconstruct the vat cranks offline, so debugging something strange that happens on-chain will involve replaying the delivery in a local debug environment. We can give this debug environment access to a console with normal functionality. Adversarial code can distinguish between the chain environment and the debug environment, but not between different validators in the chain environment.

Feature    Chain         Debug
engine     XS            Node.js
console    black hole    fully-functional

Description of the Design

I'm thinking that the kernel's runtimeOptions should acquire a consensusMode boolean. When set, the console provided to vat workers is a black hole.
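A minimal sketch of what such a black-hole console could look like, assuming the option is simply consulted when the vat's console is built. The names `makeBlackHoleConsole` and `makeVatConsole`, and the list of console methods, are hypothetical and not existing swingset functions.

```javascript
// Hypothetical list of methods a vat console exposes (an assumption).
const CONSOLE_METHODS = ['debug', 'log', 'info', 'warn', 'error'];

// Build a console whose methods accept anything and examine nothing.
function makeBlackHoleConsole() {
  const blackHole = {};
  for (const name of CONSOLE_METHODS) {
    // The function declares no parameters, so nothing passed in is
    // read, serialized, or retained.
    blackHole[name] = () => undefined;
  }
  return Object.freeze(blackHole);
}

// Hypothetical selection logic keyed on the proposed consensusMode flag.
function makeVatConsole(runtimeOptions, realConsole) {
  return runtimeOptions.consensusMode ? makeBlackHoleConsole() : realConsole;
}
```

Freezing the object keeps the console itself free of mutable state, which matches the ocap concern about logs becoming a communication channel.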

Security Considerations

This impacts the determinism of the swingset environment: we need a solution to prevent supposedly-confined adversarial user-level code from causing a chain fault.

Test Plan

We should have a test vat which does something like:

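let reads = 0;
// Sensor object: the getter counts every read of `key`.
const sensor = Object.create(Object.prototype, {
  key: {
    enumerable: true, // so serializers that walk enumerable keys reach it
    get() {
      reads++;
      return 'value';
    },
  },
});
console.log(sensor);
if (reads) {
  fail();
}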

and make sure it doesn't fail() when run with runtimeOptions: { consensusMode: true }.

warner commented 3 years ago

Another behavior that should be controlled by "consensus mode": if a child vat-worker process (e.g. xsnap) dies, the kernel should halt (without committing any state changes).

In this situation, we don't know why the child died. Perhaps the host computer is being shut down, and processes are being sent SIGINT one at a time, and the worker was killed before the parent/kernel process. Perhaps the crank being executed triggered a bug (an invariant failure) in the JS engine. Or maybe it consumed resources beyond what was available, but in a way that wasn't caught by the metering code.

Since we don't know that the failure was a deterministic function of the vat inputs, we cannot afford to behave differently than any other validator. By halting without committing any state, we avoid committing to any particular interpretation of the worker failure. The kernel process will be running under some sort of "keep it running" supervisor loop (systemd or whatever the local operator prefers). If it's not the whole computer shutting down, this supervisor will see the process exit (with an error code) and restart it. If it fails again, the supervisor will restart it again, more slowly. Eventually it will probably give up. At some point the operator will be notified and can then contact other validator operators to see if it's happening to everybody. A common-mode failure will show up at this point, which will necessitate a coordinated software upgrade of some sort.
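The halt-without-commit behavior might be sketched roughly as follows. Everything here is hypothetical: `WorkerDiedError`, `runCrank`, and the injected `deliver`, `commit`, `abort`, and `halt` hooks are stand-ins invented for illustration, not real kernel functions. In a real process, `halt` would presumably exit with a non-zero code so the supervisor can restart it.

```javascript
// Hypothetical error type signalling that the worker process died.
class WorkerDiedError extends Error {}

// Run one crank. `hooks` supplies stand-ins for the real kernel operations.
function runCrank(options, hooks) {
  const { deliver, commit, abort, halt } = hooks;
  try {
    deliver();
  } catch (err) {
    // Only worker death is handled here; other errors propagate.
    if (!(err instanceof WorkerDiedError)) throw err;
    // Discard any state changes made during the failed crank.
    abort();
    if (options.consensusMode) {
      // Stop without choosing any interpretation of the failure.
      halt(err);
      return 'halted';
    }
    // Outside consensus mode the caller decides what to do next.
    return 'worker-died';
  }
  commit();
  return 'committed';
}
```

The key property is ordering: `commit` is never reached once the worker has died, so no validator records an outcome that others might not share.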

If we aren't in consensus mode, we have more options. A solo machine could choose to re-start the worker from an earlier state and try again. It could also choose to terminate the vat and allow the vat's creator to deal with it. Or it could halt the machine and let the operator do something. I don't yet have a theory about which approach is better, but I'm inclined to have it restart the worker at least a few times. I certainly don't want a race during host reboot to cause a vat to be terminated.