manager-subprocess-xsnap should distinguish metering faults from random process death

warner commented 3 years ago

If a vat worker subprocess exits unexpectedly, our kernel does not know the state of the vat: the worker might have died because of something the vat did, or because of something outside swingset (maybe the host computer is being rebooted and all processes are being killed, in some random order). #2958 is about having a policy to react to an unexpected worker termination.

If we're in "consensus mode", we must crash the kernel: we do not know why the worker terminated, so we don't know that it's also being terminated on all other validators. In particular, metering faults are a distinct "known" form of worker termination. We need to be able to distinguish between a metering fault and some other random error.

manager-subprocess-xsnap.js has a catch inside deliverToWorker that conflates these cases:

    async function deliverToWorker(delivery) {
      parentLog(vatID, `sending delivery`, delivery);
      let result;
      try {
        result = await issueTagged(['deliver', delivery]);
      } catch (err) {
        parentLog('issueTagged error:', err.code, err.message);
        let message;
        switch (err.code) {
          case ExitCode.E_TOO_MUCH_COMPUTATION:
            message = 'Compute meter exceeded';
            break;
          case ExitCode.E_STACK_OVERFLOW:
            message = 'Stack meter exceeded';
            break;
          case ExitCode.E_NOT_ENOUGH_MEMORY:
            message = 'Allocate meter exceeded';
            break;
          default:
            message = err.message;
        }
        return harden(['error', message, null]);
      }

I'm thinking that the non-meter-fault errors should propogate an Error upwards (i.e. don't catch the default case). We can use a non-rejecting deliverToWorker return promise with a value of ['error'..] to mean "consensus metering fault", and a yes-rejecting return promise (which should then carry all the way back to controller.run()) to mean "unknown worker error, halting the kernel".

I think we can get away without fixing this for the stress-test phase this week, but the danger is that something which happens to kill a worker process might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus. (I think. @michaelfig has investigated this more than me).

michaelfig commented 3 years ago

Well-stated!

might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus.

That's correct. With the current problem, the kernel appears to have already committed the state that said the vat was terminated before we actually have a noticeable behaviour divergence, rather than just crashing the kernel immediately and not committing that state.

warner commented 3 years ago

This fix also needs to change deliver() in manager-helper.js, which catches all errors in deliverToWorker and replaces them with a ['error', err.message, null] VatDeliveryResult. We need a plan.

One option is to define deliver() to do one of three things:

return ['ok', null, meterUsage]: happy
return ['error', problem, null]: consensus unhappy (metering fault): terminate vat, continue with kernel
reject: non-consunsus unhappy: halt kernel

Another option is to only use return, but examine problem to distinguish between the last two cases. I don't like the idea of parsing a string to make that distinction.

I'm in favor of the first option.. any other opinions?

Agoric / agoric-sdk

manager-subprocess-xsnap should distinguish metering faults from random process death #3394