warner closed this 3 years ago
Well-stated!
> might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus.
That's correct. With the current problem, the kernel appears to have already committed the state that said the vat was terminated before we actually have a noticeable behaviour divergence, rather than just crashing the kernel immediately and not committing that state.
This fix also needs to change `deliver()` in `manager-helper.js`, which catches all errors in `deliverToWorker` and replaces them with a `['error', err.message, null]` `VatDeliveryResult`. We need a plan.
One option is to define `deliver()` to do one of three things:

- return `['ok', null, meterUsage]`: happy
- return `['error', problem, null]`: consensus unhappy (metering fault): terminate vat, continue with kernel
- throw an Error: unknown worker error (not a consensus fault): halt the kernel

Another option is to only use `return`, but examine `problem` to distinguish between the last two cases. I don't like the idea of parsing a string to make that distinction.

I'm in favor of the first option. Any other opinions?
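A minimal sketch of what the first option could look like, to make the three outcomes concrete. `MeterFaultError` and `makeDeliver` are hypothetical names invented for this illustration, and the sketch assumes `deliverToWorker` resolves with the meter usage and can signal a metering fault with a recognizable error type; the real `manager-helper.js` may be shaped quite differently.

```javascript
// Hypothetical marker for the one "known" failure mode (a metering fault).
// This class is an assumption for the sketch, not an existing swingset type.
class MeterFaultError extends Error {}

// Wraps a deliverToWorker function so that deliver() has three outcomes:
//   resolves ['ok', null, meterUsage]  -> happy
//   resolves ['error', problem, null]  -> metering fault, terminate the vat
//   rejects                            -> unknown worker error, halt the kernel
function makeDeliver(deliverToWorker) {
  return async function deliver(delivery) {
    try {
      const meterUsage = await deliverToWorker(delivery);
      return ['ok', null, meterUsage];
    } catch (err) {
      if (err instanceof MeterFaultError) {
        // Known, deterministic failure: report it as a delivery result.
        return ['error', err.message, null];
      }
      // Anything else is not known to be in consensus, so do not turn it
      // into a result; let the rejection travel up to the caller.
      throw err;
    }
  };
}
```

The point of checking the error type rather than the message is to avoid the string parsing that the second option would need.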
If a vat worker subprocess exits unexpectedly, our kernel does not know the state of the vat: the worker might have died because of something the vat did, or because of something outside swingset (maybe the host computer is being rebooted and all processes are being killed, in some random order). #2958 is about having a policy to react to an unexpected worker termination.
If we're in "consensus mode", we must crash the kernel: we do not know why the worker terminated, so we don't know that it's also being terminated on all other validators. In particular, metering faults are a distinct "known" form of worker termination. We need to be able to distinguish between a metering fault and some other random error.
`manager-subprocess-xsnap.js` has a `catch` inside `deliverToWorker` that conflates these cases.

I'm thinking that the non-meter-fault errors should propagate an Error upwards (i.e. don't `catch` the default case). We can use a non-rejecting `deliverToWorker` return promise with a value of `['error'..]` to mean "consensus metering fault", and a yes-rejecting return promise (which should then carry all the way back to `controller.run()`) to mean "unknown worker error, halting the kernel".

I think we can get away without fixing this for the stress-test phase this week, but the danger is that something which happens to kill a worker process might trick that one validator into thinking the vat has had a consensus metering fault, and once that validator commits the results, it won't be able to rejoin consensus. (I think. @michaelfig has investigated this more than I have.)
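To illustrate why the distinction matters on the kernel side, here is a rough sketch of a caller that treats the two failure signals differently. Everything here is hypothetical: `processDelivery`, `kernel.terminateVat`, and `kernel.commit` are stand-in names for this example, not the actual swingset kernel API.

```javascript
// Hypothetical kernel-side handling of a single delivery. The kernel
// object's methods are stand-ins invented for this sketch.
async function processDelivery(kernel, deliver, vatID, delivery) {
  // Deliberately no try/catch: if deliver() rejects ("unknown worker
  // error"), the rejection propagates to our caller and the commit()
  // below is never reached, so no bogus "vat terminated" state is saved.
  const [status, problem] = await deliver(delivery);
  if (status === 'error') {
    // Consensus metering fault: every validator should observe the same
    // fault, so terminating the vat and continuing stays in agreement.
    kernel.terminateVat(vatID, problem);
  }
  kernel.commit();
  return status;
}
```

With this shape, a worker process that dies for some host-specific reason halts that one kernel before anything is committed, instead of recording a vat termination that the other validators never saw.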