Open warner opened 4 years ago
assigning to Dean to let him explain this better
@dtribble hasn't explained this better, and #3528 explores the notion of pausing vats (for metering faults, rather than kernel bugs). So I'm going to close this.
@dtribble explained more to us, and there are multiple components.
Imagine you're a validator, you're churning away on the chain, and then suddenly your node halts with a big ugly "kernel panic" error message. You chat with other validators and discover they're all having the same problem. Now, how do you proceed?
The first component is how to identify what went wrong. You collectively look at the #3742 flight recorder (slog data) and see that crank 7890 started delivery 1234 to vat56, but the kernel panicked before it finished. Vats aren't supposed to be able to make the kernel panic (modulo #4279), so this is by definition a kernel bug. Delivery 1234 was the immediate trigger: it performed a syscall that provoked a kernel invariant failure.
The second component is how to debug this. You'd like to get the kernel under a debugger as it handles that delivery, and poke around. So you'd like to be able to run a copy of the chain locally, with instructions to hit `debugger` just before that delivery begins. This is entirely outside of consensus. I think the way I'd approach this is to modify the kernel sources to check the crankNum at the top of `controller.run()`, or maybe the `(vatID, deliveryNum)` pair at the top of `deliverAndLogToVat`, and invoke `debugger` if they match some particular value. This approach is complicated by the fact that the kernel is bundled and stored in the database, so I'd have to use the "rekernelize" tool to make my kernel modifications. We might be able to make this less labor-intensive by adding an API, so that I could edit e.g. `controller.js` to invoke `kernel.invokeDebuggerBeforeCrank(7890)`. We could make this a controller API instead, so I would edit `cosmic-swingset/src/launch-chain.js` instead of `swingset/src/controller.js`.
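A minimal sketch of that "break before a specific crank" hook, assuming a kernel-side crank counter is available at the top of the delivery path; the helper names (`scheduleDebuggerBeforeCrank`, `maybeBreakBeforeCrank`) are invented for illustration and are not existing SwingSet APIs:

```js
// Hypothetical helpers, not existing SwingSet APIs: remember which crankNums
// should trigger a breakpoint, and check the set just before each crank runs.
const debugBreakCranks = new Set();

export function scheduleDebuggerBeforeCrank(crankNum) {
  debugBreakCranks.add(crankNum);
}

export function maybeBreakBeforeCrank(crankNum) {
  if (debugBreakCranks.has(crankNum)) {
    // With an inspector attached (e.g. `node --inspect-brk`), execution pauses
    // here, right before the suspect delivery is processed.
    debugger; // eslint-disable-line no-debugger
  }
}

// Usage sketch: call maybeBreakBeforeCrank(currentCrankNum) at the top of the
// per-crank loop (e.g. near deliverAndLogToVat), and arrange for
// scheduleDebuggerBeforeCrank(7890) to run during local startup.
```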
Now assume that the community has gone through this debugging process and understands the problem. At this point, they must decide on the best course of action. The committed history is currently all blocks up-to-but-excluding the one that contained the fatal delivery. That history includes all the cranks and deliveries from those blocks, and the contents of the run-queue from the end of that block. It's unlikely that we would modify history to resolve the problem, so there are a few likely approaches to pursue:

* skip the triggering delivery entirely
* terminate the offending vat instead of processing the triggering delivery
* replace the kernel bundle with a fixed one before re-executing the block
So the third component is: assuming the community has decided on one of these actions, how will the validators execute it? This is the most severe form of governance: validator software override. We can't really perform a governance vote because the chain has halted (although #4516 explores an alternative), so all validators are eagerly standing by to take recovery action. What do we tell them to type?
If the decision is to skip a particular delivery, or to kill a particular vat in lieu of processing a particular delivery, then we've got a pair of numbers to get into the kernel. We can add a `controller.scheduleSkip(crankNum)` (or `(vatID, deliveryNum)`) to tell the kernel "when you get to this delivery, skip it instead", or a `controller.scheduleVatTermination(vatID, deliveryNum)` to tell it "when you get to this delivery, terminate the vat instead". Or, if the decision is to replace the kernel bundle, some other `controller` API (#4375) would be used at a particular block height (the current block height).
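A rough sketch of what the scheduling side of such a `controller` API could record, assuming a simple in-memory table; for simplicity both methods are keyed by crankNum here, though a `(vatID, deliveryNum)` key would work the same way, and only the method names come from the proposal above:

```js
// Illustrative only: pending overrides keyed by crankNum.
const scheduledActions = new Map();

export function scheduleSkip(crankNum) {
  scheduledActions.set(crankNum, { action: 'skip' });
}

export function scheduleVatTermination(vatID, crankNum) {
  scheduledActions.set(crankNum, { action: 'terminate', vatID });
}

export function getScheduledAction(crankNum) {
  return scheduledActions.get(crankNum); // undefined means "deliver normally"
}
```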
Then we could introduce some sort of config file for cosmic-swingset: it would read skip/terminate/replace-kernel directives from that file and submit them to swingset. It would do this on each node restart, rather than being driven through transaction messages.
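A sketch of what that restart-time step might look like on the cosmic-swingset side; the file name, line format, and the `scheduleKernelReplacement` call are assumptions for illustration (the real #4375 API may differ), and the directives are keyed by crankNum for simplicity:

```js
import fs from 'fs';

// Read recovery directives (one per line, e.g. "crank-12345: terminate v12")
// on every node restart and hand them to the controller. Nothing here goes
// through consensus transactions.
export function applyRecoveryDirectives(controller, path = 'recovery-directives.txt') {
  if (!fs.existsSync(path)) {
    return; // no overrides configured
  }
  for (const line of fs.readFileSync(path, 'utf8').split('\n')) {
    const m = /^crank-(\d+):\s+(\S+)(?:\s+(\S+))?$/.exec(line.trim());
    if (!m) continue;
    const [, crankNum, action, arg] = m;
    if (action === 'skip') {
      controller.scheduleSkip(Number(crankNum));
    } else if (action === 'terminate') {
      // arg is the vatID; keyed by crankNum here, though the issue also
      // proposes a (vatID, deliveryNum) form
      controller.scheduleVatTermination(arg, Number(crankNum));
    } else if (action === 'replace-kernel') {
      controller.scheduleKernelReplacement(arg); // hypothetical API, see #4375
    }
  }
}
```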
If we had those pieces, then our instructions to validators would be to add a line like `crank-12345: terminate v12` to that config file. Their node would start up, resume executing transactions from the beginning of the most recent block, then swingset would get to the designated cranknum/deliverynum and perform the alternate action. If the action was to kill the vat, all validators would see the vat being killed (in consensus), and the kernel bug would not be triggered. If the action was to skip the delivery, all validators would skip the delivery, and the kernel bug would not be triggered. If the action was to replace the kernel, the validator would use the controller API to replace the kernel bundle before starting to execute the block, the triggering delivery would be allowed to go through, and the fixed kernel would not suffer the bug.
The code that skips a delivery based on a config file would look a lot like the code that calls `debugger` in the same situation, so they should share an implementation.
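To make the "share an implementation" point concrete, a sketch of a single per-delivery hook that handles the `debugger`, skip, and terminate cases from one lookup; the names and shapes are illustrative, not existing kernel code, and `getScheduledAction` is the lookup from the sketch above:

```js
// One hook, consulted once per crank, covering all three override actions.
// `deliver` and `terminateVat` stand in for the kernel's real operations.
export async function runCrankWithOverrides(
  crankNum,
  { getScheduledAction, deliver, terminateVat },
) {
  const override = getScheduledAction(crankNum);
  if (override && override.action === 'debugger') {
    debugger; // eslint-disable-line no-debugger
    // fall through: after poking around, deliver normally
  }
  if (override && override.action === 'skip') {
    return undefined; // drop the delivery instead of executing it
  }
  if (override && override.action === 'terminate') {
    return terminateVat(override.vatID); // kill the vat in lieu of the delivery
  }
  return deliver(); // normal path
}
```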
So the tasks are:

* `controller` APIs to schedule `debugger`, delivery skips, and vat terminations, at a particular crankNum or vatID+deliveryNum
* cosmic-swingset config-file support that invokes those `controller` APIs at each node startup
* maybe a kernel-replacement directive as well (e.g. `crank-12345: replace-kernel bundle-id-NNN` in the config-file syntax)
* a "`debugger` at delivery X" option; show it in the debugger
* a `skip` directive

There are extensions to this which we're not going to pursue right now. The simplest would be: if the kernel crashes during a particular delivery, automatically configure a "terminate vat before delivery NNN" and restart the validator.
That approach would maximize the chances that the chain proceeds forwards, but I think it would increase the chances of divergence and confusion. Death before confusion.
I'm sizing this as a 3: 1 for the kernel code that implements the API and performs the `debugger`/skip/terminate at the right time, 1 for the non-trivial unit tests to make sure it actually does that, and 1 for the cosmic-swingset-side config-file work.
@warner What does Michael need to do for this issue? Something related to config files?
At the meeting, @dtribble suggested:
I'm still confused by that suggestion. It might mean we should use a `debugger()` statement to pop out to a debugger, but then we aren't really terminating the vat, we're just adding a breakpoint that fires under some particular condition (e.g. we're about to replay transcript entry N). Maybe it implies a "pause vat" feature that we didn't talk about: instead of terminating the vat, we just want to not deliver messages to it for a while (but retain the option to resume delivering them again in the future). To implement this, I think we'd need to add a new "pause queue": each time we pull a message off the run-queue and see that it's destined for a paused vat, we append it to the pause-queue instead of delivering it.

I don't know what consequences these new message-ordering rules might have, nor where the authority to pause and resume a vat should be held. I'm pretty sure a paused vat should retain all its normal references, so paused vats are very different from terminated vats (which the kernel should be able to forget about utterly).
Originally posted by @warner in https://github.com/Agoric/agoric-sdk/issues/514#issuecomment-582687353