Open warner opened 4 years ago
assigning to Dean to let him explain this better
@dtribble hasn't explained this better, and #3528 explores the notion of pausing vats (for metering faults, rather than kernel bugs). So I'm going to close this.
@dtribble explained more to us, and there are multiple components.
Imagine you're a validator, you're churning away on the chain, and then suddenly your node halts with a big ugly "kernel panic" error message. You chat with other validators and discover they're all having the same problem. Now, how do you proceed?
The first component is how to identify what went wrong. You collectively look at the #3742 flight recorder (slog data) and see that crank 7890 started delivery 1234 to vat56, but the kernel panicked before it finished. Vats aren't supposed to be able to make the kernel panic (modulo #4279), so this is by definition a kernel bug. Delivery 1234 was the immediate trigger: it performed a syscall that provoked a kernel invariant failure.
The second component is how to debug this. You'd like to get the kernel under a debugger as it handles that delivery, and poke around. So you'd like to be able to run a copy of the chain locally, with instructions to hit `debugger` just before that delivery begins. This is entirely outside of consensus. I think the way I'd approach this is to modify the kernel sources to check the crankNum at the top of `controller.run()`, or maybe the `(vatID, deliveryNum)` pair at the top of `deliverAndLogToVat`, and invoke `debugger` if they match some particular value. This approach is complicated by the fact that the kernel is bundled and stored in the database, so I'd have to use the "rekernelize" tool to make my kernel modifications. We might be able to make this less labor-intensive by adding an API, so that I could edit e.g. `controller.js` to invoke `kernel.invokeDebuggerBeforeCrank(7890)`. We could make this a controller API instead, so I would edit `cosmic-swingset/src/launch-chain.js` instead of `swingset/src/controller.js`.
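A minimal sketch of that "break before a specific crank" hook, assuming a kernel-side crank counter is available at the top of the delivery path; the helper names (`scheduleDebuggerBeforeCrank`, `maybeBreakBeforeCrank`) are invented for illustration and are not existing SwingSet APIs:

```js
// Hypothetical helpers, not existing SwingSet APIs: remember which crankNums
// should trigger a breakpoint, and check the set just before each crank runs.
const debugBreakCranks = new Set();

export function scheduleDebuggerBeforeCrank(crankNum) {
  debugBreakCranks.add(crankNum);
}

export function maybeBreakBeforeCrank(crankNum) {
  if (debugBreakCranks.has(crankNum)) {
    // With an inspector attached (e.g. `node --inspect-brk`), execution pauses
    // here, right before the suspect delivery is processed.
    debugger; // eslint-disable-line no-debugger
  }
}

// Usage sketch: call maybeBreakBeforeCrank(currentCrankNum) at the top of the
// per-crank loop (e.g. near deliverAndLogToVat), and arrange for
// scheduleDebuggerBeforeCrank(7890) to run during local startup.
```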
Now assume that the community has gone through this debugging process and understands the problem. At this point, they must decide on the best course of action. The committed history is currently all blocks up-to-but-excluding the one that contained the fatal delivery. That history includes all the cranks and deliveries from those blocks, and the contents of the run-queue from the end of that block. It's unlikely that we would modify history to resolve the problem, so there are a few likely approaches to pursue:

* skip the triggering delivery entirely
* terminate the offending vat instead of processing the triggering delivery
* replace the kernel bundle with a fixed one before re-executing the block
So the third component is: assuming the community has decided on one of these actions, how will the validators execute it? This is the most severe form of governance: validator software override. We can't really perform a governance vote because the chain has halted (although #4516 explores an alternative), so all validators are eagerly standing by to take recovery action. What do we tell them to type?
If the decision is to skip a particular delivery, or to kill a particular vat in lieu of processing a particular delivery, then we've got a pair of numbers to get into the kernel. We can add a `controller.scheduleSkip(crankNum)` (or `(vatID, deliveryNum)`) to tell the kernel "when you get to this delivery, skip it instead", or a `controller.scheduleVatTermination(vatID, deliveryNum)` to tell it "when you get to this delivery, terminate the vat instead". Or, if the decision is to replace the kernel bundle, some other `controller` API (#4375) would be used at a particular block height (the current block height).
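A rough sketch of what the scheduling side of such a `controller` API could record, assuming a simple in-memory table; for simplicity both methods are keyed by crankNum here, though a `(vatID, deliveryNum)` key would work the same way, and only the method names come from the proposal above:

```js
// Illustrative only: pending overrides keyed by crankNum.
const scheduledActions = new Map();

export function scheduleSkip(crankNum) {
  scheduledActions.set(crankNum, { action: 'skip' });
}

export function scheduleVatTermination(vatID, crankNum) {
  scheduledActions.set(crankNum, { action: 'terminate', vatID });
}

export function getScheduledAction(crankNum) {
  return scheduledActions.get(crankNum); // undefined means "deliver normally"
}
```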
Then we could introduce some sort of config file for cosmic-swingset: it would read skip/terminate/replace-kernel directives from that file and submit them to swingset. It would do this on each node restart, rather than being driven through transaction messages.
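A sketch of what that restart-time step might look like on the cosmic-swingset side; the file name, line format, and the `scheduleKernelReplacement` call are assumptions for illustration (the real #4375 API may differ), and the directives are keyed by crankNum for simplicity:

```js
import fs from 'fs';

// Read recovery directives (one per line, e.g. "crank-12345: terminate v12")
// on every node restart and hand them to the controller. Nothing here goes
// through consensus transactions.
export function applyRecoveryDirectives(controller, path = 'recovery-directives.txt') {
  if (!fs.existsSync(path)) {
    return; // no overrides configured
  }
  for (const line of fs.readFileSync(path, 'utf8').split('\n')) {
    const m = /^crank-(\d+):\s+(\S+)(?:\s+(\S+))?$/.exec(line.trim());
    if (!m) continue;
    const [, crankNum, action, arg] = m;
    if (action === 'skip') {
      controller.scheduleSkip(Number(crankNum));
    } else if (action === 'terminate') {
      // arg is the vatID; keyed by crankNum here, though the issue also
      // proposes a (vatID, deliveryNum) form
      controller.scheduleVatTermination(arg, Number(crankNum));
    } else if (action === 'replace-kernel') {
      controller.scheduleKernelReplacement(arg); // hypothetical API, see #4375
    }
  }
}
```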
If we had those pieces, then our instructions to validators would be to add a line like `crank-12345: terminate v12` to that config file. Their node would start up, resume executing transactions from the beginning of the most recent block, then swingset would get to the designated cranknum/deliverynum and perform the alternate action. If the action was to kill the vat, all validators would see the vat being killed (in consensus), and the kernel bug would not be triggered. If the action was to skip the delivery, all validators would skip the delivery, and the kernel bug would not be triggered. If the action was to replace the kernel, the validator would use the controller API to replace the kernel bundle before starting to execute the block, the triggering delivery would be allowed to go through, and the fixed kernel would not suffer the bug.
The code that skips a delivery based on a config file would look a lot like the code that calls `debugger` in the same situation, so they should share an implementation.
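To make the "share an implementation" point concrete, a sketch of a single per-delivery hook that handles the `debugger`, skip, and terminate cases from one lookup; the names and shapes are illustrative, not existing kernel code, and `getScheduledAction` is the lookup from the sketch above:

```js
// One hook, consulted once per crank, covering all three override actions.
// `deliver` and `terminateVat` stand in for the kernel's real operations.
export async function runCrankWithOverrides(
  crankNum,
  { getScheduledAction, deliver, terminateVat },
) {
  const override = getScheduledAction(crankNum);
  if (override && override.action === 'debugger') {
    debugger; // eslint-disable-line no-debugger
    // fall through: after poking around, deliver normally
  }
  if (override && override.action === 'skip') {
    return undefined; // drop the delivery instead of executing it
  }
  if (override && override.action === 'terminate') {
    return terminateVat(override.vatID); // kill the vat in lieu of the delivery
  }
  return deliver(); // normal path
}
```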
So the tasks are:

* `controller` APIs to schedule `debugger`, delivery skips, and vat terminations, at a particular crankNum or vatID+deliveryNum
* cosmic-swingset config-file support that invokes those `controller` APIs at each node startup
* maybe a kernel-replacement directive as well (e.g. `crank-12345: replace-kernel bundle-id-NNN` in the config-file syntax)
* a "`debugger` at delivery X" option; show it in the debugger
* a `skip` directive

There are extensions to this which we're not going to pursue right now. The simplest would be: if the kernel crashes during a particular delivery, automatically configure a "terminate vat before delivery NNN" and restart the validator.
That approach would maximize the chances that the chain proceeds forwards, but I think it would increase the chances of divergence and confusion. Death before confusion.
I'm sizing this as a 3: 1 for the kernel code that implements the API and performs the `debugger`/skip/terminate at the right time, 1 for the non-trivial unit tests to make sure it actually does that, and 1 for the cosmic-swingset-side config-file work.
@warner What does Michael need to do for this issue? Something related to config files?
At the meeting, @dtribble suggested:
I'm still confused by that suggestion. It might mean we should use a `debugger()` statement to pop out to a debugger, but then we aren't really terminating the vat, we're just adding a breakpoint that fires under some particular condition (e.g. we're about to replay transcript entry N). Maybe it implies a "pause vat" feature that we didn't talk about: instead of terminating the vat, we just want to not deliver messages to it for a while (but retain the option to resume delivering them again in the future). To implement this, I think we'd need to add a new "pause queue": each time we pull a message off the run-queue and see that it's destined for a paused vat, we append it to the pause-queue instead of delivering it.

I don't know what consequences these new message-ordering rules might have, nor where the authority to pause and resume a vat should be held. I'm pretty sure a paused vat should retain all its normal references, so paused vats are very different from terminated vats (which the kernel should be able to forget about utterly).
Originally posted by @warner in https://github.com/Agoric/agoric-sdk/issues/514#issuecomment-582687353