audit exception pathways in cranks delivery, make sure all lead to kernel panic and host app halt

warner commented 3 years ago

What is the Problem Being Solved?

While thinking about #3073 and what happens when a validator disk runs out of space, we decided we should really audit the full kernel delivery pathway to make sure that any non-vat-behavior-triggered failure (e.g. out of disk, unexpected process termination, host application coding error, etc) causes a kernel panic and brings the host application to a halt, rather than committing some portion of state that is not coherent with the cosmos-sdk/tendermint committed state.

For example, if the kernel-side syscall handler attempts to write data to a HostStorage blobstore and runs out of disk space, that will throw an exception and cause the vat process to see an error. We must ensure that it also sets the right "kernel is panicing" flag, so that as soon as the kernel gets control back from the child process, the kernel will throw its own exception. Ultimately the requirement is that c.step()/c.run() rejects its return promise, and that the host application (cosmic-swingset) reacts to this by halting the process, rather than committing anything to any database.

I believe there are basically three things to pay attention to:

syscalls: any surprises that happen during the kernel syscall handler need to be separated into "things which are the vat's fault and justify a consensus terminaton of that one vat" and "everything else, which justifies a kernel panic"
kernel code between entry into c.step()/c.run() and delivery to vat, plus everything from the delivery being complete to c.step() returning
host application code around the call to c.step()/c.run(), with particular attention to what happens when it rejects the result promise

warner commented 2 years ago

We believe the cosmic-swingset side of this is complete: any rejections from c.run() will cause the chain to halt. The remaining study should look at the swingset side to make sure any delivery problems cause a kernel halt and rejection of the c.run promise.

This overlaps with my current work on making sure device invocations cannot crash the kernel (made more visible by the introduction of "raw devices" as part of #1346).

Tartuffo commented 2 years ago

@FUDCo did you take a pass at this? @warner seems to recall you might have.

warner commented 2 years ago

We checked the surface level in kernel.run/step and tryProcessMessage, and we like what we see. We'd like to do a deeper audit later, more of a purple-team exercise, but we don't have the time to do it now. So I'm going to defer the rest of this ticket until after MN-1.

Agoric / agoric-sdk

audit exception pathways in cranks delivery, make sure all lead to kernel panic and host app halt #3237

What is the Problem Being Solved?