AntelopeIO / eos-vm

make signal handler less greedy: only handle signals from expected memory ranges #23

Closed spoonincode closed 8 months ago

spoonincode commented 8 months ago

EOS VM uses page protection to guard memory accesses and to interrupt execution. Currently, when EOS VM starts execution, it arms its signal handler to treat any fault that occurs before execution completes as a WASM access violation. As a result, faults inside WASM execution and faults inside any host function the WASM calls are both reported and treated as a recoverable access violation.

Because EOS VM captures SIGBUS (wholly unnecessary on Linux, but needed on macOS), a substantial number of unrecoverable system errors occurring in host functions (rare corner cases, but still very real) are instead treated as a recoverable access violation, as if the WASM had simply accessed out-of-bounds memory in its sandbox. These can include an IO error on the DB file, an IO error when swapping, running out of disk space, an unrecoverable ECC error, running out of free huge pages (in heap mode w/ huge pages enabled), and maybe more. These unrecoverable system errors should not be handled as a recoverable WASM memory violation.

Removing SIGBUS handling on Linux would generally resolve this problem, though a host function with a defect that causes a SIGSEGV would still fall into the same improper handling. So, for a more thorough solution, the signal handler now only handles SIGSEGV/SIGBUS/SIGFPE on given memory ranges: the WASM code & WASM memory. Faults that occur outside these ranges are forwarded to the next handler (or kill the application if EOS VM's handler is the last in the chain). This behavior is similar to how EOS VM OC's handler operates. I've also removed SIGBUS handling on Linux entirely to resolve the exceptionally unlikely scenario of catching an ECC failure inside of WASM memory.
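
For readers who want the shape of the idea without opening the diff, here is a minimal sketch of a range-checked, chaining handler. It is not the code in this PR: `guarded_range`, `fault_is_ours`, `chain_to_previous`, `install_handlers`, and the globals are hypothetical names, and it assumes a single executing thread that has already called `sigsetjmp(g_trap_dest, 1)` before entering WASM. It relies on POSIX reporting the faulting memory address in `si_addr` for SIGSEGV/SIGBUS and the faulting instruction address for SIGFPE, which is why both the code range and the memory range are checked.

```cpp
#include <csignal>
#include <cstddef>
#include <cstdint>
#include <setjmp.h>
#include <signal.h>

// Hypothetical description of one region the handler is allowed to claim.
struct guarded_range {
    std::uintptr_t begin = 0;
    std::size_t    size  = 0;

    bool contains(const void* p) const {
        const auto a = reinterpret_cast<std::uintptr_t>(p);
        return a >= begin && a - begin < size; // half-open: [begin, begin + size)
    }
};

// Assumed to be filled in before WASM execution starts (illustrative globals).
static guarded_range    g_code_range;   // generated WASM code
static guarded_range    g_memory_range; // WASM linear memory plus guard pages
static sigjmp_buf       g_trap_dest;    // set by sigsetjmp() at the WASM entry point
static struct sigaction g_prev_segv, g_prev_fpe, g_prev_bus;

// True only when the reported address lies in a range we know how to recover from.
bool fault_is_ours(const void* addr) {
    return g_code_range.contains(addr) || g_memory_range.contains(addr);
}

static const struct sigaction& previous_for(int sig) {
    if (sig == SIGFPE) return g_prev_fpe;
    if (sig == SIGBUS) return g_prev_bus;
    return g_prev_segv;
}

// Hand a fault we do not own to whoever was installed before us.
static void chain_to_previous(int sig, siginfo_t* info, void* uctx) {
    const struct sigaction& prev = previous_for(sig);
    if (prev.sa_flags & SA_SIGINFO) {
        prev.sa_sigaction(sig, info, uctx);
    } else if (prev.sa_handler != SIG_DFL && prev.sa_handler != SIG_IGN) {
        prev.sa_handler(sig);
    } else {
        // Last in the chain: restore the default action and return. The faulting
        // instruction re-executes, faults again, and the process is killed.
        signal(sig, SIG_DFL);
    }
}

static void on_fault(int sig, siginfo_t* info, void* uctx) {
    if (fault_is_ours(info->si_addr))
        siglongjmp(g_trap_dest, sig);   // recoverable: unwind to the WASM entry point
    chain_to_previous(sig, info, uctx); // anything else is not ours to swallow
}

void install_handlers() {
    struct sigaction sa {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &g_prev_segv);
    sigaction(SIGFPE,  &sa, &g_prev_fpe);
#ifdef __APPLE__
    sigaction(SIGBUS,  &sa, &g_prev_bus); // only macOS needs SIGBUS here
#endif
}
```

The point of the sketch is the ordering in `on_fault`: the range check comes first, and only a hit is converted into a recoverable trap. A fault raised from a host function, for example a SIGBUS from a failing mapped DB file, misses both ranges and reaches the previous handler or the default action.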

Of course, this means if one of the above system errors is occurring, nodeos will now simply be killed, whereas before it'd potentially get stuck in some wedged state that was still cleanly stoppable. While that might sound bad, it's a good thing: we should only be recovering from errors we know we can properly recover from.

This behavior is one theory for AntelopeIO/leap#2242: some fault is masquerading as an access violation due to the current greediness of the handlers.