EOS VM uses page protection to guard memory accesses and to interrupt execution. Currently, when EOS VM starts execution, it sets up its signal handler to treat any fault that occurs before execution completes as an access violation WASM error. This means that faults inside WASM execution and faults in any host function the WASM calls are all reported and treated as a recoverable access violation.
Because EOS VM captures SIGBUS (wholly unnecessary on Linux, but needed on macOS), a substantial number of unrecoverable system errors occurring in host functions (rare corner cases, but still very real) will instead be treated as a recoverable access violation, as if the WASM had simply accessed out-of-bounds memory in its sandbox. These can include an IO error on the DB file, an IO error when swapping, running out of disk space, an unrecoverable ECC error, running out of free huge pages (in heap mode with huge pages enabled), and maybe more. These unrecoverable system errors should not be handled as a recoverable WASM memory violation.
Removing SIGBUS from being handled on Linux would generally resolve this problem, though if a host function had a defect causing a SIGSEGV, it would fall into the same improper handling. So for a more thorough solution, the signal handler now only handles SIGSEGV/SIGBUS/SIGFPE on given memory ranges: the WASM code and the WASM memory. Faults that occur outside these ranges are forwarded to the next handler (or kill the application if EOS VM's handler is the last one chained). This behavior is similar to how EOS VM OC's handler operates. I've also removed SIGBUS from being handled on Linux entirely, to resolve the exceptionally unlikely scenario of catching an ECC failure inside of WASM memory.
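To illustrate the range-filtered approach, here is a minimal sketch. It is not the actual EOS VM code: `guarded_range`, `code_range`, `memory_range`, `recover_point`, and all function names are hypothetical, and for brevity it installs a handler for SIGSEGV only. It assumes the VM records the two ranges before execution starts and has called `sigsetjmp(recover_point, 1)` at its entry point.

```cpp
#include <signal.h>
#include <setjmp.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of one guarded region (WASM code or WASM memory).
struct guarded_range {
   std::uintptr_t begin = 0;
   std::size_t    size  = 0;
   bool contains(std::uintptr_t addr) const {
      return addr >= begin && addr - begin < size;
   }
};

// Assumption: the VM fills these in before it starts executing WASM.
static guarded_range code_range;
static guarded_range memory_range;

static struct sigaction prev_segv;      // whatever handler was installed before ours
static sigjmp_buf       recover_point;  // assumed to be set via sigsetjmp() at VM entry

// True only if the fault address lies inside WASM code or WASM memory.
inline bool is_wasm_fault(const void* fault_addr) {
   auto a = reinterpret_cast<std::uintptr_t>(fault_addr);
   return code_range.contains(a) || memory_range.contains(a);
}

// Hand the signal to the previously installed handler. If there was none,
// restore the default disposition; on return the faulting instruction runs
// again and the default action terminates the process.
static void forward_to_previous(int sig, siginfo_t* info, void* uctx) {
   if (prev_segv.sa_flags & SA_SIGINFO) {
      if (prev_segv.sa_sigaction) {
         prev_segv.sa_sigaction(sig, info, uctx);
         return;
      }
   } else if (prev_segv.sa_handler != SIG_DFL && prev_segv.sa_handler != SIG_IGN) {
      prev_segv.sa_handler(sig);
      return;
   }
   signal(sig, SIG_DFL);
}

static void on_fault(int sig, siginfo_t* info, void* uctx) {
   if (is_wasm_fault(info->si_addr)) {
      // Recoverable: unwind to the VM entry point, which reports an access violation.
      siglongjmp(recover_point, 1);
   }
   // Not ours (e.g. a host function defect or a system error): do not swallow it.
   forward_to_previous(sig, info, uctx);
}

inline void install_fault_handler() {
   struct sigaction sa{};
   sa.sa_sigaction = on_fault;
   sa.sa_flags     = SA_SIGINFO;
   sigemptyset(&sa.sa_mask);
   int rc = sigaction(SIGSEGV, &sa, &prev_segv);
   assert(rc == 0);
   (void)rc;
}
```

The key design point is that the decision is made on `si_addr` rather than on "are we currently executing WASM", so a fault raised from a host function at an address outside the sandbox is never reported as a WASM access violation.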
Of course, this means that if one of the above system errors occurs, nodeos will now simply be killed, whereas before it would potentially get stuck in some wedged state that was still cleanly stoppable. While that might sound bad, it's a good thing: we should only be recovering from errors we know we can properly recover from.
This behavior is one theory for AntelopeIO/leap#2242: some fault is masquerading as an access violation due to the current greediness of the handlers.