futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

ROBUSTNESS: Protect against user interrupts for calls that need to be atomic #438

Open HenrikBengtsson opened 4 years ago

HenrikBengtsson commented 4 years ago

Background

In interactive R sessions, the user can signal a user interrupt by hitting Ctrl-C in the terminal. If this happens while R is evaluating a set of R expressions that must either all complete or not run at all, there is a risk of breaking the state of a future. In some cases we can recover from this, whereas in others the only solution is to restart R.

Suggestion

In R (>= 3.5.0), we have suspendInterrupts(expr), which suspends user interrupts while evaluating the expression expr.
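For illustration, here is a minimal, self-contained sketch (not code from the future package) of how suspendInterrupts() keeps a two-step critical section intact; the Sys.sleep() calls merely stand in for work that must not be split by a Ctrl-C:

atomic_step <- function() {
  suspendInterrupts({
    Sys.sleep(1)                       ## first half of the "atomic" update
    message("state updated (part 1)")
    Sys.sleep(1)                       ## second half; an interrupt here would
    message("state updated (part 2)")  ## otherwise leave the state half-done
  })
}
atomic_step()  ## a Ctrl-C during the call is not acted on inside the block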

The first task is to identify places where we can safely protect against user interrupts without risking a situation where R blocks completely. If that does happen, we can always signal SIGQUIT (Ctrl-\ in the terminal) as a last resort.

One obvious candidate is the main-worker communication for cluster futures. There should be no need to protect against user interrupts on the worker's end.
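For instance, the round trip on the main R session's side could be made atomic along these lines; send_call() and receive_result() are hypothetical placeholders for the actual socket communication, not functions in future or parallel:

value <- suspendInterrupts({
  send_call(worker, expr, globals)  ## placeholder: write the call to the worker
  receive_result(worker)            ## placeholder: read the result back
})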

Due to the risk of breaking something, we should probably introduce an R option, future.onInterrupts, and a corresponding environment variable, R_FUTURE_ONINTERRUPTS, to allow users and sysadmins to enable or disable this feature. To minimize the overhead of checking these all the time, it is probably better to read them only once, when the package is loaded, i.e. during .onLoad().

We could start off by enabling these user-interrupt protections only for interactive R sessions.
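A minimal sketch of what the .onLoad() part could look like, assuming the proposed names above; the allowed values and the interactive-only default are placeholders, not a decided design:

.onLoad <- function(libname, pkgname) {
  ## Environment variable wins if set; otherwise fall back to the R option;
  ## otherwise (hypothetically) enable protection only in interactive sessions.
  env <- Sys.getenv("R_FUTURE_ONINTERRUPTS", "")
  if (nzchar(env)) {
    enable <- isTRUE(as.logical(env))
  } else {
    enable <- getOption("future.onInterrupts", interactive())
  }
  options(future.onInterrupts = isTRUE(enable))
}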

See also

king-of-poppk commented 1 year ago

This is the last obstacle for me to implement efficient cancellable promises and truly asynchronous Shiny reactives on top of future.

For now I have resorted to not interrupting cancelled futures: I just mark them as cancelled, ignore their return values and conditions, and increase the number of workers.

I also have an implementation on top of later/callr::r_bg, spawning one process per computation, but that is very slow even though I can properly interrupt expired computations.

king-of-poppk commented 1 year ago

PS: I replaced my homemade later/callr::r_bg implementation with future.callr. Much simpler, but still as slow.

HenrikBengtsson commented 1 year ago

PS: I replaced my homemade later/callr::r_bg implementation with future.callr. Much simpler, but still as slow.

FWIW, note that in the next version of future.callr, future.callr::callr will join multicore in automatically releasing the worker slot if, and only if, the framework identifies that the worker has terminated or crashed. Those two backends were low-hanging fruit, mainly because their worker processes are transient. It might be possible to do something like this for other future backends as well, but I will move forward on those slowly and with great care, as explained in https://www.jottr.org/2023/07/01/parallelly-managing-workers/.

Note that this issue focuses on protecting against user interrupts occurring in the main R session. Hopefully, there is little need for protecting against user interrupts signaled to the worker processes.

king-of-poppk commented 1 year ago

PS: I replaced my homemade later/callr::r_bg implementation with future.callr. Much simpler, but still as slow.

Actually, with the future.callr backend and px$interrupt(), I get a load of these:

Unhandled promise error: CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not start R, exited with non-zero status, has crashed or was killed’. Post-mortem diagnostic: The parallel worker (PID 65690) started at 2023-08-02T10:04:12+0000 finished with exit code 1. The total size of the 13 globals exported is 439.84 KiB. The three largest globals are ‘read_csv’ (170.23 KiB of class ‘function’), ‘read_delimited’ (146.92 KiB of class ‘function’) and ‘req’ (23.21 KiB of class ‘function’)

With px$kill() the exit code is -9:

Unhandled promise error: CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not start R, exited with non-zero status, has crashed or was killed’. Post-mortem diagnostic: The parallel worker (PID 68588) started at 2023-08-02T10:20:05+0000 finished with exit code -9. The total size of the 8 globals exported is 72.60 KiB. The three largest globals are ‘req’ (23.21 KiB of class ‘function’), ‘d’ (20.33 KiB of class ‘list’) and ‘dotloop’ (12.77 KiB of class ‘function’)

And future::nbrOfFreeWorkers() never goes back up. I think I was tricked into thinking this was working because promises has issues with duplicated promise errors (https://github.com/rstudio/promises/issues/86) and I had set workers = 100, which hid the problem.
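A minimal way to see the stuck counter (the future killing its own worker here is only a stand-in for the px$interrupt() / px$kill() calls above):

library(future)
plan(future.callr::callr, workers = 2L)
nbrOfFreeWorkers()                        ## 2
f <- future(tools::pskill(Sys.getpid()))  ## simulate a worker that dies
try(value(f))                             ## "callr subprocess failed ..."
nbrOfFreeWorkers()                        ## never goes back up to 2 here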

FWIW, note that in the next version of future.callr, future.callr::callr will join multicore in automatically releasing the worker slot if, and only if, the framework identifies that the worker has terminated/crashed.

Indeed, releasing the workers works if one uses the latest commit on the develop branch of future.callr:

renv::install("https://github.com/HenrikBengtsson/future.callr/archive/a0db4c055629504049b4612b5e42cd5488fbd111.tar.gz")

~Is a new release planned for soonish?~ DONE, see https://github.com/HenrikBengtsson/future/discussions/695