HenrikBengtsson / future

R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

future() and resolved() handle FutureError:s differently for different backends #696

Open HenrikBengtsson opened 11 months ago

HenrikBengtsson commented 11 months ago

Issue

Future orchestration errors (i.e. FutureError) that occur when calling future(), resolved(), and resolve() are handled differently depending on the future backend. Below are a few examples, each of which kills the worker that runs future #2.

multicore

library(future)
plan(multicore, workers = 2)

segfault <- function(ii) {
  if (ii == 2) tools::pskill(Sys.getpid()) else Sys.sleep(1)
  ii
}

fs <- lapply(1:4, FUN = function(ii) {
  message(sprintf("Launching future #%d", ii))
  future({ segfault(ii) })
})
#> Launching future #1
#> Launching future #2
#> Launching future #3
#> Launching future #4
#> Warning message:
#> In mccollect(jobs = jobs, wait = TRUE) :
#>   1 parallel job did not deliver a result
#> Calls: lapply ... value.Future -> result -> result.MulticoreFuture -> mccollect

resolved(fs)
#> [1] TRUE TRUE TRUE TRUE

fs <- resolve(fs)

rs <- lapply(fs, FUN = result)
#> Error: Failed to retrieve the result of MulticoreFuture (<none>) from the
#> forked worker (on localhost; PID 687068). Post-mortem diagnostic: No
#> process exists with this PID, i.e. the forked localhost worker is no longer
#> alive. The total size of the 2 globals exported is 6.52 KiB. There are two
#> globals: 'segfault' (6.47 KiB of class 'function') and 'ii' (56 bytes of
#> class 'numeric')

vs <- value(fs)
#> Error: Failed to retrieve the result of MulticoreFuture (<none>) from the
#> forked worker (on localhost; PID 687068). Post-mortem diagnostic: No
#> process exists with this PID, i.e. the forked localhost worker is no longer
#> alive. The total size of the 2 globals exported is 6.52 KiB. There are two
#> globals: 'segfault' (6.47 KiB of class 'function') and 'ii' (56 bytes of
#> class 'numeric')

multisession

library(future)
plan(multisession, workers = 2)

segfault <- function(ii) {
  if (ii == 2) tools::pskill(Sys.getpid()) else Sys.sleep(1)
  ii
}

fs <- lapply(1:4, FUN = function(ii) {
  message(sprintf("Launching future #%d", ii))
  future({ segfault(ii) })
})
#> Launching future #1
#> Launching future #2
#> Launching future #3
#> Error in unserialize(node$con) : 
#>   MultisessionFuture (<none>) failed to receive message results from
#> cluster RichSOCKnode #2 (PID 687611 on localhost 'localhost'). The reason
#> reported was 'error reading from connection'. Post-mortem diagnostic: No
#> process exists with this PID, i.e. the localhost worker is no longer alive.
#> The total size of the 2 globals exported is 6.52 KiB. There are two
#> globals: 'segfault' (6.47 KiB of class 'function') and 'ii' (56 bytes of
#> class 'numeric')

future.callr::callr

library(future)
plan(future.callr::callr, workers = 2)

segfault <- function(ii) {
  if (ii == 2) tools::pskill(Sys.getpid()) else Sys.sleep(1)
  ii
}

fs <- lapply(1:4, FUN = function(ii) {
  message(sprintf("Launching future #%d", ii))
  future({ segfault(ii) })
})
#> Launching future #1
#> Launching future #2
#> Launching future #3
#> Launching future #4

resolved(fs)
#> Error: CallrFuture (<none>) failed. The reason reported was '! callr
#> subprocess failed: could not start R, exited with non-zero status, has
#> crashed or was killed'. Post-mortem diagnostic: The parallel worker
#> (PID 686807) started at 2023-08-08T09:08:46+0000 finished with exit
#> code -15. The total size of the 2 globals exported is 6.52 KiB. There
#> are two globals: 'segfault' (6.47 KiB of class 'function') and 'ii'
#> (56 bytes of class 'numeric')

fs <- resolve(fs)
#> Error: CallrFuture (<none>) failed. The reason reported was '! callr
#> subprocess failed: could not start R, exited with non-zero status, has
#> crashed or was killed'. Post-mortem diagnostic: The parallel worker
#> (PID 686807) started at 2023-08-08T09:08:46+0000 finished with exit
#> code -15. The total size of the 2 globals exported is 6.52 KiB. There
#> are two globals: 'segfault' (6.47 KiB of class 'function') and 'ii'
#> (56 bytes of class 'numeric')

rs <- lapply(fs, FUN = result)
#> Error: CallrFuture (<none>) failed. The reason reported was '! callr
#> subprocess failed: could not start R, exited with non-zero status, has
#> crashed or was killed'. Post-mortem diagnostic: The parallel worker
#> (PID 686807) started at 2023-08-08T09:08:46+0000 finished with exit
#> code -15. The total size of the 2 globals exported is 6.52 KiB. There
#> are two globals: 'segfault' (6.47 KiB of class 'function') and 'ii'
#> (56 bytes of class 'numeric')

vs <- value(fs)
#> Error: CallrFuture (<none>) failed. The reason reported was '! callr
#> subprocess failed: could not start R, exited with non-zero status, has
#> crashed or was killed'. Post-mortem diagnostic: The parallel worker
#> (PID 686807) started at 2023-08-08T09:08:46+0000 finished with exit
#> code -15. The total size of the 2 globals exported is 6.52 KiB. There
#> are two globals: 'segfault' (6.47 KiB of class 'function') and 'ii'
#> (56 bytes of class 'numeric')

Suggestion

Harmonize how and when the backends signal a FutureError. This is related to releasing future slots for failed futures.
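To compare backends step by step, one could record whether each call signals a FutureError. The following is a minimal sketch, assuming only that orchestration errors inherit from class FutureError. The helper name probe_step() and the simulated condition fake_future_error are hypothetical and not part of the future package.

```r
# Hypothetical helper (not part of the future package): evaluate 'expr' and
# report which kind of condition, if any, terminated it.
probe_step <- function(expr) {
  tryCatch({
    expr                                       # forces the lazily evaluated argument
    "ok"
  },
  FutureError = function(e) "FutureError",     # orchestration failure
  error       = function(e) "other error")     # e.g. an error raised by user code
}

# Simulated condition, so the sketch runs without actually killing a worker.
# This is an assumption for illustration; a real FutureError carries more fields.
fake_future_error <- structure(
  class = c("FutureError", "error", "condition"),
  list(message = "worker is no longer alive", call = NULL)
)

probe_step(1 + 1)
#> [1] "ok"
probe_step(stop(fake_future_error))
#> [1] "FutureError"
probe_step(stop("boom"))
#> [1] "other error"
```

With a real backend, the same helper could wrap each step, for example probe_step(resolved(fs)), probe_step(resolve(fs)), and probe_step(value(fs)), which would make it easy to tabulate at which step each backend reports the failure.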

See also

This is related to https://github.com/HenrikBengtsson/future.callr/issues/11.