callr not working with nested loops

jestover commented 10 months ago

library(furrr)
library(future.callr)  # provides the callr future backend used in plan() below

inner_loop <- function(x){future_map_dbl(x, ~ .x)}
outer_loop <- function(x){future_map(x, ~ inner_loop(.x))}
x <- list(1:10, 11:20, 21:30, 31:40)

plan(sequential)
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 2)))
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error:
! object '...furrr_map_fn' not found

plan(list(tweak(callr, workers = 2), tweak(multisession, workers = 2)))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error:
! object '...furrr_map_fn' not found

plan(callr(workers = 2))
outer_loop(x)
Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `vctrs::vec_c()`:
! Can't convert `..1` <list> to <double>.

plan(list(tweak(multisession, workers = 2), tweak(callr, workers = 2)))
outer_loop(x)
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[3]]
 [1] 21 22 23 24 25 26 27 28 29 30

[[4]]
 [1] 31 32 33 34 35 36 37 38 39 40

I ran this example on a 2023 MacBook Pro, but I originally discovered the issue while running code on a Linux server. The real code is a cross-validation exercise on large text data. Memory usage kept growing over time, so I switched to the callr backend to try to address that, but I keep running into other problems. This is the first one I have been able to replicate in a small reproducible example. Let me know if there is any other useful information I can provide.

HenrikBengtsson commented 10 months ago

Thanks for the report. I can reproduce this. I also noticed that we get another error without nested parallelization:

library(furrr)
plan(future.callr::callr, workers = 1)

inner_loop <- function(x) { future_map_dbl(x, ~ .x) }
outer_loop <- function(x) { future_map(x, ~ inner_loop(.x)) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)

gives

Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `vctrs::vec_c()`:
! Can't convert `..1` <list> to <double>.

It does not happen with other backends, e.g. plan(future.batchtools::batchtools_local), plan(future::cluster, workers = 1), and plan(future::multisession, workers = 2).

HenrikBengtsson commented 10 months ago

I can also reproduce it without NSE, i.e.

library(furrr)
library(future.callr)
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))
inner_loop <- function(x) { future_map_dbl(x, identity) }
outer_loop <- function(x) { future_map(x, inner_loop) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)

produces the same error.

What's interesting, though, is that it looks specific to furrr. For example, I cannot reproduce it with future.apply:

library(future.apply)
library(future.callr)
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))

inner_loop <- function(x) { future_sapply(x, FUN = identity) }
outer_loop <- function(x) { future_lapply(x, FUN = inner_loop) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)
str(y)
#> List of 4
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ : int [1:10] 11 12 13 14 15 16 17 18 19 20
#>  $ : int [1:10] 21 22 23 24 25 26 27 28 29 30
#>  $ : int [1:10] 31 32 33 34 35 36 37 38 39 40

It also works with doFuture:

library(doFuture)
library(future.callr)
plan(list(tweak(callr, workers = 2), tweak(callr, workers = 2)))

inner_loop <- function(x) { foreach(z = x, .combine = c) %dofuture% z }
outer_loop <- function(x) { foreach(z = x) %dofuture% inner_loop(z) }
x <- list(1:10, 11:20, 21:30, 31:40)
y <- outer_loop(x)
str(y)
#> List of 4
#>  $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
#>  $ : int [1:10] 11 12 13 14 15 16 17 18 19 20
#>  $ : int [1:10] 21 22 23 24 25 26 27 28 29 30
#>  $ : int [1:10] 31 32 33 34 35 36 37 38 39 40

jestover commented 10 months ago

I had noticed the separate errors as well. I have also been getting hard-to-diagnose errors in the real code that I was trying to use callr for. Here are some examples in case they are helpful (they may not be without the full context).

Error: CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not read result from callr’. Post-mortem diagnostic: The parallel worker (PID 38814) started at 2023-12-28T22:53:37+0000 finished with exit code 0. The total size of the 8 globals exported is 603.19 MiB. The three largest globals are ‘...furrr_dots’ (603.02 MiB of class ‘list’), ‘future_dmr’ (79.36 KiB of class ‘function’) and ‘...furrr_fn’ (55.33 KiB of class ‘function’)
Execution halted

Error in (function (.x, .f, ..., .progress = FALSE)  : ℹ In index: 1.
Caused by error in `do.call()`:
! object '...furrr_map_fn' not found
Calls: robust_mnir ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted

Error in (function (.x, .f, ..., .progress = FALSE)  : ℹ In index: 1.
Caused by error:
ℹ In index: 1.
Caused by error in `...furrr_fn()`:
! unused arguments (.y = list(c(2, 0, 8, 1, 2, 1, 2, 0, 3, 0, 3, 1, 2, 1, 1, 1, 0, 1, 1, 0, 1, 2, 6, 2, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 2, 2, 2, 0, 0, 0, 0, 1, 1, 2, 3, 12, 0, 0, 0, 0, 0, 0, 5, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 3, 1, 0, 0, 0, 2, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 5, 0, 13, 0, 0, 0, 2, 0, 0, 3, 1, 0, 0, 4, 0, 2, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5, 4, 2, 0, 0, 1, 0, 0,
0, 0, 0, 1, 1, 3, 1, 1, 0, 2, 1, 0, 1, 1, 0, 0, 3, 0, 1, 1, 1, 1, 0, 0, 0, 2, 1, 0, 2, 4, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 2, 1, 1, 5, 3, 0, 0, 2, 0, 0, 0, 0, 1, 9, 1, 11, 0, 0, 1, 2, 0, 1, 0, 17, 2, 1, 1, 0, 1, 0, 0, 10, 2, 1, 0, 0, 0, 0, 1, 1, 5, 0, 2, 1, 0, 1, 0, 0, 0, 3, 0, 0, 1, 0, 2, 1, 0, 0, 1, 0, 3, 1, 0, 1, 1, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 2, 1, 6, 1, 0, 0
Calls: robust_mnir ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted

Given that this is only an issue with furrr, would you prefer me to repost the issue there?

jestover commented 10 months ago

Here is an error from a more recent attempt with plan(list(tweak(multisession, workers = 6), tweak(callr, workers = 8))):

Error in unserialize(node$con) :
  MultisessionFuture (<none>) failed to receive message results from cluster RichSOCKnode #4 (PID 255015 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive. The total size of the 8 globals exported is 603.20 MiB. The three largest globals are ‘...furrr_dots’ (603.02 MiB of class ‘list’), ‘future_dmr’ (79.36 KiB of class ‘function’) and ‘...furrr_fn’ (55.33 KiB of class ‘function’)
Calls: robust_mnir ... resolved -> resolved.ClusterFuture -> receiveMessageFromWorker
Execution halted

HenrikBengtsson commented 10 months ago

Thanks for more examples and details.

The error "MultisessionFuture (<none>) failed to receive message results from cluster RichSOCKnode #4 (PID 255015 on localhost ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: No process exists with this PID, i.e. the localhost worker is no longer alive..." suggests that the "multisession" background process terminated abruptly.

Similarly, the error "CallrFuture (<none>) failed. The reason reported was ‘! callr subprocess failed: could not read result from callr’..." suggests that the "callr" background process is no longer responding. The post-mortem diagnostic "The parallel worker (PID 38814) started at 2023-12-28T22:53:37+0000 finished with exit code 0" confirms that it is no longer running.

A background R process terminating prematurely suggests that something exceptional happened in that process. A simple run-time error would not cause this. Instead, it might be due to a core dump (which should never happen in R) or the process running out of memory. If you loaded the same objects and packages into an interactive R session and ran the same code, it would most likely crash too. In other words, there's nothing special about background R processes, other than that with parallelization we might run many more of them at the same time.
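To get a rough feel for the out-of-memory scenario, one can estimate how much data is shipped to the workers. The following is only a sketch in base R: `big_arg` is a hypothetical stand-in for whatever is passed via `...` (reported above as ‘...furrr_dots’, 603.02 MiB), and `n_workers` is an assumed number of processes alive at the same time. It also assumes the worst case where every worker holds its own copy, which may overstate actual usage.

```r
# Hypothetical stand-in for the large object passed through `...`;
# in the real code this would be the actual data (about 603 MiB per the log).
big_arg <- list(counts = matrix(0, nrow = 1000, ncol = 1000))

# In-memory size of the object, in MiB
size_mib <- as.numeric(object.size(big_arg)) / 1024^2

# Size of the serialized payload that would be sent to one worker, in MiB
payload_mib <- length(serialize(big_arg, connection = NULL)) / 1024^2

# Assumed number of concurrently running workers (illustrative value only)
n_workers <- 8

# Worst-case estimate: every worker holds its own copy of the payload
worst_case_mib <- payload_mib * n_workers

cat(sprintf("object: %.1f MiB, payload: %.1f MiB, worst case: %.1f MiB\n",
            size_mib, payload_mib, worst_case_mib))
```

If the worst-case figure approaches the physical memory of the machine, workers being killed by the operating system becomes a plausible explanation for the "no longer alive" diagnostics.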

These errors, due to "crashed" workers, are independent of the other errors, including your original error on object '...furrr_map_fn' not found. The latter errors are due to something not working correctly in future, future.callr, or furrr.
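The two kinds of failure can usually be told apart by the class of the condition. Below is a minimal sketch, assuming the future package with a multisession backend; `classify()` is a hypothetical helper written for this illustration, and killing the worker with `tools::pskill()` merely simulates a crash (it is not what happens in the reported code).

```r
library(future)
plan(multisession, workers = 2)

# Hypothetical helper: resolve a future and report which kind of failure,
# if any, occurred. FutureError must be listed before the generic error
# handler, because a FutureError is also an error.
classify <- function(f) {
  tryCatch({
    value(f)
    "ok"
  }, FutureError = function(e) {
    "worker failure"
  }, error = function(e) {
    "run-time error"
  })
}

# An ordinary run-time error is captured on the worker and relayed as-is
runtime_kind <- classify(future(stop("boom")))

# Simulated crash: the worker process terminates itself before returning
crash_kind <- classify(future(tools::pskill(Sys.getpid())))

print(c(runtime = runtime_kind, crash = crash_kind))

# Shut down the remaining workers
plan(sequential)
```

Under these assumptions, "object '...furrr_map_fn' not found" would fall into the first category, whereas the "failed to receive message results" and "could not read result from callr" messages would fall into the second.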

> Given that this is only an issue with furrr, would you prefer me to repost the issue there?

Let's keep it here until we know a bit more about why this happens.

cc/ @DavisVaughan

HenrikBengtsson / future.callr

callr not working with nested loops #27