Bioconductor / BiocParallel

Bioconductor facilities for parallel evaluation
https://bioconductor.org/packages/BiocParallel

stop.on.error = FALSE for DoParam doesn't work as expected #240

Closed GiuliaPais closed 1 year ago

GiuliaPais commented 1 year ago

As per title, when initialising a new DoparParam object with stop.on.error = FALSE, evaluation should not stop on error, but it does. Here is a reprex that demonstrates it:

foo <- function(num_vec) {
  vec_sum <- sum(num_vec, na.rm = TRUE)
  # calls a function not declared - this should raise error
  vec_fun <- bar(num_vec)
  return(list(sum = vec_sum, fun = vec_fun))
}

launch_par_function <- function(max_workers = 4, stop_on_error = TRUE) {
  old_be <- doFuture::registerDoFuture()
  old_plan <- future::plan(future::multisession, workers = max_workers)
  on.exit(
    {
      future::plan(old_plan)
      foreach::setDoPar(
        fun = old_be$fun,
        data = old_be$data, info = old_be$info
      )
    },
    add = TRUE
  )
  p <- BiocParallel::DoparParam(stop.on.error = stop_on_error)

  data_list <- list(
    A = c(1, 2, 3, 4, 5),
    B = c(1, 3, 5, 7, 9)
  )

  results <- BiocParallel::bplapply(
    X = data_list,
    FUN = foo,
    BPPARAM = p
  )

  return(results)
}

# should not error
result <- launch_par_function(stop_on_error = FALSE)
#> Error: BiocParallel errors
#>   2 remote errors, element index: 1, 2
#>   0 unevaluated and other errors
#>   first remote error:
#> Error in bar(num_vec): could not find function "bar"

# should error
result <- launch_par_function(stop_on_error = TRUE)
#> Error: BiocParallel errors
#>   2 remote errors, element index: 1, 2
#>   0 unevaluated and other errors
#>   first remote error:
#> Error in bar(num_vec): could not find function "bar"

Created on 2023-02-20 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Rome
#>  date     2023-02-20
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  BiocParallel   1.32.5  2022-12-23 [1] Bioconductor
#>  cli            3.6.0   2023-01-09 [1] CRAN (R 4.2.0)
#>  codetools      0.2-19  2023-02-01 [1] CRAN (R 4.2.0)
#>  digest         0.6.31  2022-12-11 [1] CRAN (R 4.2.1)
#>  doFuture       0.12.2  2022-04-26 [1] CRAN (R 4.2.0)
#>  evaluate       0.20    2023-01-17 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2   2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.6.1   2023-02-06 [1] CRAN (R 4.2.0)
#>  future         1.31.0  2023-02-01 [1] CRAN (R 4.2.0)
#>  globals        0.16.2  2022-11-21 [1] CRAN (R 4.2.0)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.4   2022-12-07 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14  2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr          1.42    2023-01-25 [1] CRAN (R 4.2.1)
#>  lifecycle      1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  listenv        0.9.0   2022-12-16 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  parallelly     1.34.0  2023-01-13 [1] CRAN (R 4.2.0)
#>  purrr          1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  reprex         2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang          1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.20    2023-01-19 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.14    2022-08-22 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  styler         1.9.0   2023-01-15 [1] CRAN (R 4.2.0)
#>  vctrs          0.5.2   2023-01-23 [1] CRAN (R 4.2.0)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun           0.37    2023-01-31 [1] CRAN (R 4.2.0)
#>  yaml           2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

mtmorgan commented 1 year ago

Thanks for the report.

It might help to clarify what stop.on.error= does. Suppose there are two workers and 4 tasks, numbered 1, 2, 3, 4. Each task is to compute foo(). Each worker gets two tasks: 1 and 2 go to the first worker, 3 and 4 to the second. If stop.on.error=TRUE, the first worker tries to evaluate foo(1). This fails, and since stop.on.error=TRUE, the worker does not try to evaluate foo(2). Likewise, the second worker, in parallel with the first, tries to evaluate foo(3). This fails, and the second worker does not try to evaluate foo(4). This is reported as 2 remote errors and 2 unevaluated errors:

> bplapply(1:4, foo, BPPARAM = SnowParam(2, stop.on.error = TRUE))
Error: BiocParallel errors
  2 remote errors, element index: 1, 3
  2 unevaluated and other errors
  first remote error:
Error in bar(num_vec): could not find function "bar"

Using stop.on.error = TRUE is very appropriate in this situation, since bar() will not magically be available to other tasks on the same worker.

Suppose stop.on.error = FALSE. The first worker tries task 1 (foo(1)). This fails, so it tries foo(2), which also fails. Likewise for the second worker, trying foo(3) and then foo(4). We see 4 remote errors

> bplapply(1:4, foo, BPPARAM = SnowParam(2, stop.on.error = FALSE))
Error: BiocParallel errors
  4 remote errors, element index: 1, 2, 3, 4
  0 unevaluated and other errors
  first remote error:
Error in bar(num_vec): could not find function "bar"

This might be appropriate if the error were somehow stochastic, e.g., a numerical method that sometimes fails to converge, in which case it makes sense to continue trying other tasks...

In your code, you've set the number of workers to 4. There are only two tasks (A and B), so one worker gets task A and another gets task B. Both tasks error, no worker has any further task to attempt, and stop.on.error makes no difference. This is what you report -- 2 remote errors regardless of the value of stop.on.error.
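
To make the counting concrete, here is a toy model of the reasoning above in base R. It is only a sketch and not BiocParallel's actual scheduler: simulate_chunks() is a hypothetical helper, every task is assumed to fail (as with the missing bar()), and tasks are assumed to be dealt out in contiguous chunks, one chunk per worker.

```r
# Toy model only; NOT how BiocParallel schedules work internally.
# simulate_chunks() is a hypothetical helper written for this illustration.
simulate_chunks <- function(n_tasks, n_workers, stop_on_error) {
  # Assumption: contiguous chunks, one per worker, with no idle chunks.
  chunks <- parallel::splitIndices(n_tasks, min(n_tasks, n_workers))
  status <- rep("unevaluated", n_tasks)
  for (chunk in chunks) {
    for (i in chunk) {
      # Assumption: every task fails, like foo() calling the undefined bar().
      status[i] <- "error"
      # With stop_on_error, this worker abandons the rest of its own chunk.
      if (stop_on_error) break
    }
  }
  c(error = sum(status == "error"), unevaluated = sum(status == "unevaluated"))
}

simulate_chunks(4, 2, stop_on_error = TRUE)   # error = 2, unevaluated = 2
simulate_chunks(4, 2, stop_on_error = FALSE)  # error = 4, unevaluated = 0
simulate_chunks(2, 4, stop_on_error = TRUE)   # error = 2, unevaluated = 0
simulate_chunks(2, 4, stop_on_error = FALSE)  # error = 2, unevaluated = 0
```

The last two calls correspond to the reprex: with one task per worker there is nothing left to skip, so the setting cannot change the outcome.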

You can see the expected behavior if you arrange for more tasks than workers, e.g., by adding two tasks to data_list and reducing the number of workers to 2

> launch_par_function(2, stop_on_error = TRUE)
Error: BiocParallel errors
  2 remote errors, element index: 1, 3
  2 unevaluated and other errors
  first remote error:
Error in bar(num_vec): could not find function "bar"
> launch_par_function(2, stop_on_error = FALSE)
Error: BiocParallel errors
  4 remote errors, element index: 1, 2, 3, 4
  0 unevaluated and other errors
  first remote error:
Error in bar(num_vec): could not find function "bar"

stop.on.error = TRUE doesn't stop after the very first error on any worker, because that would force sequential evaluation.

GiuliaPais commented 1 year ago

Thanks for the clarification. Then I guess that, to achieve the behaviour I initially expected, a suitable approach is to wrap the function with purrr::safely and handle any errors downstream if they arise. Thanks again.
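
For anyone landing here later, a minimal sketch of that idea follows. To keep it dependency-free it uses a base-R tryCatch() wrapper in place of purrr::safely (which returns the same result/error pair) and lapply() in place of bplapply(). The names capture_errors() and flaky_task() are hypothetical, invented for this illustration. BiocParallel also provides bptry() and bpok() for inspecting partial results, which may be worth a look as an alternative.

```r
# capture_errors() is a hypothetical base-R stand-in for purrr::safely():
# it returns a function that never throws, yielding list(result, error).
capture_errors <- function(f) {
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

# flaky_task() is a hypothetical task that fails only for some inputs.
flaky_task <- function(num_vec) {
  if (anyNA(num_vec)) stop("missing values in input")
  sum(num_vec)
}

data_list <- list(A = c(1, 2, 3), B = c(1, NA, 5))

# lapply() stands in for BiocParallel::bplapply(X, FUN, BPPARAM = p) here;
# the wrapped function is what would be passed as FUN.
results <- lapply(data_list, capture_errors(flaky_task))

# Handle errors downstream: separate failed tasks from successful ones.
failed <- vapply(results, function(r) !is.null(r$error), logical(1))
ok_values <- lapply(results[!failed], `[[`, "result")
error_messages <- vapply(results[failed],
                         function(r) conditionMessage(r$error), character(1))
```

Because the wrapped function never signals an error, every task returns something and the call as a whole completes; which tasks failed is then read from the error slots.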