HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

MultisessionFuture () failed to call gassign() on cluster ...... #597

Closed elgabbas closed 2 years ago

elgabbas commented 2 years ago

Hello,

I am trying to use the furrr package, but I think my problem is related to future.

First, when I set up parallel execution with the following command, I receive the warnings below. I am unsure whether I can ignore them or whether there is an issue I need to solve.

plan(multisession, workers = 20)

During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
During startup - 2: package ‘stats’ in options("defaultPackages") was not found
Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning message:
package ‘utils’ in options("defaultPackages") was not found
During startup - Warning messages:
1: package ‘utils’ in options("defaultPackages") was not found
2: package ‘stats’ in options("defaultPackages") was not found
During startup - Warning message:
package ‘utils’ in options("defaultPackages") was not found
During startup - Warning message:
package ‘utils’ in options("defaultPackages") was not found
During startup - Warning message:
package ‘utils’ in options("defaultPackages") was not found
During startup - Warning message:
package ‘utils’ in options("defaultPackages") was not found
Error: Initialization of plan() failed, because the test future used for validation failed. The reason was: Package 'future' is not installed on worker (r_version: 4.1.2 (2021-11-01); platform: x86_64-conda-linux-gnu (64-bit); os: Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018; hostname: prod-0316)

Then, when I use the future_map() function from the furrr package, I receive the following error:

Error: Problem with mutate() column Preds.
Preds = future_map(...).
MultisessionFuture () failed to call gassign() on cluster RichSOCKnode #2 (PID 114424 on ‘localhost’). The reason reported was ‘error reading from connection’. Post-mortem diagnostic: The total size of the 15 globals exported is 14.99 MiB. The three largest globals are ‘...furrr_chunk_args’ (14.35 MiB of class ‘list’), ‘Models_WS’ (374.44 KiB of class ‘list’) and ‘predictMaxEnt’ (185.81 KiB of class ‘function’)
Backtrace:

  1. ├─global::ExtractPreds(Data = CurrD)
  2. │ └─%>%(...)
  3. ├─tidyr::unnest(., cols = c(data))
  4. ├─dplyr::rename_at(...)
  5. │ └─dplyr:::tbl_at_vars(.tbl, .vars, .include_group_vars = TRUE)
  6. │ └─dplyr::tbl_vars(tbl)
  7. │ ├─dplyr:::new_sel_vars(tbl_vars_dispatch(x), group_vars(x))
  8. │ │ └─base::structure(...)
  9. │ └─dplyr:::tbl_vars_dispatch(x)
    1. ├─tidyr::unnest(., cols = c(Preds))
    2. ├─dplyr::mutate(...)
    3. ├─dplyr:::mutate.data.frame(...)
    4. │ └─dplyr:::mutate_cols(.data, ..., caller_env = caller_env())
    5. │ ├─base::withCallingHandlers(...)
    6. │ └─mask$eval_all_mutate(quo)
    7. └─furrr::future_map(...)
    8. └─furrr:::furrr_map_template(...)
    9. └─furrr:::furrr_template(...)
    10. └─future::future(...)
    11. ├─future::run(future)
    12. └─future:::run.Future(future)
    13. ├─future::run(future)
    14. └─future:::run.ClusterFuture(future)
    15. ├─base::suppressWarnings(...)
    16. │ └─base::withCallingHandlers(...)
    17. └─future:::cluster_call(...)
Execution halted
Warning message:
system call failed: Cannot allocate memory

Please note that I am running this code on an HPC system (CentOS Linux 7) with a total of 110 GB of RAM and 36 nodes. I installed R (v4.1.2) with conda. The same code worked well on other R installations but fails on this computer. Here are the results of the session info:

sessionInfo()

R version 4.1.2 (2021-11-01)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/----/-----/.conda/envs/REnv2/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] tibble_3.1.6 readr_2.1.2 furrr_0.2.3 future_1.24.0 purrr_0.3.4 dismo_1.3-5 raster_3.5-15 sp_1.4-6 sf_1.0-6 stringr_1.4.0
[11] tidyr_1.2.0 dplyr_1.0.7 snow_0.4-4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8 pillar_1.7.0 compiler_4.1.2 class_7.3-20 tools_4.1.2 digest_0.6.29 lifecycle_1.0.1 lattice_0.20-45
 [9] pkgconfig_2.0.3 rlang_0.4.12 cli_3.2.0 DBI_1.1.2 e1071_1.7-9 terra_1.5-21 hms_1.1.1 globals_0.14.0
[17] generics_0.1.2 vctrs_0.3.8 classInt_0.4-3 grid_4.1.2 tidyselect_1.1.1 glue_1.6.2 listenv_0.8.0 R6_2.5.1
[25] parallelly_1.30.0 fansi_1.0.2 tzdb_0.2.0 magrittr_2.0.2 codetools_0.2-18 ellipsis_0.3.2 units_0.8-0 assertthat_0.2.1
[33] utf8_1.2.2 KernSmooth_2.23-20 stringi_1.7.6 proxy_0.4-26 crayon_1.5.0

The same issue occurred with other strategies (multisession, multicore, cluster). I do not think this is a memory issue, because the same code runs well on nodes with the same specifications using an older version of R (not installed via conda). Could this be related to conda?

Any suggestions?

Thanks in advance, Ahmed
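On the first error ("Package 'future' is not installed on worker"): it suggests the workers may be starting with a different R installation or library path than the main session. A minimal diagnostic sketch using only base R's parallel package; describe_r() is a hypothetical helper, and the check assumes that a plain PSOCK worker launched from the same session is representative of the multisession workers:

```r
# Sketch only: compare the main R session with one freshly launched PSOCK
# worker. describe_r() is a hypothetical helper, not part of any package.
library(parallel)

describe_r <- function() {
  list(
    r_home     = R.home(),                               # R installation in use
    lib_paths  = .libPaths(),                            # package library search path
    has_future = nzchar(system.file(package = "future")) # is 'future' visible here?
  )
}

main_info <- describe_r()

cl <- makePSOCKcluster(1)                        # one local worker is enough for the check
worker_info <- clusterCall(cl, describe_r)[[1]]  # run the same helper on the worker
stopCluster(cl)

cat("Same R home:        ", identical(main_info$r_home, worker_info$r_home), "\n")
cat("Same library paths: ", identical(main_info$lib_paths, worker_info$lib_paths), "\n")
cat("'future' on worker: ", worker_info$has_future, "\n")
```

If the library paths differ, or 'future' is not visible on the worker, that would point toward the conda environment's library not reaching the worker processes rather than toward the parallel code itself.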

orozcoae89 commented 2 years ago

Can you test the following code? It is possible that you need the cl object:

library(future)
workers <- rep("localhost", 4)
cl <- makeClusterPSOCK(workers, revtunnel = TRUE, outfile = "")
# starting worker pid=134xx on localhost:11xxx at 13:46:34.292
# starting worker pid=134xx on localhost:11xxx at 13:46:34.580
# starting worker pid=134xx on localhost:11xxx at 13:46:34.817
# starting worker pid=134xx on localhost:11xxx at 13:46:35.050
# Use the cluster object 'cl' created above as the outer layer of workers.
# plan() returns the previous plan invisibly, so call plan() again to print the new one.
oplan <- plan(list(tweak(cluster, workers = cl), multisession))
plan()

If this does not solve the problem, please send us a short reproducible example so we can help you.

Erick

elgabbas commented 2 years ago

Hello,

Sometimes I also receive this error:

Fatal error: couldn't allocate memory for pointer stack
Fatal error: couldn't allocate memory for pointer stack
Error: cannot allocate vector of size 9 Kb

caught segfault
address (nil), cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...

caught segfault
address (nil), cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...

caught segfault
address (nil), cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...

caught segfault
address (nil), cause 'memory not mapped'
An irrecoverable exception occurred. R is aborting now ...

caught segfault
address (nil), cause 'memory not mapped'
Error: memory exhausted (limit reached?)
Execution halted
Error: memory exhausted (limit reached?)
Execution halted
Warning message:
system call failed: Cannot allocate memory

caught segfault
address 0x5555559b9568, cause 'invalid permissions'
An irrecoverable exception occurred. R is aborting now ...

Thanks, Ahmed

elgabbas commented 2 years ago

Thanks @orozcoae89 for your reply. Here are the results of the script you provided:

workers <- rep("localhost", 4)
cl <- makeClusterPSOCK(workers, revtunnel = TRUE, outfile = "")

starting worker pid=175624 on localhost:11798 at 13:59:27.898
starting worker pid=175627 on localhost:11798 at 13:59:27.898
starting worker pid=175625 on localhost:11798 at 13:59:27.899
starting worker pid=175626 on localhost:11798 at 13:59:27.907

plan <- plan(list(future::tweak(cluster, workers=workers), multisession))
plan

List of future strategies:

  1. sequential:
    • args: function (..., envir = parent.frame())
    • tweaked: FALSE
    • call: NULL

I will now run my code with this setup and see whether it works.

Cheers, Ahmed

orozcoae89 commented 2 years ago

This error is common when the connections to remote machines have not been set up. Can you give us more details about your code so we can help you?

Erick

elgabbas commented 2 years ago

After running my script, I receive a similar error:

Error: Problem with mutate() column Preds.
ℹ Preds = future_map(...).
✖ ClusterFuture () failed to call gassign() on cluster RichSOCKnode #3 (PID 175879 on ‘localhost’). The reason reported was ‘error writing to connection’. Post-mortem diagnostic: The total size of the 15 globals exported is 72.55 MiB. The three largest globals are ‘...furrr_chunk_args’ (71.76 MiB of class ‘list’), ‘Models_WS’ (374.44 KiB of class ‘list’) and ‘predictMaxEnt’ (331.60 KiB of class ‘function’)

My code is simple: I use the future_map() function from the furrr package to create a new column with mutate(). However, the data is huge, with ~1M observations. Still, the same script worked well in RStudio Server on the HPC (same specifications), but now I need to run it over different datasets and therefore have to batch the jobs as separate R runs.

Cheers, Ahmed
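With ~1M observations, one possible way to reduce how much each parallel call has to export is to split the input into smaller batches. This is a hedged sketch, not the fix confirmed in this thread: dat, batch_size, predict_batch() and the x * 2 "prediction" are hypothetical placeholders for the real data and for predictMaxEnt(), and the batches are processed with lapply() where the real script would call furrr::future_map() inside predict_batch():

```r
# Sketch only: process a large input in batches so that each parallel call
# carries a smaller chunk. All names here are hypothetical placeholders.
n <- 1e6
dat <- data.frame(id = seq_len(n), x = runif(n))

batch_size <- 2e5
batch_id <- ceiling(seq_len(nrow(dat)) / batch_size)  # 1, 1, ..., 5
batches <- split(dat, batch_id)

# Compare the memory footprint of the full input with that of one batch
print(object.size(dat), units = "auto")
print(object.size(batches[[1]]), units = "auto")

predict_batch <- function(b) {
  # Placeholder computation. In the real script, furrr::future_map() would be
  # called here on the rows of 'b' only, instead of on all rows at once.
  b$pred <- b$x * 2
  b
}

results <- lapply(batches, predict_batch)  # one batch at a time
out <- do.call(rbind, results)             # recombine in the original order
rownames(out) <- NULL
```

The trade-off is more, smaller rounds of parallel work, in exchange for a lower peak amount of data held by the main session and the workers for any single call.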

orozcoae89 commented 2 years ago

Well, I think you could improve your code, because this error indicates that your globals exceed the size limit set by default. More details: https://www.rdocumentation.org/packages/future/versions/1.24.0/topics/future.options. Try this solution: options(future.globals.maxSize = ??)
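The future.globals.maxSize option takes a size in bytes, and per the future.options documentation its default is 500 MiB (500 * 1024^2 bytes). A minimal sketch, where the 1 GiB value is only an illustrative assumption, not a recommended setting. Note that the post-mortem in the previous comment reported 72.55 MiB of globals, which is already below the default limit, so raising it may not by itself resolve the 'error writing to connection' failure:

```r
# Sketch only: the limit is given in bytes; 1 GiB is an illustrative value.
limit_bytes <- 1 * 1024^3
options(future.globals.maxSize = limit_bytes)

# Total size of globals reported in this thread (72.55 MiB), in bytes,
# compared with the documented default limit of 500 MiB.
reported_bytes <- 72.55 * 1024^2
default_bytes  <- 500 * 1024^2
reported_bytes < default_bytes                        # already under the default
reported_bytes < getOption("future.globals.maxSize")  # and under the new limit
```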