HenrikBengtsson / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Failed to receive results from connection on Slurm RStudio Server #633

Closed matthewgson closed 2 years ago

matthewgson commented 2 years ago

Hi,

I've been struggling to find relevant answers that might help solve this problem, and after a lot of googling I feel I'm out of luck. I'm wondering if someone could point out where the problem might be.

I'm running parallel code with future + foreach via doFuture on some simple data.table operations.

I'm using one node with 122 cores on the Slurm cluster, requested with:

#SBATCH --ntasks=1
#SBATCH --output=rserver.log
#SBATCH --nodes=1
#SBATCH --cpus-per-task=122
#SBATCH --mem=1000gb

...

rserver

This launches RStudio Server (open-source edition), and I connect to it over SSH using the connection info written to the rserver.log file:

Create an SSH tunnel with:
ssh -N -L 8080:c0706a-s27.dsfcf:33209 username@servername.edu
Then, open in the local browser:
http://localhost:8080
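
(As a side note: a quick way to double-check that the Slurm allocation is visible from inside the RStudio Server session is a sketch like the one below; this assumes the SLURM_* environment variables set by sbatch are inherited by the rserver process.)

# Sanity-check sketch: inspect the allocation from inside the R session
Sys.getenv("SLURM_CPUS_PER_TASK")          # should report the 122 CPUs requested above, if inherited
parallelly::availableCores(which = "all")  # cores as reported by each detection method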

Below is the setup in my R session:

library(tidyverse)
library(data.table)
library(doFuture)
library(progressr)
library(fst)
library(fasttime)

handlers(global = TRUE)
handlers("progress")
options(future.globals.maxSize = 1e20)  # effectively no limit on exported globals
options(future.gc = TRUE)
availableCores()
availableWorkers()
plan(cluster, workers = 120)
# plan(multisession, workers = 120)  # also tried multisession
registerDoFuture()

The process basically reads lots of CSV files and filters them in parallel. Here's my code:

csv_parser = function(folder_address, root_symbol = NULL, out_path = NULL,  test = FALSE, type = 1){

  # unzip command for each file
  filepath_list = str_c('unzip -p ', list.files(folder_address, full.names = T))

  if (test==TRUE) {filepath_list = filepath_list[1:5]}

  # read file as data.table and append into list
  p <- progressor(along = filepath_list)

  list_df <- foreach(x = seq_along(filepath_list)) %dopar% {

    p(sprintf("x=%g", x))

    DT = fread(cmd = filepath_list[[x]], fill=TRUE)

    if (!is.null(root_symbol)) {
      DT = DT[root %chin% root_symbol]
    }

    gc()
    return(DT)
  }

  if (is.null(out_path)){
    result = rbindlist(list_df, fill=TRUE)
    # setnames(result, clean_names)
    return(result)
  } else {
    full_DT = rbindlist(list_df, fill=TRUE)
    # setnames(full_DT, clean_names)
    write.fst(full_DT, out_path, compress=100)
  }
}

csv_parser(folder_address, root_symbol = snp500_tickers, out_path = 'some/path')
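
As a sanity check, one way to take the parallel backend out of the picture (just a sketch using future's stock sequential backend) is:

# Sketch: run the same call sequentially on a handful of files to rule out
# the parsing code itself before debugging the cluster setup
plan(sequential)
csv_parser(folder_address, root_symbol = snp500_tickers, test = TRUE)
plan(cluster, workers = 120)  # switch back afterwards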

It seems to start up multiple subprocesses, but it crashes within a couple of minutes. The error says:

Error in unserialize(node$con) :
  ClusterFuture (doFuture-2) failed to receive results from cluster RichSOCKnode #2 (PID 56721 on ‘localhost’). 
The reason reported was ‘error reading from connection’. Post-mortem diagnostic: 
The total size of the 9 globals exported is 145.86 KiB. 
The three largest globals are ‘filepath_list’ (101.57 KiB of class ‘character’), ‘root_symbol’ (35.55 KiB of class ‘character’) and ‘p’ (5.38 KiB of class ‘function’)

It doesn't seem to be related to the size of the globals, but I made sure by setting options(future.globals.maxSize = 1e20). I tried both plan(multisession, workers = 120) and plan(cluster, workers = 120), and both yielded the same error.
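
To collect more detail, here is a sketch of what I could try next (assuming the future.debug option and the outfile argument of parallelly::makeClusterPSOCK behave as documented; I have not run this exact setup):

# Sketch: rerun with verbose future diagnostics and keep the workers' output,
# using far fewer workers so the log stays readable
options(future.debug = TRUE)
cl <- parallelly::makeClusterPSOCK(8, outfile = "workers.log")
plan(cluster, workers = cl)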

Here's my sessionInfo():

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RPostgres_1.4.4   dbplyr_2.1.1      fasttime_1.1-0    fst_0.9.8         progressr_0.10.0  doFuture_0.12.2   future_1.26.1     foreach_1.5.2     data.table_1.14.3 forcats_0.5.1     stringr_1.4.0     dplyr_1.0.9       purrr_0.3.4       readr_2.1.2      
[15] tidyr_1.2.0       tibble_3.1.7      ggplot2_3.3.6     tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] nlme_3.1-157        matrixStats_0.62.0  fs_1.5.2            lubridate_1.8.0     bit64_4.0.5         progress_1.2.2      webshot_0.5.3       httr_1.4.3          dreamerr_1.2.3      numDeriv_2016.8-1.1 tools_4.1.2         backports_1.4.1     utf8_1.2.2         
[14] R6_2.5.1            DBI_1.1.2           colorspace_2.0-3    withr_2.5.0         prettyunits_1.1.1   tidyselect_1.1.2    bit_4.0.4           compiler_4.1.2      cli_3.2.0           rvest_1.0.2         xml2_1.3.3          ggthemr_1.1.0       sandwich_3.0-1     
[27] fstcore_0.9.12      scales_1.2.0        fixest_0.10.4       systemfonts_1.0.4   digest_0.6.29       rmarkdown_2.14      svglite_2.1.0       pkgconfig_2.0.3     htmltools_0.5.2     parallelly_1.31.1   fastmap_1.1.0       collapse_1.7.6      rlang_1.0.2        
[40] readxl_1.4.0        rstudioapi_0.13     generics_0.1.2      zoo_1.8-10          jsonlite_1.8.0      magrittr_2.0.3      kableExtra_1.3.4    Formula_1.2-4       Rcpp_1.0.8.3        munsell_0.5.0       fansi_1.0.3         lifecycle_1.0.1     stringi_1.7.6      
[53] yaml_2.3.5          blob_1.2.3          grid_4.1.2          parallel_4.1.2      listenv_0.8.0       crayon_1.5.1        lattice_0.20-45     haven_2.5.0         hms_1.1.1           knitr_1.39          pillar_1.7.0        codetools_0.2-18    reprex_2.0.1       
[66] glue_1.6.2          evaluate_0.15       modelr_0.1.8        vctrs_0.4.1         tzdb_0.3.0          cellranger_1.1.0    gtable_0.3.0        assertthat_0.2.1    xfun_0.31           broom_0.8.0         viridisLite_0.4.0   iterators_1.0.14    globals_0.15.0     
[79] ellipsis_0.3.2     

and here's the number of cores:

> availableCores()
nproc 
  122 
> availableWorkers()
  [1] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [22] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [43] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [64] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
 [85] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[106] "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost" "localhost"
[127] "localhost" "localhost"

I'd greatly appreciate it if you could point out where the problem might be. I've been using future (with doParallel) for a while in the same cluster setting, and it's been working great, but it recently started giving this error.