futureverse / future

:rocket: R package: future: Unified Parallel and Distributed Processing in R for Everyone
https://future.futureverse.org

Nested cluster hangs and errors: Failed to retrieve the value of ClusterFuture (<none>) from cluster SOCKnode #1. The reason reported was ‘error reading from connection’ #370

Closed ercbk closed 4 years ago

ercbk commented 4 years ago

I'm running a nested cross-validation script similar to this one on 2 AWS instances. For smaller nested structures (i.e. fewer folds, resamples, and repeats), everything works fine, but for larger structures with run times of around 13 minutes, future_map2 hangs and eventually ends some time later with an error like "Error in unserialize(node$con) : Failed to retrieve the value of ClusterFuture (<none>) from cluster SOCKnode #1 (PID 18309 on ‘13.48.133.115’). The reason reported was ‘error reading from connection’". I'm estimating the run times by watching the processes over SSH on the instances, but the error doesn't appear until 30 or so minutes after that. This is the configuration I'm using:

cl <- makeClusterPSOCK(

   ## Public IP number of EC2 instance
   public_ip,

   ## User name (always 'ubuntu')
   user = "ubuntu",

   ## Use private SSH key registered with AWS
   rshcmd = c("plink", "-ssh", "-i", ssh_private_key_file),
   rshopts = c(
      "-sshrawlog", "ec2-ssh-raw.log"
   ),

   rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')")
   ),

   dryrun = FALSE,
   verbose = TRUE
)

plan(list(tweak(cluster, workers = cl), multiprocess))

I don't think it can be a RAM issue, because I've run the script on a couple of r5.8xlarge instances and the most RAM that's ever been used is around 8 GB. The SSH logs for failed and successful runs are in my Dropbox. There are also logs in there for a run on 2 t3 instances with smaller nested structures; those logs are shorter and might be easier to look through.
I've also gotten the same error on basic nested structures that run over 13 min. I think this one ran for about 20 min on the t3 instances and errored around 30 min after that.

res <- future_map(

   # Map over the two instances
   .x = c(1, 2), 

   .f = ~ {

      outer_idx <- .x

      future_map(

         # Each instance has 4 cores we can utilize
         .x = c(1:12), 

         .f = ~ {
            inner_idx <- .x
            Sys.sleep(100)
            paste0("Instance: ", outer_idx, " and core: ", inner_idx)
         }
      )

   }
)
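To pin down how long a run lasts before it fails, rather than estimating it from watching processes over SSH, the call can be wrapped in a small timing helper. This is only a sketch in base R; `timed_try()` is a hypothetical name and not part of future or furrr.

```r
# Hypothetical helper: evaluate `expr`, recording the elapsed wall-clock time
# and the error message (if any) instead of stopping on the error.
timed_try <- function(expr) {
  started <- Sys.time()
  outcome <- tryCatch(
    # `expr` is a lazily evaluated argument, so it is forced here, inside tryCatch
    list(value = expr, error = NA_character_),
    error = function(e) list(value = NULL, error = conditionMessage(e))
  )
  outcome$elapsed_secs <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  outcome
}

# Usage (assumes the nested future_map() call above is what gets wrapped):
# run <- timed_try(future_map(c(1, 2), ~ .x))
# run$elapsed_secs; run$error
```

Comparing `elapsed_secs` across failed runs should show whether the error arrives after a roughly constant delay, which would be consistent with a timeout.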

I also get an error even if there's only 5 min of inactivity. I run makeClusterPSOCK and plan, wait five minutes, run the basic nested code above, and get the error "Error in serialize(data, node$con) : error writing to connection". Not sure if that's related or not. I ran this test with and without firewall and antivirus, and the result was the same error. I've also looked through the PuTTY/plink options and issues, and nothing looked relevant to my problem.
This feels like a timeout or some other networking issue, but when I SSH through PuTTY it never times out or disconnects. Is this a non-interactive SSH connection issue? Networking is all magic to me. I would appreciate any help you can offer.
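The five-minute idle test described above can be scripted so it is easy to repeat with different idle periods. The sketch below uses only the base parallel package; `probe_cluster()` is a hypothetical helper, and it assumes `cl` is the cluster object returned by makeClusterPSOCK.

```r
library(parallel)

# Hypothetical helper: stay idle for `idle_seconds`, then ask every worker for
# its process ID. A dropped connection shows up as alive = FALSE plus the
# error message, instead of aborting the script.
probe_cluster <- function(cl, idle_seconds = 0) {
  Sys.sleep(idle_seconds)
  tryCatch(
    {
      pids <- unlist(clusterEvalQ(cl, Sys.getpid()))
      list(alive = TRUE, pids = pids, error = NA_character_)
    },
    error = function(e) {
      list(alive = FALSE, pids = integer(0), error = conditionMessage(e))
    }
  )
}

# Usage: probe_cluster(cl, idle_seconds = 300)
```

Running it with increasing `idle_seconds` should give a rough idea of how long the connection survives without traffic.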

Windows system info:
OS Name: Microsoft Windows 10 Pro
OS Version: 10.0.18362 N/A Build 18362
Network Card(s): 3 NIC(s) Installed.
[01]: 802.11n USB Wireless LAN Card
Connection Name: Wi-Fi
Status: Media disconnected
[02]: Intel(R) Ethernet Connection I217-LM
Connection Name: Ethernet
DHCP Enabled: No

aws nested-cv session info

```r
- Session info ------------------------------------------------------------------------------------------
 setting   value
 version   R version 3.6.2 (2019-12-12)
 os        Windows 10 x64
 system    x86_64, mingw32
 ui        RStudio
 language  (EN)
 collate   English_United States.1252
 ctype     English_United States.1252
 tz        America/New_York
 date      2020-04-14

- Packages ----------------------------------------------------------------------------------------------
 package * version date lib source
 askpass 1.1 2019-01-13 [1] CRAN (R 3.6.1)
 assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.1)
 backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
 base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.6.0)
 bayesplot 1.7.1 2019-12-01 [1] CRAN (R 3.6.2)
 boot 1.3-24 2019-12-20 [1] CRAN (R 3.6.2)
 broom * 0.5.5 2020-02-29 [1] CRAN (R 3.6.3)
 callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.2)
 class 7.3-15 2019-01-01 [2] CRAN (R 3.6.2)
 cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
 clipr 0.7.0 2019-07-23 [1] CRAN (R 3.6.1)
 codetools 0.2-16 2018-12-24 [2] CRAN (R 3.6.2)
 colorspace 1.4-1 2019-03-18 [1] CRAN (R 3.6.1)
 colourpicker 1.0 2017-09-27 [1] CRAN (R 3.6.1)
 crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.1)
 crosstalk 1.1.0.1 2020-03-13 [1] CRAN (R 3.6.3)
 curl 4.3 2019-12-02 [1] CRAN (R 3.6.2)
 data.table * 1.12.8 2019-12-09 [1] CRAN (R 3.6.2)
 desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.1)
 details * 0.2.1 2020-01-12 [1] CRAN (R 3.6.2)
 dials * 0.0.4 2019-12-02 [1] CRAN (R 3.6.2)
 DiceDesign 1.8-1 2019-07-31 [1] CRAN (R 3.6.1)
 digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.2)
 dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.3)
 DT 0.13 2020-03-23 [1] CRAN (R 3.6.3)
 dtplyr * 1.0.1 2020-01-23 [1] CRAN (R 3.6.2)
 dygraphs 1.1.1.6 2018-07-11 [1] CRAN (R 3.6.1)
 ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
 evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.1)
 fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
 fastmap 1.0.1 2019-10-08 [1] CRAN (R 3.6.1)
 foreach 1.4.8 2020-02-09 [1] CRAN (R 3.6.2)
 forge 0.2.0 2019-02-26 [1] CRAN (R 3.6.1)
 furrr * 0.1.0 2018-05-16 [1] CRAN (R 3.6.1)
 future * 1.16.0 2020-01-16 [1] CRAN (R 3.6.2)
 generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.1)
 ggplot2 * 3.3.0.9000 2020-04-04 [1] Github (tidyverse/ggplot2@bca6105)
 ggridges 0.5.2 2020-01-12 [1] CRAN (R 3.6.2)
 globals 0.12.5 2019-12-07 [1] CRAN (R 3.6.1)
 glue * 1.4.0 2020-04-03 [1] CRAN (R 3.6.2)
 gower 0.2.1 2019-05-14 [1] CRAN (R 3.6.1)
 GPfit 1.0-8 2019-02-08 [1] CRAN (R 3.6.2)
 gridExtra 2.3 2017-09-09 [1] CRAN (R 3.6.1)
 gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.1)
 gtools 3.8.1 2018-06-26 [1] CRAN (R 3.6.0)
 htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
 htmlwidgets 1.5.1 2019-10-08 [1] CRAN (R 3.6.1)
 httpuv 1.5.2 2019-09-11 [1] CRAN (R 3.6.1)
 httr 1.4.1 2019-08-05 [1] CRAN (R 3.6.1)
 igraph 1.2.5 2020-03-19 [1] CRAN (R 3.6.3)
 infer * 0.5.1 2019-11-19 [1] CRAN (R 3.6.2)
 ini 0.3.1 2018-05-20 [1] CRAN (R 3.6.1)
 inline 0.3.15 2018-05-18 [1] CRAN (R 3.6.1)
 ipred 0.9-9 2019-04-28 [1] CRAN (R 3.6.1)
 iterators 1.0.12 2019-07-26 [1] CRAN (R 3.6.1)
 janeaustenr 0.1.5 2017-06-10 [1] CRAN (R 3.6.1)
 jsonlite 1.6.1 2020-02-02 [1] CRAN (R 3.6.2)
 knitr 1.28 2020-02-06 [1] CRAN (R 3.6.2)
 later 1.0.0 2019-10-04 [1] CRAN (R 3.6.1)
 lattice 0.20-38 2018-11-04 [2] CRAN (R 3.6.2)
 lava 1.6.7 2020-03-05 [1] CRAN (R 3.6.3)
 lhs 1.0.1 2019-02-03 [1] CRAN (R 3.6.1)
 lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
 listenv 0.8.0 2019-12-05 [1] CRAN (R 3.6.2)
 lme4 1.1-21 2019-03-05 [1] CRAN (R 3.6.1)
 loo 2.2.0 2019-12-19 [1] CRAN (R 3.6.2)
 lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.1)
 magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.1)
 markdown 1.1 2019-08-07 [1] CRAN (R 3.6.1)
 MASS 7.3-51.4 2019-03-31 [2] CRAN (R 3.6.2)
 Matrix 1.2-18 2019-11-27 [2] CRAN (R 3.6.2)
 matrixStats 0.56.0 2020-03-13 [1] CRAN (R 3.6.3)
 mime 0.9 2020-02-04 [1] CRAN (R 3.6.2)
 miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 3.6.1)
 minqa 1.2.4 2014-10-09 [1] CRAN (R 3.6.1)
 mlflow * 1.7.0 2020-03-03 [1] CRAN (R 3.6.3)
 munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.1)
 nlme 3.1-145 2020-03-04 [1] CRAN (R 3.6.3)
 nloptr 1.2.2.1 2020-03-11 [1] CRAN (R 3.6.3)
 nnet 7.3-12 2016-02-02 [2] CRAN (R 3.6.2)
 openssl 1.4.1 2019-07-18 [1] CRAN (R 3.6.1)
 pacman 0.5.1 2019-03-11 [1] CRAN (R 3.6.1)
 parsnip * 0.0.5 2020-01-07 [1] CRAN (R 3.6.2)
 pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.2)
 pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
 pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1)
 plyr 1.8.6 2020-03-03 [1] CRAN (R 3.6.3)
 png 0.1-7 2013-12-03 [1] CRAN (R 3.6.0)
 prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
 pROC 1.16.2 2020-03-19 [1] CRAN (R 3.6.3)
 processx 3.4.2 2020-02-09 [1] CRAN (R 3.6.2)
 prodlim 2019.11.13 2019-11-17 [1] CRAN (R 3.6.2)
 promises 1.1.0 2019-10-04 [1] CRAN (R 3.6.1)
 ps 1.3.2 2020-02-13 [1] CRAN (R 3.6.3)
 purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
 R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
 ranger * 0.12.1 2020-01-10 [1] CRAN (R 3.6.2)
 Rcpp 1.0.4 2020-03-17 [1] CRAN (R 3.6.3)
 recipes * 0.1.10 2020-03-18 [1] CRAN (R 3.6.3)
 reshape2 1.4.3 2017-12-11 [1] CRAN (R 3.6.1)
 reticulate 1.14 2019-12-17 [1] CRAN (R 3.6.2)
 rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.3)
 rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.2)
 rpart 4.1-15 2019-04-12 [2] CRAN (R 3.6.2)
 rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.3)
 RPushbullet * 0.3.3 2020-01-19 [1] CRAN (R 3.6.2)
 rsample * 0.0.5 2019-07-12 [1] CRAN (R 3.6.1)
 rsconnect 0.8.16 2019-12-13 [1] CRAN (R 3.6.2)
 rstan 2.19.3 2020-02-11 [1] CRAN (R 3.6.3)
 rstanarm 2.19.3 2020-02-11 [1] CRAN (R 3.6.3)
 rstantools 2.0.0 2019-09-15 [1] CRAN (R 3.6.1)
 rstudioapi 0.11 2020-02-07 [1] CRAN (R 3.6.3)
 scales * 1.1.0 2019-11-18 [1] CRAN (R 3.6.2)
 sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.1)
 shiny 1.4.0.2 2020-03-13 [1] CRAN (R 3.6.3)
 shinyjs 1.1 2020-01-13 [1] CRAN (R 3.6.2)
 shinystan 2.5.0 2018-05-01 [1] CRAN (R 3.6.1)
 shinythemes 1.1.2 2018-11-06 [1] CRAN (R 3.6.1)
 SnowballC 0.6.0 2019-01-15 [1] CRAN (R 3.6.0)
 StanHeaders 2.21.0-1 2020-01-19 [1] CRAN (R 3.6.2)
 stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
 stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.1)
 survival 3.1-11 2020-03-07 [1] CRAN (R 3.6.3)
 swagger 3.9.2 2018-03-23 [1] CRAN (R 3.6.0)
 threejs 0.3.3 2020-01-21 [1] CRAN (R 3.6.2)
 tibble * 3.0.0 2020-03-30 [1] CRAN (R 3.6.2)
 tictoc * 1.0 2014-06-17 [1] CRAN (R 3.6.0)
 tidymodels * 0.1.0 2020-02-16 [1] CRAN (R 3.6.3)
 tidyposterior 0.0.2 2018-11-15 [1] CRAN (R 3.6.1)
 tidypredict 0.4.5 2020-02-10 [1] CRAN (R 3.6.3)
 tidyr * 1.0.2 2020-01-24 [1] CRAN (R 3.6.2)
 tidyselect 1.0.0 2020-01-27 [1] CRAN (R 3.6.2)
 tidytext 0.2.3 2020-03-04 [1] CRAN (R 3.6.3)
 timeDate 3043.102 2018-02-21 [1] CRAN (R 3.6.0)
 tokenizers 0.2.1 2018-03-29 [1] CRAN (R 3.6.1)
 tune * 0.0.1 2020-01-02 [1] Github (tidymodels/tune@e044702)
 vctrs 0.2.4 2020-03-10 [1] CRAN (R 3.6.3)
 withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.1)
 workflows * 0.1.1 2020-03-17 [1] CRAN (R 3.6.3)
 xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
 xml2 1.2.5 2020-03-11 [1] CRAN (R 3.6.3)
 xtable 1.8-4 2019-04-21 [1] CRAN (R 3.6.1)
 xts 0.12-0 2020-01-19 [1] CRAN (R 3.6.2)
 yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
 yardstick * 0.0.6 2020-03-17 [1] CRAN (R 3.6.3)
 zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.1)
 zoo 1.8-7 2020-01-10 [1] CRAN (R 3.6.2)

[1] C:/Users/tbats/Documents/R/win-library/3.6
[2] C:/Program Files/R/R-3.6.2/library
```


ercbk commented 4 years ago

PuTTY has a section "Sending of null packets to keep session active" with the setting "Seconds between keepalives". There's also "Low level TCP connection options" with the setting "Enable TCP keepalives". I'm wondering if these would help and if there's a way to take advantage of them through plink.

HenrikBengtsson commented 4 years ago

Maybe you could use the progressr package to signal progress updates from within your future_map() .f expression to keep the connection alive. Such progress updates are relayed back to the main R session as they are produced, cf. https://cran.r-project.org/web/packages/progressr/vignettes/progressr-intro.html. It's a bit of a hack, but at least it will give you some clues as to whether it is an SSH timeout or not.
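A minimal sketch of that idea, assuming the progressr, furrr, and future packages are installed: `map_with_heartbeat()` and its arguments are hypothetical names, and `Sys.sleep()` stands in for a slice of real work. How promptly the updates are relayed may depend on the backend and the package versions in use.

```r
library(future)
library(furrr)
library(progressr)

# Hypothetical helper: each element does its work in `n_ticks` short slices
# and signals one progress update after each slice, so that some traffic is
# produced during a long-running element.
map_with_heartbeat <- function(xs, n_ticks = 5, tick_secs = 20) {
  with_progress({
    p <- progressor(steps = length(xs) * n_ticks)
    future_map(xs, function(x) {
      for (i in seq_len(n_ticks)) {
        Sys.sleep(tick_secs)                   # stand-in for real work
        p(sprintf("x = %s, tick %d", x, i))    # progress signal
      }
      x
    })
  })
}

# Usage (uses whatever plan() is currently set):
# res <- map_with_heartbeat(1:12, n_ticks = 5, tick_secs = 20)
```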

ercbk commented 4 years ago

I'm not ready to call victory yet, but I have a strong candidate for a solution. It turns out you can use those PuTTY settings non-interactively with plink. I ran the basic nested code from the original post (with Sys.sleep(160)) using the config below, and it finished after 16 min with no problems.

cl <- future::makeClusterPSOCK(

   ## Public IP number of EC2 instance
   public_ip,

   ## User name (always 'ubuntu')
   user = "ubuntu",

   ## Use private SSH key registered with AWS
   rshcmd = c("plink", "-ssh", "-load", "futureSettings", "-i", ssh_private_key_file),

   rscript_args = c("-e", shQuote(".libPaths('/home/rstudio/R/x86_64-pc-linux-gnu-library/3.6')")
   ),
   verbose = TRUE
)

The -load flag loads a saved PuTTY session (e.g. futureSettings), which can include probably everything listed in the rshcmd argument (and maybe the user argument too), but most importantly, the setting(s) that send those null packets every so often. Here are the steps to set it up.

  1. Open PuTTY
  2. Connection (left panel) --> Sending of null packets to keep session active (main body) --> Seconds between keepalives --> enter how many seconds you want (I chose 60) --> tick the box for Enable TCP keepalives
  3. Session (left panel) --> Saved Sessions (main body) --> In the narrow box, enter the name you want to save the settings under (I chose futureSettings) --> click save
  4. Exit by clicking the X or Cancel
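Once the session is saved, the rshcmd vector from the config above can be built from the session name, which makes it easy to switch between saved sessions. This is a sketch in base R only; `plink_rshcmd()` is a hypothetical helper, and the flags are the ones already used in the config above.

```r
# Hypothetical helper: build the plink command vector for makeClusterPSOCK(),
# loading a saved PuTTY session that carries the keepalive settings.
plink_rshcmd <- function(session, key_file) {
  stopifnot(
    is.character(session), length(session) == 1L, nzchar(session),
    is.character(key_file), length(key_file) == 1L, nzchar(key_file)
  )
  c("plink", "-ssh", "-load", session, "-i", key_file)
}

# Usage (public_ip and ssh_private_key_file as defined in the original config):
# cl <- future::makeClusterPSOCK(
#    public_ip,
#    user = "ubuntu",
#    rshcmd = plink_rshcmd("futureSettings", ssh_private_key_file),
#    verbose = TRUE
# )
```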

Also, I did try progressr with future_map, and it kind of worked but not as intended. The progress bar displayed and completed, but all at once and at the beginning portion of the execution.

Still have to try the PuTTY settings with the nested-cv script. I'll update this post once that happens.

Update: It works on my nested-cv script! My previous record for longest, successfully completed run was 4.05 minutes. I've now had successful runs of 12 minutes and 21.07 minutes. It's lookin' good. I'm going to go ahead and close the issue. Thank you for the package, Henrik.