HenrikBengtsson / parallelly

R package: parallelly - Enhancing the 'parallel' Package
https://parallelly.futureverse.org

makeClusterPSOCK(): `manual = TRUE` works; `manual = FALSE` causes process timeout - Windows 10 #74

Closed dmolitor closed 2 years ago

dmolitor commented 2 years ago

Problem Background

Over the past several days I have been attempting to use parallelly::makeClusterPSOCK to create a remote cluster. I have been utterly failing and am unable to trace the root cause. The basic problem is as follows: when I set the manual argument to TRUE, I can successfully ssh into my remote machine and execute the $ Rscript ... parallel:::.workRSOCK() ... command, which then successfully connects the remote worker to the local host and allows me to execute code just fine. However, when I change the manual argument to FALSE, the process times out and fails to make the connection. The reprex outlining this full problem, as well as my session info, is attached below.

# Manually Launch Cluster -------------------------------------------------

# Manually create cluster
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = TRUE,
  verbose = TRUE,
  outfile = ""
)
#> ----------------------------------------------------------------------
#> Manually, (i) login into external machine '35.85.231.185':
#> 
#>   "C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185
#> 
#> and (ii) start worker #1 from there:
#> 
#>   "Rscript" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "options(socketOptions = \"no-delay\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" MASTER=localhost PORT=11274 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
#> 
#> Alternatively, start worker #1 from the local machine by combining both step in a single call:
#> 
#>   "C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 "\"Rscript\" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e \"options(socketOptions = \\"no-delay\\")\" -e \"workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()\" MASTER=localhost PORT=11274 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
#> starting worker pid=34862 on localhost:11274 at 22:01:28.998

# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#>                                               sysname 
#>                                               "Linux" 
#>                                               release 
#>                                     "5.11.0-1021-aws" 
#>                                               version 
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021" 
#>                                              nodename 
#>                                    "ip-172-31-53-244" 
#>                                               machine 
#>                                              "x86_64" 
#>                                                 login 
#>                                              "ubuntu" 
#>                                                  user 
#>                                              "ubuntu" 
#>                                        effective_user 
#>                                              "ubuntu"
# Kill cluster
parallel::stopCluster(cl)

# Automatically Launch Cluster --------------------------------------------

# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = FALSE,
  verbose = TRUE,
  outfile = "", 
  ## Options below to make it not retry forever
  connectTimeout = 45,
  tries = 1
)
#> Error in socketConnection(localhostHostname, port = port, server = TRUE, : Failed to launch and connect to R worker on remote machine '35.85.231.185' from local machine 'RIPL-89672'.
#>  * The error produced by socketConnection() was: 'reached elapsed time limit' (which suggests that the connection timeout of 45 seconds (argument 'connectTimeout') kicked in)
#>  * The localhost socket connection that failed to connect to the R worker used port 11274 using a communication timeout of 2592000 seconds and a connection timeout of 45 seconds.
#>  * Worker launch call: "C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 "\"Rscript\" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e \"options(socketOptions = \\"no-delay\\")\" -e \"workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()\" MASTER=localhost PORT=11274 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=45 SETUPSTRATEGY=sequential".
#>  * Troubleshooting suggestions:
#>    - Suggestion #1: On Windows, output from worker when using 'outfile=NULL' is only visible when running R from a terminal (not a GUI).
#>    - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for 'C:\Windows\System32\OpenSSH\ssh.exe'.
#>    - Suggestion #3: The 'rshcmd' ('C:\Windows\System32\OpenSSH\ssh.exe' [type='ssh', version='OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2']) used may not support reverse tunneling (revtunnel = TRUE). See ?parallelly::makeClusterPSOCK for alternatives.
#> 
#> 
#>  * Number of attempts: 1 (15s delay)

# Session Info for good measure -------------------------------------------
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.6.4      compiler_4.1.2    highr_0.9         R.methodsS3_1.8.1
#>  [5] R.utils_2.11.0    tools_4.1.2       digest_0.6.28     evaluate_0.14    
#>  [9] lifecycle_1.0.1   tibble_3.1.6      R.cache_0.15.0    pkgconfig_2.0.3  
#> [13] rlang_0.4.12      reprex_2.0.1      yaml_2.2.1        parallel_4.1.2   
#> [17] xfun_0.28         fastmap_1.1.0     furrr_0.2.3       withr_2.4.2      
#> [21] styler_1.6.2      stringr_1.4.0     knitr_1.36        fs_1.5.0         
#> [25] vctrs_0.3.8       globals_0.14.0    glue_1.5.0        listenv_0.8.0    
#> [29] fansi_0.5.0       parallelly_1.29.0 rmarkdown_2.11    purrr_0.3.4      
#> [33] magrittr_2.0.1    backports_1.3.0   codetools_0.2-18  ellipsis_0.3.2   
#> [37] htmltools_0.5.2   future_1.23.0     utf8_1.2.2        stringi_1.7.5    
#> [41] crayon_1.4.2      R.oo_1.24.0

What I've attempted

To attempt to address this, I have looked through a number of related threads including this SO thread, #14, #7, among others, and have failed to find anything that solves the issue. I have also tried using PuTTY instead of the Windows default OpenSSH with the same exact result.

Help

Deep down I feel like this may not be a parallelly bug at all (I apologize if that's the case), but I'm at a complete loss and any help would be greatly appreciated.

HenrikBengtsson commented 2 years ago

Have you verified that you can launch Rscript over SSH, cf. https://parallelly.futureverse.org/reference/makeClusterPSOCK.html#failing-to-set-up-remote-workers?

dmolitor commented 2 years ago

Yep! For example:

{local}$ ssh -l ubuntu -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version
R scripting front-end version 4.1.2 (2021-11-01)
{local}$

HenrikBengtsson commented 2 years ago

Good. And what if you add the reverse SSH tunneling for the port you specify?

ssh -R 11274:127.0.0.1:11274 -l ubuntu -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version

And what if you use the full set of CLI options as shown in the manual = TRUE output?

"C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version

dmolitor commented 2 years ago

Yep, adding the reverse SSH tunneling as well as the whole 9 yards of CLI options both correctly return

R scripting front-end version 4.1.2 (2021-11-01)

HenrikBengtsson commented 2 years ago

Thanks. What happens if you set:

options(parallelly.makeNodePSOCK.socketOptions = "NULL")  ## the quotes around NULL are critical

first?

dmolitor commented 2 years ago

That does it! Here's the output when setting the option prior to running:

# Set options accordingly
options(parallelly.makeNodePSOCK.socketOptions = "NULL")

# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = FALSE,
  verbose = TRUE,
  outfile = "", 
  ## Options below to make it not retry forever
  connectTimeout = 45,
  tries = 1
)

# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#>                                               sysname 
#>                                               "Linux" 
#>                                               release 
#>                                     "5.11.0-1021-aws" 
#>                                               version 
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021" 
#>                                              nodename 
#>                                    "ip-172-31-53-244" 
#>                                               machine 
#>                                              "x86_64" 
#>                                                 login 
#>                                              "ubuntu" 
#>                                                  user 
#>                                              "ubuntu" 
#>                                        effective_user 
#>                                              "ubuntu"
# Kill cluster
parallel::stopCluster(cl)

And then just to make sure I'm not insane 😅:

# Confirm it still fails without setting the option???

# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = FALSE,
  verbose = TRUE,
  outfile = "", 
  ## Options below to make it not retry forever
  connectTimeout = 45,
  tries = 1
)
#> Error in socketConnection(localhostHostname, port = port, server = TRUE, : Failed to launch and connect to R worker on remote machine '35.85.231.185' from local machine 'RIPL-89672'.
#>  * The error produced by socketConnection() was: 'reached elapsed time limit' (which suggests that the connection timeout of 45 seconds (argument 'connectTimeout') kicked in)
#>  * The localhost socket connection that failed to connect to the R worker used port 11274 using a communication timeout of 2592000 seconds and a connection timeout of 45 seconds.
#>  * Worker launch call: "C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 "\"Rscript\" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e \"options(socketOptions = \\"no-delay\\")\" -e \"workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()\" MASTER=localhost PORT=11274 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=45 SETUPSTRATEGY=sequential".
#>  * Troubleshooting suggestions:
#>    - Suggestion #1: On Windows, output from worker when using 'outfile=NULL' is only visible when running R from a terminal (not a GUI).
#>    - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for 'C:\Windows\System32\OpenSSH\ssh.exe'.
#>    - Suggestion #3: The 'rshcmd' ('C:\Windows\System32\OpenSSH\ssh.exe' [type='ssh', version='OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2']) used may not support reverse tunneling (revtunnel = TRUE). See ?parallelly::makeClusterPSOCK for alternatives.
#> 
#> 
#>  * Number of attempts: 1 (15s delay)

Thanks a bunch! Can you give me some insight as to what is happening here, because to be honest I have pretty much no clue what setting that option changes?

HenrikBengtsson commented 2 years ago

Great.

It's a bug in parallelly 1.29.0 (the most recent version) that kicks in when one launches remote workers from an MS Windows machine. In parallelly 1.29.0, makeClusterPSOCK() sets the R option socketOptions="no-delay" (default) on the workers. It does so by launching the remote worker using:

Rscript ... -e 'options(socketOptions="no-delay")' ...

Now, when passing that R expression via SSH, one needs to make sure to use proper quotes, because the whole Rscript ... call is itself quoted, e.g.

ssh ... "Rscript ... -e 'options(socketOptions=\"no-delay\")' ..."

This works correctly when launching remote workers from Linux/macOS, but when done from MS Windows, we get:

ssh ... "Rscript ... -e \"options(socketOptions=\\"no-delay\\")\" ..."

You can see this happening for you in the output under "Alternatively, start worker #1 ...".

I use shQuote() to juggle the shell quotes. However, the problem is that one needs to use shQuote(..., type = "sh") when launching remote Linux workers, which is the default when calling shQuote() from a Linux machine. But on MS Windows, it defaults to shQuote(..., type = "cmd"), which causes this issue. In short, I forgot to test this from MS Windows.
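
A minimal sketch of the quoting difference described here, using only base R's shQuote(). The helper name quote_twice() is hypothetical and the command is simplified to a bare Rscript -e call; it is not how parallelly builds its launch call internally. It only illustrates how the same expression comes out when quoted twice with each type:

```r
# The R expression that must reach the remote Rscript intact
expr <- 'options(socketOptions = "no-delay")'

# Hypothetical helper: quote once for the Rscript call,
# then once more for the enclosing ssh call
quote_twice <- function(expr, type) {
  inner <- paste("Rscript -e", shQuote(expr, type = type))
  shQuote(inner, type = type)
}

# POSIX-shell quoting (the default on Linux/macOS)
sh_cmd <- quote_twice(expr, type = "sh")

# Windows cmd quoting (the default on MS Windows)
cmd_cmd <- quote_twice(expr, type = "cmd")

cat("sh :", sh_cmd, "\n")
cat("cmd:", cmd_cmd, "\n")
```

With type = "sh" the expression stays inside single quotes, whereas type = "cmd" yields nested escaped double quotes of the \\"no-delay\\" kind that appear in the failing worker launch call.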

I'll fix this for the next release of parallelly. Thanks for reporting - it'll save others the same struggles you went through.

HenrikBengtsson commented 2 years ago

Forgot to say: when you set options(parallelly.makeNodePSOCK.socketOptions = "NULL"), the whole -e 'options(socketOptions="no-delay")' part is dropped from the worker launch call. That way, we avoid this bug.
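
For scripts that run on several machines, the workaround could be applied only where it is needed. The sketch below assumes that only parallelly 1.29.0 on MS Windows is affected, which is what this thread reports; the helper needs_socket_options_workaround() is hypothetical and not part of parallelly:

```r
# Hypothetical helper: TRUE when the workaround from this thread is
# likely needed (assumption: only version 1.29.0 on MS Windows)
needs_socket_options_workaround <- function(
    os_type = .Platform$OS.type,
    version = utils::packageVersion("parallelly")) {
  os_type == "windows" && package_version(version) == "1.29.0"
}

# Apply the workaround only if parallelly is installed and affected
if (requireNamespace("parallelly", quietly = TRUE) &&
    needs_socket_options_workaround()) {
  ## the quotes around NULL are critical
  options(parallelly.makeNodePSOCK.socketOptions = "NULL")
}
```

Set the option before calling makeClusterPSOCK(); on other platforms or versions the default socket options are left untouched.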

dmolitor commented 2 years ago

Got it, that seems like a perfectly fine tradeoff. Thanks for your super quick response to this, as it rescued me from a lot of pain and headscratching 😄.

P.S. the future and family packages are great...thanks for all your time on them!

HenrikBengtsson commented 2 years ago

Would you mind trying with the develop version of parallelly:

remotes::install_github("HenrikBengtsson/parallelly", ref="develop")

and see if it works without the "option" workaround?

dmolitor commented 2 years ago

Output when using the option workaround with the dev version:

# What's the parallelly version?
packageVersion("parallelly")
#> [1] '1.29.0.9002'

# Set options accordingly
options(parallelly.makeNodePSOCK.socketOptions = "NULL")

# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = FALSE,
  verbose = TRUE,
  outfile = "", 
  ## Options below to make it not retry forever
  connectTimeout = 45,
  tries = 1
)

# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#>                                               sysname 
#>                                               "Linux" 
#>                                               release 
#>                                     "5.11.0-1021-aws" 
#>                                               version 
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021" 
#>                                              nodename 
#>                                    "ip-172-31-53-244" 
#>                                               machine 
#>                                              "x86_64" 
#>                                                 login 
#>                                              "ubuntu" 
#>                                                  user 
#>                                              "ubuntu" 
#>                                        effective_user 
#>                                              "ubuntu"
# Kill cluster
parallel::stopCluster(cl)

HenrikBengtsson commented 2 years ago

And without setting the option? Make sure to try in a fresh R session.

dmolitor commented 2 years ago

Whoops, for some reason I read that as with the option setting. Here's what you actually asked for:

# What's the parallelly version?
packageVersion("parallelly")
#> [1] '1.29.0.9002'

# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
  worker = "35.85.231.185",
  port = 11274,
  user = "ubuntu",
  rshopts = c("-o", "StrictHostKeyChecking=no",
              "-o", "IdentitiesOnly=yes",
              "-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
  manual = FALSE,
  verbose = TRUE,
  outfile = "", 
  ## Options below to make it not retry forever
  connectTimeout = 45,
  tries = 1
)

# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#>                                               sysname 
#>                                               "Linux" 
#>                                               release 
#>                                     "5.11.0-1021-aws" 
#>                                               version 
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021" 
#>                                              nodename 
#>                                    "ip-172-31-53-244" 
#>                                               machine 
#>                                              "x86_64" 
#>                                                 login 
#>                                              "ubuntu" 
#>                                                  user 
#>                                              "ubuntu" 
#>                                        effective_user 
#>                                              "ubuntu"
# Kill cluster
parallel::stopCluster(cl)

Works fine! I'm rendering everything via reprex::reprex(), so it should be a fresh R session.

HenrikBengtsson commented 2 years ago

Perfect. Thanks for confirming.