Closed dmolitor closed 2 years ago
Have you verified that you can launch Rscript over SSH, cf. https://parallelly.futureverse.org/reference/makeClusterPSOCK.html#failing-to-set-up-remote-workers?
Yep! For example:
{local}$ ssh -l ubuntu -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version
R scripting front-end version 4.1.2 (2021-11-01)
{local}$
Good. And what if you add the reverse SSH tunneling for the port you specify?
ssh -R 11274:127.0.0.1:11274 -l ubuntu -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version
And what happens if you use the full set of CLI options, as shown in the manual = TRUE output?
"C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 Rscript --version
Yep, adding the reverse SSH tunneling as well as the whole 9 yards of CLI options both correctly return
R scripting front-end version 4.1.2 (2021-11-01)
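For anyone repeating the three checks above, they can also be scripted from R. This is only a sketch: ssh_check_args() is a hypothetical helper (not part of parallelly), the host and key path in the usage lines are placeholders, and the system2() call is left commented out because it needs a reachable machine.

```r
# Hypothetical helper (not part of parallelly): build the argument vector for
# an 'ssh ... Rscript --version' check, optionally with a reverse tunnel.
ssh_check_args <- function(host, user, keyfile, port = NULL, extra = character()) {
  args <- character()
  if (!is.null(port)) {
    port <- as.integer(port)
    # Same reverse tunnel as in the commands above: remote port -> local port
    args <- c(args, "-R", sprintf("%d:127.0.0.1:%d", port, port))
  }
  c(args, "-l", user, extra, "-i", keyfile, host, "Rscript", "--version")
}

# Placeholder host and key path; replace with your own
args <- ssh_check_args("203.0.113.10", "ubuntu", "path/to/key.pem", port = 11274)
cat("ssh", args, "\n")
# system2("ssh", args)  # uncomment to run against a real host
```

Passing extra = c("-o", "StrictHostKeyChecking=no", "-o", "IdentitiesOnly=yes") would approximate the full set of options from the manual = TRUE output.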
Thanks. What happens if you set:
options(parallelly.makeNodePSOCK.socketOptions = "NULL") ## the quotes around NULL are critical
first?
That does it! Here's the output when setting the option prior to running:
# Set options accordingly
options(parallelly.makeNodePSOCK.socketOptions = "NULL")
# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
worker = "35.85.231.185",
port = 11274,
user = "ubuntu",
rshopts = c("-o", "StrictHostKeyChecking=no",
"-o", "IdentitiesOnly=yes",
"-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
manual = FALSE,
verbose = TRUE,
outfile = "",
## Options below to make it not retry forever
connectTimeout = 45,
tries = 1
)
# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#> sysname
#> "Linux"
#> release
#> "5.11.0-1021-aws"
#> version
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021"
#> nodename
#> "ip-172-31-53-244"
#> machine
#> "x86_64"
#> login
#> "ubuntu"
#> user
#> "ubuntu"
#> effective_user
#> "ubuntu"
# Kill cluster
parallel::stopCluster(cl)
And then just to make sure I'm not insane 😅:
# Confirm it still fails without setting the option???
# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
worker = "35.85.231.185",
port = 11274,
user = "ubuntu",
rshopts = c("-o", "StrictHostKeyChecking=no",
"-o", "IdentitiesOnly=yes",
"-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
manual = FALSE,
verbose = TRUE,
outfile = "",
## Options below to make it not retry forever
connectTimeout = 45,
tries = 1
)
#> Error in socketConnection(localhostHostname, port = port, server = TRUE, : Failed to launch and connect to R worker on remote machine '35.85.231.185' from local machine 'RIPL-89672'.
#> * The error produced by socketConnection() was: 'reached elapsed time limit' (which suggests that the connection timeout of 45 seconds (argument 'connectTimeout') kicked in)
#> * The localhost socket connection that failed to connect to the R worker used port 11274 using a communication timeout of 2592000 seconds and a connection timeout of 45 seconds.
#> * Worker launch call: "C:\Windows\System32\OpenSSH\ssh.exe" -R 11274:127.0.0.1:11274 -l ubuntu -o StrictHostKeyChecking=no -o IdentitiesOnly=yes -i C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem 35.85.231.185 "\"Rscript\" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e \"options(socketOptions = \\"no-delay\\")\" -e \"workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()\" MASTER=localhost PORT=11274 OUT= TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=45 SETUPSTRATEGY=sequential".
#> * Troubleshooting suggestions:
#> - Suggestion #1: On Windows, output from worker when using 'outfile=NULL' is only visible when running R from a terminal (not a GUI).
#> - Suggestion #2: Set 'rshlogfile=TRUE' to enable logging for 'C:\Windows\System32\OpenSSH\ssh.exe'.
#> - Suggestion #3: The 'rshcmd' ('C:\Windows\System32\OpenSSH\ssh.exe' [type='ssh', version='OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2']) used may not support reverse tunneling (revtunnel = TRUE). See ?parallelly::makeClusterPSOCK for alternatives.
#>
#>
#> * Number of attempts: 1 (15s delay)
Thanks a bunch! Can you give me some insight as to what is happening here, because to be honest I have pretty much no clue what setting that option changes?
Great.
It's a bug in parallelly 1.29.0 (the most recent version) that kicks in when one launches remote workers from an MS Windows machine. In parallelly 1.29.0, makeClusterPSOCK() sets option socketOptions="no-delay" (default) on the workers. It does so by launching the remote worker using:
Rscript ... -e 'options(socketOptions="no-delay")' ...
Now, when passing that R expression via SSH, one needs to make sure to use proper quotes, because the whole Rscript ... call is quoted by itself, e.g.
ssh ... "Rscript ... -e 'options(socketOptions=\"no-delay\")' ..."
This works correctly when launching remote workers from Linux/macOS, but when done from MS Windows, we get:
ssh ... "Rscript ... -e \"options(socketOptions=\\"no-delay\\")\" ..."
You can see this happening for you in the output of "Alternatively, start worker #1 ...".
I use shQuote() to juggle the shell quotes. However, the problem is that one needs to use shQuote(..., type = "sh") when launching remote Linux workers, which is the default when calling shQuote() from a Linux machine. But on MS Windows, it defaults to shQuote(..., type = "cmd"), which causes this issue. In short, I forgot to test this from MS Windows.
I'll fix this for the next release of parallelly. Thanks for reporting - it'll save others the same struggles you went through.
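The quoting difference described above can be reproduced with base R alone. The sketch below is a simplified illustration, not the actual parallelly code: quote_twice() is a hypothetical helper, and the Rscript command is shortened to a single -e argument.

```r
# The R expression that should be evaluated on the worker
expr <- 'options(socketOptions = "no-delay")'

# Hypothetical helper: quote once for the -e argument, then once more for the
# whole command that is handed to ssh
quote_twice <- function(expr, type) {
  inner <- shQuote(expr, type = type)
  shQuote(paste("Rscript -e", inner), type = type)
}

cmd_sh  <- quote_twice(expr, type = "sh")   # default on Linux/macOS
cmd_cmd <- quote_twice(expr, type = "cmd")  # default on MS Windows

cat("sh :", cmd_sh, "\n")
cat("cmd:", cmd_cmd, "\n")
```

With type = "sh" the inner expression stays in single quotes, whereas type = "cmd" produces the \\"no-delay\\" pattern visible in the failing worker launch call.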
Forgot to say, when you set options(parallelly.makeNodePSOCK.socketOptions = "NULL"), it will cause that whole -e 'options(socketOptions="no-delay")' to be dropped. That way, we avoid this bug.
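If a script has to run on several machines, the workaround could be applied only where this thread reports the bug (MS Windows with parallelly 1.29.0). A minimal sketch, assuming no other versions are affected; needs_workaround() is a hypothetical helper, not part of parallelly:

```r
# Hypothetical helper: TRUE only for the combination reported in this thread
needs_workaround <- function(version, os_type = .Platform$OS.type) {
  os_type == "windows" && numeric_version(as.character(version)) == "1.29.0"
}

# Apply the option only if parallelly is installed and affected
if (requireNamespace("parallelly", quietly = TRUE) &&
    needs_workaround(utils::packageVersion("parallelly"))) {
  # The quotes around NULL are required
  options(parallelly.makeNodePSOCK.socketOptions = "NULL")
}
```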
Got it, that seems like a perfectly fine tradeoff. Thanks for your super quick response to this, as it rescued me from a lot of pain and headscratching 😄.
P.S. the future and family packages are great...thanks for all your time on them!
Would you mind trying with the develop version of parallelly:
remotes::install_github("HenrikBengtsson/parallelly", ref="develop")
and see if it works without the "option" workaround?
Output when using the option workaround with the dev version:
# What's the parallelly version?
packageVersion("parallelly")
#> [1] '1.29.0.9002'
# Set options accordingly
options(parallelly.makeNodePSOCK.socketOptions = "NULL")
# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
worker = "35.85.231.185",
port = 11274,
user = "ubuntu",
rshopts = c("-o", "StrictHostKeyChecking=no",
"-o", "IdentitiesOnly=yes",
"-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
manual = FALSE,
verbose = TRUE,
outfile = "",
## Options below to make it not retry forever
connectTimeout = 45,
tries = 1
)
# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#> sysname
#> "Linux"
#> release
#> "5.11.0-1021-aws"
#> version
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021"
#> nodename
#> "ip-172-31-53-244"
#> machine
#> "x86_64"
#> login
#> "ubuntu"
#> user
#> "ubuntu"
#> effective_user
#> "ubuntu"
# Kill cluster
parallel::stopCluster(cl)
And without setting the option? Make sure to try in a fresh R session.
Whoops, for some reason I read that as with the option setting. Here's what you actually asked for:
# What's the parallelly version?
packageVersion("parallelly")
#> [1] '1.29.0.9002'
# Attempt to create same cluster automatically
cl <- parallelly::makeClusterPSOCK(
worker = "35.85.231.185",
port = 11274,
user = "ubuntu",
rshopts = c("-o", "StrictHostKeyChecking=no",
"-o", "IdentitiesOnly=yes",
"-i", "C:/Users/DanielMolitor/Documents/ripl/ssh-auth/Dan.pem"),
manual = FALSE,
verbose = TRUE,
outfile = "",
## Options below to make it not retry forever
connectTimeout = 45,
tries = 1
)
# Print cluster info
cl
#> Socket cluster with 1 nodes where 1 node is on host '35.85.231.185' (R version 4.1.2 (2021-11-01), platform x86_64-pc-linux-gnu)
# Create parallelization strategy with future
future::plan(future::cluster, workers = cl)
# Run code on node
furrr::future_walk(1, function(i) print(Sys.info()))
#> sysname
#> "Linux"
#> release
#> "5.11.0-1021-aws"
#> version
#> "#22~20.04.2-Ubuntu SMP Wed Oct 27 21:27:13 UTC 2021"
#> nodename
#> "ip-172-31-53-244"
#> machine
#> "x86_64"
#> login
#> "ubuntu"
#> user
#> "ubuntu"
#> effective_user
#> "ubuntu"
# Kill cluster
parallel::stopCluster(cl)
Works fine! I'm rendering everything via reprex::reprex, so it should be a fresh R session.
Perfect. Thanks for confirming.
Problem Background
Over the past several days I have been attempting to utilize parallelly::makeClusterPSOCK to create a remote cluster. I have been utterly failing and am unable to trace the root cause. The basic problem is as follows: when setting the manual argument to TRUE, I can successfully ssh into my remote machine and execute the $ Rscript ... parallel:::.workRSOCK() ... command, which then successfully connects the remote worker to the local host and allows me to execute code just fine. However, when I change the manual argument to FALSE, the process times out and fails to make the connection. The reprex outlining this full problem, as well as my session info, is attached below.
What I've attempted
To attempt to address this, I have looked through a number of related threads including this SO thread, #14, #7, among others, and have failed to find anything that solves the issue. I have also tried using PuTTY instead of the Windows default OpenSSH with the same exact result.
Help
Deep down I feel like this may not be a parallelly bug at all (I apologize if that's the case), but I'm at a complete loss and any help would be greatly appreciated.