HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0

parallel::makePSOCKcluster(): Add support for reverse SSH tunnels (and any optional SSH command-line options) #32

Open HenrikBengtsson opened 7 years ago

HenrikBengtsson commented 7 years ago

Quick summary

Add support for reverse SSH tunneling (-R <port>:localhost:<port>) when setting up PSOCK clusters using parallel::makeCluster(). This helps avoid firewall and port forwarding issues that appear when trying to connect to remote machines / clusters.

Basically, the proposed patch allows you to connect to remote R machines from anywhere as long as you can ssh directly to the machine.

If you have comments, suggestions, ideas and / or critique, please comment below. The plan is to collect and summarize feedback here, then to bring it up on R-devel, and eventually submit the patch to https://bugs.r-project.org/.

Background

The makeCluster() function of the parallel package can be used to run on a remote cluster. This can typically be done as:

library("parallel")
cl <- makeCluster("remote.myserver.org", user="johndoe", master="local.mymachine.org", port=11001, homogeneous=FALSE)
res <- parLapply(cl, 1:3, fun=function(x) x^2)
stopCluster(cl)

(*) If port is not specified, a random port in [11000,11999] is used.
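
As a rough illustration of the footnote above, a port in that range could be drawn as follows. This is only a sketch: pick_port() is a hypothetical helper, not the code that parallel itself uses to choose the port.

```r
## Hypothetical helper: draw one port uniformly from the documented range.
## NOTE: an assumption for illustration, not the implementation in 'parallel'.
pick_port <- function(range = 11000:11999) {
  range[sample.int(length(range), size = 1L)]
}

set.seed(42)  # only to make the sketch reproducible
port <- pick_port()
print(port)
```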

By default this results in a connection to remote.myserver.org over SSH via an internal system() call like:

ssh -l johndoe remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=local.mymachine.org PORT=11001 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE"

Issue

Now, in order for this remote connection to be set up successfully, it is not only necessary for the ssh -l johndoe remote.myserver.org connection to work, but also for remote.myserver.org to be able to open a socket in the reverse direction back to our local machine at local.mymachine.org on port 11001. The latter part is problematic because it requires us to open up any local firewalls to allow incoming connections to port 11001 (or any port in the range [11000,11999]). Even worse is when we're behind a local router, e.g. if we're on a notebook connected via a WiFi router. In such cases we also have to configure the router to forward ("port forwarding") incoming connections to port 11001 (or any port in the range [11000,11999]) to our notebook. If two or more users try to do the same, things become complicated. This not only requires you to have access privileges to configure the local router, but you most likely also have to configure DHCP to assign a static IP address to your notebook and to that of everyone else who wishes to do the same. You also have to make sure you're not all trying to use the same ports.

Solution

In SSH there is a concept called reverse tunneling, which makes it possible to set up a reverse port-to-port connection within the outgoing connection. This way there is no need to worry about remote.myserver.org being able to connect back to your local machine. As long as you can make the outgoing SSH connection, the reverse connection should work out of the box (*).

By replacing the above SSH call with

ssh -l johndoe -R 11001:localhost:11001 remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11001 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE"

the remote R worker will try to open the reverse connection on port 11001 on localhost (= the remote machine). Since reverse tunneling is used, this connection is forwarded to port 11001 on the calling machine (= your local machine).

In addition to the above, this also has the advantage that you don't need to know your public IP address or have dynamic DNS set up.

(*) An exception is when you use SSH tunneling in your outgoing connection to remote.myserver.org. In such cases, you might have to use more complex reverse SSH tunneling than proposed here.

Suggestion

Add support for reverse SSH tunneling, e.g.

cl <- makeCluster("remote.myserver.org", user="johndoe", revtunnel=TRUE, master="localhost", port=11001, homogeneous=FALSE)

Proposed patch

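Here is a patch (svn diff src/library/parallel) that adds the options revtunnel and rshopts, and changes the default of user to NULL so that -l <user> is only passed when a user is specified: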

Index: src/library/parallel/R/snow.R
===================================================================
--- src/library/parallel/R/snow.R   (revision 71320)
+++ src/library/parallel/R/snow.R   (working copy)
@@ -97,8 +97,10 @@
                     outfile = "/dev/null",
                     rscript = rscript,
                     rscript_args = character(),
-                    user = Sys.i[["user"]],
+                    user = NULL,
                     rshcmd = "ssh",
+                    revtunnel = FALSE,
+                    rshopts = NULL,
                     manual = FALSE,
                     methods = TRUE,
                     renice = NA_integer_,
Index: src/library/parallel/R/snowSOCK.R
===================================================================
--- src/library/parallel/R/snowSOCK.R   (revision 71320)
+++ src/library/parallel/R/snowSOCK.R   (working copy)
@@ -71,11 +71,24 @@
         if (machine != "localhost") {
             ## This assumes an ssh-like command
             rshcmd <- getClusterOption("rshcmd", options)
+            opts <- NULL
+
+            ## Specify '-l user'?
             user <- getClusterOption("user", options)
+            if (!is.null(user)) opts <- c(opts, paste("-l", user))
+
+            ## Use SSH reverse tunneling?
+            revtunnel <- getClusterOption("revtunnel", options)
+            if (isTRUE(revtunnel)) opts <- c(opts, sprintf("-R %d:%s:%d", port, master, port))
+
+            ## Additional SSH options?
+            opts <- c(opts, getClusterOption("rshopts", options))
+
             ## this assume that rshcmd will use a shell, and that is
             ## the same shell as on the master.
             cmd <- shQuote(cmd)
-            cmd <- paste(rshcmd, "-l", user, machine, cmd)
+            opts <- paste(opts, collapse = " ")
+            cmd <- paste(rshcmd, opts, machine, cmd)
         }

         if (.Platform$OS.type == "windows") {
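
The option handling in the patch above can be tried in isolation. The following is a minimal sketch under stated assumptions: build_ssh_opts() is a hypothetical stand-alone helper (it is not part of parallel), and the arguments are passed directly instead of being looked up with getClusterOption(). The extra option "-p 2222" is just an example value for rshopts.

```r
## Hypothetical helper (not part of 'parallel'): assembles the SSH options
## the same way the proposed patch does inside the worker-launch code.
build_ssh_opts <- function(user = NULL, revtunnel = FALSE, rshopts = NULL,
                           master = "localhost", port = 11001L) {
  opts <- NULL
  ## Add '-l user' only when a user name is given
  if (!is.null(user)) opts <- c(opts, paste("-l", user))
  ## Add the reverse tunnel '-R <port>:<master>:<port>' on request
  if (isTRUE(revtunnel))
    opts <- c(opts, sprintf("-R %d:%s:%d", port, master, port))
  ## Append any additional SSH options verbatim
  opts <- c(opts, rshopts)
  paste(opts, collapse = " ")
}

opts_default <- build_ssh_opts(user = "johndoe")
opts_tunnel  <- build_ssh_opts(user = "johndoe", revtunnel = TRUE)
opts_extra   <- build_ssh_opts(revtunnel = TRUE, rshopts = "-p 2222")
print(c(opts_default, opts_tunnel, opts_extra))
```

With user = NULL no -l option is emitted at all, which leaves the choice of user name to SSH itself.
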
HenrikBengtsson commented 7 years ago

Updated patch for reverse SSH tunneling

Issue

When setting up multiple PSOCK R worker sessions on the same machine, they will all try to connect back to the same port, e.g.

> trace(system, tracer = quote(message(command)), print = FALSE)
Tracing function "system" in package "base"
[1] "system"
> library("parallel")
> cl <- makeCluster(rep("remote.myserver.org", 2L), user=NULL, master="local.mymachine.org", homogeneous=FALSE)
ssh  remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=local.mymachine.org PORT=11120 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"
ssh  remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=local.mymachine.org PORT=11120 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"

Note how both R worker processes are set up to connect back to local.mymachine.org:11120.

Now, this will currently not work with the reverse-SSH-tunnel patch, because both workers (running on the same machine) will then try to set up a reverse SSH tunnel on the same local port:

> cl <- parallel::makeCluster(rep("remote.myserver.org", 2L), user=NULL, revtunnel=TRUE, master="localhost", homogeneous=FALSE)
ssh -R 11456:localhost:11456 remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11456 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"
ssh -R 11456:localhost:11456 remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11456 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"
Warning: remote port forwarding failed for listen port 11456

Only the first one succeeds; each of the following ones (here one) fails because the local port is already taken.

Solution

The solution is to make sure each R worker uses a unique local port local_port in SSH option -R local_port:localhost:11456. This could be done by setting local_port = port + (rank - 1L), as in the following updated patch:

Index: src/library/parallel/R/snow.R
===================================================================
--- src/library/parallel/R/snow.R   (revision 71456)
+++ src/library/parallel/R/snow.R   (working copy)
@@ -97,8 +97,10 @@
                     outfile = "/dev/null",
                     rscript = rscript,
                     rscript_args = character(),
-                    user = Sys.i[["user"]],
+                    user = NULL,
                     rshcmd = "ssh",
+                    revtunnel = FALSE,
+                    rshopts = NULL,
                     manual = FALSE,
                     methods = TRUE,
                     renice = NA_integer_,
Index: src/library/parallel/R/snowSOCK.R
===================================================================
--- src/library/parallel/R/snowSOCK.R   (revision 71456)
+++ src/library/parallel/R/snowSOCK.R   (working copy)
@@ -39,7 +39,7 @@
     ## build the local command for starting the worker
     env <- paste0("MASTER=", master,
                  " PORT=", port,
-                 " OUT=", outfile,
+                 " OUT=", shQuote(outfile),
                  " TIMEOUT=", timeout,
                  " XDR=", useXDR)
     arg <- "parallel:::.slaveRSOCK()"
@@ -71,11 +71,25 @@
         if (machine != "localhost") {
             ## This assumes an ssh-like command
             rshcmd <- getClusterOption("rshcmd", options)
+            opts <- NULL
+            
+            ## Specify '-l user'?
             user <- getClusterOption("user", options)
+            if (!is.null(user)) opts <- c(opts, paste("-l", user))
+
+            ## Use SSH reverse tunneling?
+            revtunnel <- getClusterOption("revtunnel", options)
+            if (isTRUE(revtunnel))
+                opts <- c(opts, sprintf("-R %d:%s:%d", port + (rank - 1L), master, port))
+
+            ## Additional SSH options?
+            opts <- c(opts, getClusterOption("rshopts", options))
+
             ## this assume that rshcmd will use a shell, and that is
             ## the same shell as on the master.
             cmd <- shQuote(cmd)
-            cmd <- paste(rshcmd, "-l", user, machine, cmd)
+            opts <- paste(opts, collapse = " ")
+            cmd <- paste(rshcmd, opts, machine, cmd)
         }

         if (.Platform$OS.type == "windows") {
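
The per-worker port arithmetic used in the updated patch can be checked on its own. This is a small sketch with illustrative values: port and ranks are assumptions chosen for the example, whereas in the patch they come from the cluster setup.

```r
## Illustrative values; 'port' and 'ranks' are assumptions for this sketch
port  <- 11921L
ranks <- 1:2

## One unique port per worker rank: port + (rank - 1L)
local_ports <- port + (ranks - 1L)

## The '-R' option each worker's ssh call would get with master = "localhost"
tunnel_opts <- sprintf("-R %d:%s:%d", local_ports, "localhost", port)
print(tunnel_opts)
```

Because the offsets differ per rank, no two workers on the same machine request the same tunnel port.
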

Example

The above local port tweak solves the port clash:

> trace(system, tracer=quote(message(command)), print=FALSE)
Tracing function "system" in package "base"
[1] "system"
> library("parallel")
> cl <- makeCluster(rep("remote.myserver.org", 2), user=NULL, revtunnel=TRUE, master="localhost", homogeneous=FALSE)
ssh -R 11921:localhost:11921 remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11921 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"
ssh -R 11922:localhost:11921 remote.myserver.org "Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=localhost PORT=11921 OUT='/dev/null' TIMEOUT=2592000 XDR=TRUE"
> res <- parLapply(cl, 1:3, fun=function(x) x^2)
> unlist(res)
[1] 1 4 9
> stopCluster(cl)