HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0
134 stars 4 forks source link

BUG-ISH: parallel::makeCluster(..., type="PSOCK") overrides ~/.ssh/config username specifications #31

Open HenrikBengtsson opened 8 years ago

HenrikBengtsson commented 8 years ago

Quick summary

The PSOCK functionality of the parallel package is currently always passing an -l <username> option in the ssh call. If user doesn't specify a username (via argument user), then it will fall back to using the default username (= Sys.info()[["user"]]). The problem with this is that it overrides any username specifications in a ~/.ssh/config file. I propose a patch to parallel that only passes the -l <username> option, if user is explictly set. If not specified, it relies on ssh to do the correct thing.

Background

Using the parallel package, we can connect to a remote machine (or cluster) by using:

library("parallel")
myip <- readLines("https://myexternalip.com/raw")
cl <- makeCluster("remote.myserver.org", user="henrik", master=myip, homogeneous=FALSE)

The default is that the connection is set up via an ssh call. To verify that we can connect to the remote machine from our current location, we can use:

ssh -l henrik remote.myserver.org

(To verify that the remote machine in turn can connect back, which is also required, is a different topic and not relevant to this issue).

Issue

If one uses different usernames locally and remotely, it is convenient to configure the default username for a given server by editing ~/.ssh/config, e.g.

$ more ~/.ssh/config

Host remote.myserver.org
  User henrik

So, if my local username is hb, i.e.

$ echo $USER
hb

with the above ~/.ssh/config file, I no longer have to specify -l henrik, but I can just do:

ssh remote.myserver.org

regardless of my local username. (Without a ~/.ssh/config file, ssh would fall back to use my local username (as in -l $USER). I will return to this in my proposal at the end.)

Unfortunately, this does not work with parallel::makeCluster(). In other words, it is not possible to do just:

library("parallel")
myip <- readLines("https://myexternalip.com/raw")
cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)

Troubleshooting

The reason that the username in ~/.ssh/config is ignored is that the parallel package will override it because it always passes option -l <local username> to ssh. For example,

> library("parallel")
> myip <- readLines("https://myexternalip.com/raw")
> trace(system, tracer=quote(print(command)))
> cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh -l hb remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11633 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""

which matches my local username:

> Sys.info()[["user"]]
[1] "hb

If we dig into the code of the parallel package, we find the following piece of code in parallel:::newPSOCKnode():

        if (machine != "localhost") {
            ## This assumes an ssh-like command
            rshcmd <- getClusterOption("rshcmd", options)
            user <- getClusterOption("user", options)
            ## this assume that rshcmd will use a shell, and that is
            ## the same shell as on the master.
            cmd <- shQuote(cmd)
            cmd <- paste(rshcmd, "-l", user, machine, cmd)
        }

where

> options <- parallel:::defaultClusterOptions
> options$user
[1] "hb"

Further inspection shows that getClusterOption("user", options) will return the above default options$user value if not explicitly specified as an argument to makeCluster() (see initial example).

The parallel:::defaultClusterOptions object is initialized with user = Sys.info()[["user"]] as seen in parallel:::initDefaultClusterOptions() which is called when the parallel package is loaded.

Workaround

Since parallel:::newPSOCKnode() will always inject -l <username> it is not clear how to circumvent this. For instance, trying to trick it with parallel:::defaultClusterOptions$user <- NULL will not work.

Another alternative would be to use a custom rshcmd script that discards the -l username options. However, that would also discard user when it is legitimitely specified as an argument to makeCluster(). It would also be tricky to write a solution that would work out-of-the-box on all operating systems.

The only solution I see is to parse ~/.ssh/config to check whether it specifies a different remote username than the default local username. If it does, then user should be set to this remote username specified in ~/.ssh/config. That could kind of work, but it would require a robust parser.

In summary, I don't see a neat workaround for this problem.

Suggestion

If option -l <username> is not specified in the call to ssh, it will fall back to use whatever is specified in the ~/.ssh/config file and otherwise it will fall back to use the local username (basically -l $USER). In other words, there is no real reason for the parallel package to do this work instead of ssh. If the parallel package would only pass -l <username> if user is explicitly set, then and User specifications in ~/.ssh/config would also be acknowledged.

Patch

Here's an SVN patch (svn diff src/library/parallel/R/snow*.R) that would achieve this:

Index: src/library/parallel/R/snow.R
===================================================================
--- src/library/parallel/R/snow.R   (revision 71304)
+++ src/library/parallel/R/snow.R   (working copy)
@@ -97,7 +97,7 @@
                     outfile = "/dev/null",
                     rscript = rscript,
                     rscript_args = character(),
-                    user = Sys.i[["user"]],
+                    user = NULL,
                     rshcmd = "ssh",
                     manual = FALSE,
                     methods = TRUE,
Index: src/library/parallel/R/snowSOCK.R
===================================================================
--- src/library/parallel/R/snowSOCK.R   (revision 71304)
+++ src/library/parallel/R/snowSOCK.R   (working copy)
@@ -75,7 +75,8 @@
             ## this assume that rshcmd will use a shell, and that is
             ## the same shell as on the master.
             cmd <- shQuote(cmd)
-            cmd <- paste(rshcmd, "-l", user, machine, cmd)
+            opts <- if (is.null(user)) NULL else paste("-l", user)
+            cmd <- paste(rshcmd, opts, machine, cmd)
         }

         if (.Platform$OS.type == "windows") {

Proof of concept

Here's a proof-of-concept hack that allows you to test the above patch without having to rebuild R from source:

## Tweak parallel:::newPSOCKnode()
newPSOCKnode <- parallel:::newPSOCKnode
expr <- body(newPSOCKnode)
code <- deparse(expr)
pattern <- 'cmd <- paste(rshcmd, "-l", user, machine, cmd)'
replacement <- 'opts <- if (is.null(user)) NULL else paste("-l", user); cmd <- paste(rshcmd, opts, machine, cmd)'
code <- gsub(pattern, replacement, code, fixed=TRUE)
expr <- parse(text=code)
body(newPSOCKnode) <- expr
assignInNamespace("newPSOCKnode", newPSOCKnode, ns=getNamespace("parallel"))

## Remove default 'user' option
opts <- parallel:::defaultClusterOptions
opts$user <- NULL
assignInNamespace("defaultClusterOptions", opts, ns=getNamespace("parallel"))
stopifnot(is.null(parallel:::defaultClusterOptions$user))
> library("parallel")
> myip <- readLines("https://myexternalip.com/raw")
> trace(system, tracer=quote(print(command)))

## Argument 'user' still works
> cl <- makeCluster("remote.myserver.org", user="henrik", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh -l henrik remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11266 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""

## But if not specified it is not passed
> cl <- makeCluster("remote.myserver.org", master=myip, homogeneous=FALSE)
Tracing system(cmd, wait = FALSE) on entry 
[1] "ssh remote.myserver.org \"Rscript --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'parallel:::.slaveRSOCK()' MASTER=1.2.3.4 PORT=11633 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE\""
HenrikBengtsson commented 8 years ago

Is it safe to drop defaultClusterOptions$user / set it to NULL?

There are no other usages of defaultClusterOptions$user in core R that I'm aware of. I grep:ed all of the source code as below and I could only find that it's used within parallel:::newPSOCKnode().

$ grep -r --include='*.R' -v -E '^[[:space:]]*#.*' | grep -E "[^#]*([\$(\"' ]+)user([)\"' ]+|$)"
utils/R/edit.R:        if(factor.mode != mode(out[[i]])) next # user might have switched mode
utils/R/databrowser.R:                  si[c("user","nodename","sysname")]})))
utils/R/tar.R:        warning(gettextf("invalid uid value replaced by that for user 'nobody'", uid),
utils/R/tar.R:        warning(gettextf("invalid gid value replaced by that for user 'nobody'", uid),
utils/R/SweaveDrivers.R:        leading <- 1L    # How many lines get the user prompt
graphics/R/legend.R:    text.width <- max(abs(strwidth(legend, units="user",
graphics/R/legend.R:    ymax   <- yc * max(1, strheight(legend, units="user", cex=cex)/yc)
graphics/R/legend.R:        && (abs(tw <- strwidth(title, units="user", cex=cex) + 0.5*xchar)) > abs(w)) {
graphics/R/plot.R:            "user", "inches", "", "", "npc")
graphics/R/plot.R:grconvertX <- function(x, from = "user", to = "user")
graphics/R/plot.R:grconvertY <- function(y, from = "user", to = "user")
graphics/R/pairs.R:                            l.wid <- strwidth(labels, "user")
graphics/R/strwidth.R:    function(s, units = "user", cex = NULL, font = NULL, vfont = NULL,...)
graphics/R/strwidth.R:                       pmatch(units, c("user", "figure", "inches")),
graphics/R/strwidth.R:    function(s, units = "user", cex = NULL, font = NULL, vfont = NULL, ...)
graphics/R/strwidth.R:                       pmatch(units, c("user", "figure", "inches")),
base/R/time.R:        c(gettext("user"), gettext("system"), gettext("elapsed"))
parallel/R/snowSOCK.R:            user <- getClusterOption("user", options)
parallel/R/snow.R:                    user = Sys.i[["user"]],
parallel/R/unix/mclapply.R:            warning("all scheduled cores encountered errors in user code")
parallel/R/unix/mclapply.R:                                     "scheduled core %s encountered error in user code, all values of the job will be affected",
parallel/R/unix/mclapply.R:                                     "scheduled cores %s encountered errors in user code, all values of the jobs will be affected"),
tools/R/build.R:        user <- Sys.info()["user"]
tools/R/build.R:        if(user == "unknown") user <- Sys.getenv("LOGNAME")
tools/R/build.R:                    user)
methods/R/RClassUtils.R:        if(!identical(default, value)) # user supplied default
methods/R/RClassUtils.R:                if(!identical(default, value)) # user supplied default
stats/R/spectrum.R:    if(!is.null(spans)) # allow user to mistake order of args
HenrikBengtsson commented 4 years ago

As of R-devel (rev 79418; 2020-11-12), parallel::makePSOCKcluster() now supports disabling SSH user by specifying user=NULL, e.g.

> cl <- parallel::makePSOCKcluster("example.org")             ## defaults to user=Sys.info()[["user"]]
> cl <- parallel::makePSOCKcluster("example.org", user=NULL)  ## drops ssh option -l <user>
HenrikBengtsson commented 4 years ago

I'd say, what can be improved here is to have user = NULL be the default. I argue that it is a backward compatible change.

Currently, parallel:::initDefaultClusterOptions() sets the default:

                    user = Sys.i[["user"]],

Now, if we don't specify the command-line option -l user, the SSH client will use

  1. the username in the entry in ~/.ssh/config, and if not set,
  2. the username $USER

I think Sys.info()[["user"]] is the same as $USER, which means that defaulting to

                    user = Sys.i[["user"]],

is not needed. BTW, user is only used for remote shells.

HenrikBengtsson commented 3 years ago

I've posted https://bugs.r-project.org/bugzilla/show_bug.cgi?id=18042 to make user = NULL the default.