HenrikBengtsson / parallelly

R package: parallelly - Enhancing the 'parallel' Package
https://parallelly.futureverse.org

makeClusterPSOCK(..., rscript_envs = ...) - more clever #8

Open HenrikBengtsson opened 4 years ago

HenrikBengtsson commented 4 years ago

Instead of doing this via -e "Sys.setenv('<name>'='<value>')" options, can't we do:

> Sys.setenv(FOO="bar")
> system2("Rscript", args = c("-e", shQuote("Sys.getenv('FOO')")), stdout=TRUE)
[1] "[1] \"bar\""
> my_undo_env_fcn() 

This way we can set env vars that need to be set very early on in the R startup process in order to take effect, e.g. TMPDIR.

I've verified that the above works on Linux and Windows. It may be worth adding an internal with_env() to make sure things are properly undone for the main R session.
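A minimal sketch of what such an internal with_env() could look like. The name comes from the comment above, but the signature and the implementation are assumptions made for illustration, not parallelly's actual code. It sets the variables, evaluates the expression (for instance a system2() call that launches the worker), and restores the previous state on exit, including unsetting variables that were not set before.

```r
# Hypothetical helper (assumed signature): temporarily set environment
# variables while 'expr' is evaluated, then restore the previous state.
with_env <- function(envs, expr) {
  vars <- names(envs)
  # Record current values; NA marks variables that are currently unset
  old <- Sys.getenv(vars, unset = NA_character_, names = TRUE)
  on.exit({
    was_set <- !is.na(old)
    if (any(was_set)) do.call(Sys.setenv, as.list(old[was_set]))
    if (any(!was_set)) Sys.unsetenv(vars[!was_set])
  }, add = TRUE)
  do.call(Sys.setenv, as.list(envs))
  # 'expr' is a promise; it is first evaluated here, after the variables are set
  expr
}

# Usage: the child process inherits FOO from the temporarily modified session
out <- with_env(c(FOO = "bar"), {
  system2(file.path(R.home("bin"), "Rscript"),
          args = c("-e", shQuote("cat(Sys.getenv('FOO'))")),
          stdout = TRUE)
})
```

Because the restore step runs via on.exit(), the main R session is cleaned up even if launching the worker fails with an error.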

HenrikBengtsson commented 4 years ago

This will work for the local machine. But, what about remote sessions over, say, SSH?

HenrikBengtsson commented 4 years ago

Ideally, R should support this, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/110

HenrikBengtsson commented 4 years ago

Per https://github.com/HenrikBengtsson/future/issues/392, we now support:

cl <- makeClusterPSOCK(..., rscript = c("LD_LIBRARY_PATH=/path/to", "Rscript"))

EDIT: Note that this does not work on MS Windows.

HenrikBengtsson commented 2 years ago

Regarding not being able to pass environment variables sooner in the R startup process:

So, Rscript ... expands to R --no-echo --no-restore ..., and, contrary to Rscript, we can pass environment variables to R as R PI="3.14" ... --args .... We could do this shuffling internally in makeNodePSOCK(). Since R doesn't take option --default-packages=<pkgs>, we need to pass those via R_DEFAULT_PACKAGES=... and we need to make sure to inject an --args too.

Example: Local worker

Instead of:

> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  "C:/PROGRA~1/R/R-41~1.0/bin/x64/Rscript" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "options(socketOptions = \"no-delay\")" -e "Sys.setenv(\"PI\"=\"3.14\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

we could have it do:

> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  "C:/PROGRA~1/R/R-41~1.0/bin/x64/R" --no-echo --no-restore R_DEFAULT_PACKAGES="datasets,utils,grDevices,graphics,stats,methods" PI="3.14" -e "options(socketOptions = \"no-delay\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" --args MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Example: Remote worker

Instead of:

> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)

----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org

and (ii) start worker #1 from there:

  'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'Sys.setenv("PI"="3.14")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Alternatively, start worker #1 from the local machine by combining both step in a single call:

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'Sys.setenv(\"PI\"=\"3.14\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

we could do:

> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)

----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org

and (ii) start worker #1 from there:

  'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Alternatively, start worker #1 from the local machine by combining both step in a single call:

  '/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"

Note that the above R PI="3.14" ... form is for MS Windows. On all other platforms, we need to do PI="3.14" R ..., which means we can equally well do PI="3.14" Rscript ... there.
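To make the platform difference concrete, here is a rough sketch of how the launch command could be assembled. build_worker_cmd() and its arguments are hypothetical names used only for illustration; this is not how makeNodePSOCK() is implemented. On MS Windows the NAME=value pairs are placed after the R executable, elsewhere they are placed in front of it as a shell prefix.

```r
# Hypothetical helper: place NAME=value assignments where each platform
# expects them. Not part of parallelly.
build_worker_cmd <- function(rbin, args, envs,
                             windows = (.Platform$OS.type == "windows")) {
  type <- if (windows) "cmd" else "sh"
  # Quote only the values; names(envs) are assumed to be valid variable names
  assigns <- sprintf("%s=%s", names(envs), shQuote(unname(envs), type = type))
  parts <- if (windows) {
    c(shQuote(rbin, type = type), assigns, args)   # R PI="3.14" ...
  } else {
    c(assigns, shQuote(rbin, type = type), args)   # PI='3.14' R ...
  }
  paste(parts, collapse = " ")
}

cmd_unix <- build_worker_cmd("R", c("--no-echo", "--no-restore"),
                             envs = c(PI = "3.14"), windows = FALSE)
cmd_win  <- build_worker_cmd("R", c("--no-echo", "--no-restore"),
                             envs = c(PI = "3.14"), windows = TRUE)
```

The 'windows' flag is passed explicitly so that the same helper can build a command for a remote worker whose platform differs from the local one.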

HenrikBengtsson commented 2 years ago

In parallelly (>= 1.29.0-9003), we can now do (Issue #75):

> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  '/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11920 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

Now, contrary to Rscript, R does not support --default-packages=..., so that option is ignored and we get a warning:

> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"))
WARNING: unknown option '--default-packages=datasets,utils,grDevices,graphics,stats,methods'

> cl
Socket cluster with 1 nodes where 1 node is on host 'localhost' (R version 4.1.2 Patched (2021-11-01 r81123), platform x86_64-pc-linux-gnu)

HenrikBengtsson commented 2 years ago

In the develop version (commit 22993892), default packages are now set via R_DEFAULT_PACKAGES when Rscript is not used, e.g.

cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:

  R_DEFAULT_PACKAGES=datasets,utils,grDevices,graphics,stats,methods '/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11606 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential

This avoids the above warning.
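The decision can be pictured roughly as follows. This is only a sketch under the assumption that the choice is made from the executable's file name; default_packages_spec() is a hypothetical helper and not the code in the commit referenced above.

```r
# Hypothetical helper: decide how the default packages are handed to the
# worker, depending on whether the front-end is Rscript or not.
default_packages_spec <- function(rbin, pkgs) {
  value <- paste(pkgs, collapse = ",")
  name <- tolower(basename(rbin))
  if (name %in% c("rscript", "rscript.exe")) {
    # Rscript understands the command-line option
    list(args = paste0("--default-packages=", value), envs = character(0L))
  } else {
    # Other front-ends, e.g. R, get the environment variable instead
    list(args = character(0L), envs = c(R_DEFAULT_PACKAGES = value))
  }
}

spec_r  <- default_packages_spec(file.path(R.home("bin"), "R"),
                                 c("datasets", "utils", "stats", "methods"))
spec_rs <- default_packages_spec("Rscript", c("datasets", "utils", "stats", "methods"))
```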

Currently, this R_DEFAULT_PACKAGES workaround is only applied for locally launched cluster nodes. For remote workers, we'll get a warning that it's not supported.

HenrikBengtsson commented 2 years ago

Currently, this R_DEFAULT_PACKAGES workaround is only applied for locally launched cluster nodes. For remote workers, we'll get a warning that it's not supported.

Update: New argument rscript_sh is used to infer whether a cluster node is launched on MS Windows or not. This allowed me to rely on R_DEFAULT_PACKAGES also for remote workers.

HenrikBengtsson commented 2 years ago

Argh... so, on MS Windows, R does not escape quotes at the CLI like Rscript and Rterm, cf. https://stat.ethz.ch/pipermail/r-devel/2021-December/081371.html.

So, on MS Windows, the above R workaround has to use Rterm instead.
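In code, that choice might look like the sketch below. r_frontend() is a hypothetical name, and the assumption is that Rterm lives in the same R.home("bin") folder as R on MS Windows.

```r
# Hypothetical helper: pick the non-Rscript front-end for the worker.
# On MS Windows, use Rterm because of the quoting issue described above.
r_frontend <- function(windows = (.Platform$OS.type == "windows")) {
  file.path(R.home("bin"), if (windows) "Rterm" else "R")
}

frontend <- r_frontend()
```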