Open HenrikBengtsson opened 4 years ago
This will work for the local machine. But, what about remote sessions over, say, SSH?
Ideally, R should support this, cf. https://github.com/HenrikBengtsson/Wishlist-for-R/issues/110
Per https://github.com/HenrikBengtsson/future/issues/392, we now support:
```r
cl <- makeClusterPSOCK(..., rscript = c("LD_LIBRARY_PATH=/path/to", "Rscript"))
```
EDIT: Note that this does not work on MS Windows.
Regarding not being able to pass environment variables sooner in the R startup process:

So, `Rscript ...` expands to `R --no-echo --no-restore ...`, and, contrary to `Rscript`, we can pass environment variables to `R` as `R PI="3.14" ... --args ...`. We could do this shuffling internally in `makeNodePSOCK()`. Since `R` doesn't take option `--default-packages=<pkgs>`, we need to pass those via `R_DEFAULT_PACKAGES=...` and we need to make sure to inject an `--args` too.
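The shuffling described above could be sketched roughly as follows. This is not how parallelly implements it; `rscript_to_r_args()` is a hypothetical helper, and it assumes the worker's own arguments begin at the first `MASTER=` element.

```r
# Hypothetical helper (not part of parallelly): translate an Rscript-style
# argument vector into one suitable for the 'R' front-end.
rscript_to_r_args <- function(args) {
  # Pull out --default-packages=<pkgs>, which 'R' does not understand
  is_pkgs <- grepl("^--default-packages=", args)
  pkgs <- sub("^--default-packages=", "", args[is_pkgs])
  args <- args[!is_pkgs]

  # Assumption: the worker arguments start at the first MASTER=... element,
  # so '--args' is injected right before it
  pos <- grep("^MASTER=", args)[1]
  if (!is.na(pos)) args <- append(args, "--args", after = pos - 1L)

  # Default packages are handed over as an environment variable instead
  envs <- if (length(pkgs) > 0L) c(R_DEFAULT_PACKAGES = pkgs[1]) else character(0L)
  list(envs = envs, args = c("--no-echo", "--no-restore", args))
}

res <- rscript_to_r_args(c(
  "--default-packages=datasets,utils",
  "-e", "workRSOCK()",
  "MASTER=localhost", "PORT=11876"
))
str(res)
```

The returned `envs` element would then be placed on the command line as environment variables, and `args` passed to `R` as-is.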
Instead of:
```r
> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:
"C:/PROGRA~1/R/R-41~1.0/bin/x64/Rscript" --default-packages=datasets,utils,grDevices,graphics,stats,methods -e "options(socketOptions = \"no-delay\")" -e "Sys.setenv(\"PI\"=\"3.14\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
```
we could have it do:
```r
> cl <- parallelly::makeClusterPSOCK(1L, rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:
"C:/PROGRA~1/R/R-41~1.0/bin/x64/R" --no-echo --no-restore R_DEFAULT_PACKAGES="datasets,utils,grDevices,graphics,stats,methods" PI="3.14" -e "options(socketOptions = \"no-delay\")" -e "workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()" --args MASTER=localhost PORT=11876 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
```
Instead of:
```r
> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':
'/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org
and (ii) start worker #1 from there:
'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'Sys.setenv("PI"="3.14")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
Alternatively, start worker #1 from the local machine by combining both step in a single call:
'/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = \"no-delay\")' -e 'Sys.setenv(\"PI\"=\"3.14\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
```
we could do:
```r
> cl <- parallelly::makeClusterPSOCK("remote.example.org", rscript_envs = c(PI="3.14"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, (i) login into external machine 'remote.example.org':
'/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org
and (ii) start worker #1 from there:
'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
Alternatively, start worker #1 from the local machine by combining both step in a single call:
'/usr/bin/ssh' -R 11121:localhost:11121 remote.example.org "'R' --no-echo --no-restore R_DEFAULT_PACKAGES='datasets,utils,grDevices,graphics,stats,methods' PI='3.14' -e 'options(socketOptions = \"no-delay\")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11121 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential"
```
Note that the above `R PI="3.14" ...` is for MS Windows. On all other platforms, we need to do `PI="3.14" R ...`, which means we can equally well do `PI="3.14" Rscript ...` there.
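To make the platform difference concrete, here is a minimal sketch of how the placement of the `NAME=value` pairs could be decided. `worker_call()` is a hypothetical helper written for illustration only; it is not a parallelly function.

```r
# Hypothetical helper: place NAME=value pairs where each platform expects them.
worker_call <- function(bin, envs, args, windows = .Platform$OS.type == "windows") {
  type <- if (windows) "cmd" else "sh"
  pairs <- sprintf("%s=%s", names(envs), shQuote(envs, type = type))
  if (windows) {
    # MS Windows: 'R NAME="value" ...' (the assignments follow the executable)
    paste(c(shQuote(bin, type = type), pairs, args), collapse = " ")
  } else {
    # POSIX shells: 'NAME=value R ...' (the shell sets them for the command)
    paste(c(pairs, shQuote(bin, type = type), args), collapse = " ")
  }
}

posix_call <- worker_call("Rscript", c(PI = "3.14"), "--version", windows = FALSE)
win_call   <- worker_call("R", c(PI = "3.14"), "--version", windows = TRUE)
cat(posix_call, win_call, sep = "\n")
```

Because a POSIX shell handles the leading assignments itself, the same prefix works for `Rscript` as well as `R` there, which is the point made above.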
In parallelly (>= 1.29.0-9003), we can now do (Issue #75):
```r
> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:
'/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11920 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
```
Now, contrary to `Rscript`, `R` does not support `--default-packages=...`, so that's ignored and we get a warning:
```r
> cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"))
WARNING: unknown option '--default-packages=datasets,utils,grDevices,graphics,stats,methods'
> cl
Socket cluster with 1 nodes where 1 node is on host 'localhost' (R version 4.1.2 Patched (2021-11-01 r81123), platform x86_64-pc-linux-gnu)
```
In the develop version (commit 22993892), default packages are now set via `R_DEFAULT_PACKAGES` when `Rscript` is not used, e.g.
```r
cl <- parallelly::makeClusterPSOCK(1L, rscript = file.path(R.home("bin"), "R"), rscript_args = c("--no-echo", "--no-restore", "*", "--args"), dryrun = TRUE)
----------------------------------------------------------------------
Manually, start worker #1 on local machine 'localhost' with:
R_DEFAULT_PACKAGES=datasets,utils,grDevices,graphics,stats,methods '/home/hb/software/R-devel/R-4-1-branch/lib/R/bin/R' --no-echo --no-restore -e 'options(socketOptions = "no-delay")' -e 'workRSOCK <- tryCatch(parallel:::.workRSOCK, error=function(e) parallel:::.slaveRSOCK); workRSOCK()' --args MASTER=localhost PORT=11606 OUT=/dev/null TIMEOUT=2592000 XDR=FALSE SETUPTIMEOUT=120 SETUPSTRATEGY=sequential
```
This avoids the above warning.
Currently, this R_DEFAULT_PACKAGES workaround is only applied for locally launched cluster nodes. For remote workers, we'll get a warning that it's not supported.
Update: New argument `rscript_sh` is used to infer whether a cluster node is launched on MS Windows or not. This allowed me to rely on `R_DEFAULT_PACKAGES` also for remote workers.
Argh... so, on MS Windows, `R` does not escape quotes at the CLI like `Rscript` and `Rterm`, cf. https://stat.ethz.ch/pipermail/r-devel/2021-December/081371.html.
So, on MS Windows, the above `R` workaround has to use `Rterm` instead.
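As a small illustration of that choice, the front-end could be picked per platform along these lines. `worker_bin()` is a hypothetical name used only for this sketch, not something parallelly exports.

```r
# Hypothetical helper: pick the front-end to launch a worker with when
# 'Rscript' is not used. On MS Windows, use 'Rterm' rather than 'R',
# because of the quote-escaping difference noted above.
worker_bin <- function(windows = .Platform$OS.type == "windows") {
  file.path(R.home("bin"), if (windows) "Rterm" else "R")
}

bin_windows <- basename(worker_bin(windows = TRUE))
bin_other   <- basename(worker_bin(windows = FALSE))
```

The result could then be passed as the `rscript` argument in the calls shown earlier, together with `rscript_args = c("--no-echo", "--no-restore", "*", "--args")`.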
`makeClusterPSOCK()` gained argument `rscript_envs` for setting environment variables in workers on startup, e.g. `rscript_envs = c(FOO = "3.14", "BAR")`.

Instead of doing this via `-e "Sys.setenv('<name>'='<value>')"` options, can't we do:

This way we can set env vars that need to be set very early on in the R startup process in order to take effect, e.g. `TMPDIR`.

I've verified that the above works on Linux and Windows. Maybe worth adding an internal `with_env()` to make sure things are properly undone for the main R session.
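A minimal sketch of what such a `with_env()` could look like is given below. It assumes the idea is to set the variables temporarily in the main R session so that a process launched inside the call inherits them; this is a guess at the intent, not the helper that ended up in parallelly, and the signature is made up for illustration.

```r
# Hypothetical helper: evaluate 'expr' with environment variables temporarily
# set, restoring the previous state of the main R session afterwards.
with_env <- function(envs, expr) {
  # Remember current values; NA marks variables that were not set
  old <- Sys.getenv(names(envs), unset = NA_character_, names = TRUE)
  on.exit({
    unset <- is.na(old)
    if (any(unset)) Sys.unsetenv(names(old)[unset])
    if (any(!unset)) do.call(Sys.setenv, as.list(old[!unset]))
  }, add = TRUE)
  do.call(Sys.setenv, as.list(envs))
  force(expr)  # evaluated while the variables are set
}

Sys.unsetenv("PI")
# Anything launched inside the call (e.g. via system()) would inherit PI
inside <- with_env(c(PI = "3.14"), Sys.getenv("PI"))
after  <- Sys.getenv("PI", unset = NA)
```

Using `on.exit()` means the original values are restored even if launching the worker fails with an error.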