futureverse / future.mirai

:rocket: R package future.mirai: A Future API for Parallel Processing using 'mirai'
https://future.mirai.futureverse.org/

Help: slurm cluster example #12

Open kkmann opened 2 months ago

kkmann commented 2 months ago

Hello,

{future.mirai} is a dream :) Any chance we could get a minimal working example for getting this to work on a slurm cluster? I am struggling to connect the dots.

Do I need to set up the daemons manually? https://shikokuchuo.net/mirai/reference/daemons.html?

Thanks to all authors for making this happen :)

HenrikBengtsson commented 2 months ago

@shikokuchuo , what's the most direct way of launching mirai workers on a set of hosts over SSH when we have a vector of local hostnames?

The gist is that with Slurm you can submit a job requesting, say, 50 tasks (= "workers"), for which Slurm may reserve slots across multiple hosts, e.g.

sbatch --ntasks=50 my_script.sh

This will result in my_script.sh being launched on one host, with environment variables indicating which the other hosts are and how many slots each provides. That information is parsed by, and made available via, hostnames <- parallelly::availableWorkers(). From here, the challenge is to launch length(hostnames) mirai workers. Would it be something like:

hostnames <- parallelly::availableWorkers()
library(mirai)
daemons(
  url = host_url(),
  remote = ssh_config(remotes = paste0("ssh://", hostnames))
)

If that works, then:

plan(future.mirai::mirai_cluster)

should make the Futureverse resolve futures via that cluster of mirai workers.
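
As a rough illustration (untested, and assuming the daemons() call above succeeded), resolving a future should then run on one of the launched workers:

library(future)
plan(future.mirai::mirai_cluster)
f <- future(Sys.info()[["nodename"]])  # should be evaluated on one of the mirai daemons
value(f)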

michaelmayer2 commented 2 months ago

The code below uses the slurmR package to flexibly create a pool of compute resources, whose hostnames are then passed to the mirai::daemons() call. Once that is done, a simple plan(mirai_cluster) enables parallel futures as expected.

A couple of points/questions:

library(future.mirai)
library(mirai)
library(furrr)
library(tictoc)
library(dplyr)
library(slurmR)
opts_slurmR$set_opts(mem = "1024m")

# we'd like to run on 10 cores
compute_cores <- 10

# allocate compute nodes via slurmR
cl_slurm <- makeSlurmCluster(compute_cores)

# wrapper to convert the cluster's hostnames into mirai-compatible strings
get_nodes <- function(cl) {
  paste0("ssh://", sapply(seq_along(cl), function(x) cl[[x]]$host))
}

mirai::daemons(
  compute_cores,
  url = host_url(tls = TRUE),
  remote = ssh_config(
    remotes = get_nodes(cl_slurm),
    timeout = 1,
    rscript = paste0(Sys.getenv("R_HOME"), "/bin/Rscript")
  )
)

# let's use mirai_cluster
plan(mirai_cluster)
tic()
nothingness <- future_map(rep.int(2, 10), ~ Sys.sleep(.x))
toc()

# let's use sequential
plan(sequential)
tic()
nothingness <- future_map(rep.int(2, 10), ~ Sys.sleep(.x))
toc()

stopCluster(cl_slurm)

michaelmayer2 commented 2 months ago

Making a bit more progress... While the above statements about SSH usage on an HPC still hold, I went back to first principles and figured out why remote_config() was not working with SLURM's srun: because the Rscript -e ... command is wrapped in double quotes, srun treats the whole quoted string as a single binary and then fails to find it (e.g. srun "echo 30" will fail).

I have been experimenting with removing shQuote() from the relevant bits of the code; while this worked for some use cases, it did not work in general. So I finally decided to change the behaviour of mirai a bit more than expected: instead of dynamically creating strings containing R code to be interpreted by Rscript -e as an expression, I opted to save the R code to a temporary file and then run it with Rscript. The changes made (see the patch below) work for all use cases I checked, including TLS on/off, the classic ssh_config(), etc. The only gap at the moment is that the temp files are not cleaned up.
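
Roughly, the idea is something like the following sketch (not the actual patch; the helper name is made up for illustration):

# sketch only: write the daemon expression to a temp file and run it via
# `Rscript <file>` instead of `Rscript -e '<expr>'`, sidestepping the nested quoting
write_launch_script <- function(expr_string) {  # hypothetical helper
  script <- tempfile(fileext = ".R")            # note: currently never cleaned up
  writeLines(expr_string, script)
  script
}

script <- write_launch_script('mirai::daemon("tcp://host:5555")')
cmd <- paste("Rscript", script)                 # e.g. handed to srun or ssh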

So, using the patch in the mirai package, I get

library(future.mirai)
library(mirai)
library(furrr)
library(dplyr)
library(microbenchmark)

# launch mirai daemons
#
# please note the specification of SLURM resource requirements as args
compute_cores <- 4
mirai::daemons(
  compute_cores,
  url = host_url(ws = TRUE, tls = TRUE),
  remote = remote_config(
    command = "srun",
    args = c("--mem 512", "-n 1", "."),
    rscript = paste0(Sys.getenv("R_HOME"), "/bin/Rscript")
  ),
  dispatcher = TRUE
)

# start mirai_cluster future 
plan(mirai_cluster)

microbenchmark(
  res <- future_map_dbl(
    1:500,
    function(x) mean(runif(180000)),
    .options = furrr_options(seed = TRUE)
  ),
  times = 10
)

Maybe this is something that @shikokuchuo could integrate into mirai? I have to admit I really don't like the idea of creating temporary files, but both the size and number of the files are very small, so execution speed is practically unaffected.

mirai.patch

shikokuchuo commented 2 months ago

@michaelmayer2 thanks for investigating. I'll take a closer look at the shell quoting behaviour of remote_config(). As you rightly point out, writing temporary files is probably not the way to go.

shikokuchuo commented 2 months ago

Michael, in build 9001 (39ce672) the shell quoting has been updated so that the argument passed to Rscript is wrapped in single rather than double quotes. This used to be the case in mirai, but was changed at some point in the interim.

You may test with the R-Universe dev build:

install.packages("mirai", repos = "https://shikokuchuo.r-universe.dev")

I hope this helps with SLURM, but even if not I believe it is safer to shell quote in this way - it may avoid other corner cases. If it doesn't solve the SLURM issue, I have a couple of other ideas, although from the man page for srun it does seem like it should just work.

michaelmayer2 commented 2 months ago

@shikokuchuo - thanks so much for looking into this, Charlie !

I tried with the latest changes but I am sorry to report it is still not working... The crucial bit really seems to be the shQuote() in https://github.com/shikokuchuo/mirai/blob/39ce672609dfbffc0dfd1982a9b12641fea8754d/R/launchers.R#L151

In order to better demonstrate what is going on I have replaced this line with a system() command.

system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)

I then checked the following use cases

  1. classic ssh_config() mirai worker
     a. with shQuote() enabled
     b. without shQuote()
  2. remote_config() mirai worker
     a. with shQuote() enabled
     b. without shQuote()

See the detailed results below; the gist is that 1a and 2b work, while 1b and 2a fail. This is caused by the different behaviour of ssh and srun when it comes to dealing with double quotes.

While srun echo hello works, srun "echo hello" fails because srun cannot find a binary named "echo hello" (it treats the whole quoted command, parameters included, as a single executable).

posit0001@interactive-st-rstudio-1:~/mirai$ srun  "echo hello"
slurmstepd: error: execve(): echo hello: No such file or directory
srun: error: interactive-dy-rstudio-1: task 0: Exited with exit code 2
posit0001@interactive-st-rstudio-1:~/mirai$ srun echo hello
hello

I am not sure how to proceed from here. Happy to supply more information as needed. Maybe we could make the problematic shQuote() optional via an argument?


Case 1a - classic ssh_config mirai worker with shQuote() enabled

Browse[2]> paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" ")
[1] "ssh -o ConnectTimeout=1 -fTp 22 localhost \"/opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\\\"tcp://interactive-st-rstudio-1:42097\\\",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'\""
Browse[2]> system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Browse[2]> daemons()
$connections
[1] 1

$daemons
                                     i online instance assigned complete
tcp://interactive-st-rstudio-1:42097 1      1        1        0        0

Case 1b classic ssh_config mirai worker without shQuote()

Browse[2]> paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" ")
[1] "ssh -o ConnectTimeout=1 -fTp 22 localhost /opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\"tcp://interactive-st-rstudio-1:42097\",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'"
Browse[2]> system(paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" "),wait=FALSE)
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
bash: -c: line 0: syntax error near unexpected token `('
bash: -c: line 0: `/opt/R/4.3.2/lib/R/bin/Rscript -e mirai::daemon("tcp://interactive-st-rstudio-1:42097",rs=c(10407,648977717,1963234418,-2069452469,1499029520,1988279505,1192808542))'

Case 2a remote_config mirai worker with shQuote() enabled

Browse[1]> paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" ")
[1] "srun --mem 512 -n 1 -o slurm.loo \"/opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\\\"tcp://interactive-st-rstudio-1:38907\\\",rs=c(10407,1413271533,1529776586,-351430461,-2090321112,-1063229687,-860424394))'\""
Browse[1]> system(paste(c(command,`[<-`(args, find_dot(args), shQuote(cmd))),collapse=" "),wait=FALSE)
srun: error: interactive-dy-rstudio-1: task 0: Exited with exit code 2

Case 2b remote_config mirai worker without shQuote()

Browse[1]> paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" ")
[1] "srun --mem 512 -n 1 -o slurm.loo /opt/R/4.3.2/lib/R/bin/Rscript -e 'mirai::daemon(\"tcp://interactive-st-rstudio-1:38907\",rs=c(10407,1413271533,1529776586,-351430461,-2090321112,-1063229687,-860424394))'"
Browse[1]> system(paste(c(command,`[<-`(args, find_dot(args), cmd)),collapse=" "),wait=FALSE)
Browse[1]> daemons()
$connections
[1] 1

$daemons
                                     i online instance assigned complete
tcp://interactive-st-rstudio-1:38907 1      1        1        0        0

HenrikBengtsson commented 2 months ago

Thank you both. I've gone through quite a few of these quote-or-not-to-quote and nested-quoting issues in the parallelly package. It grew out of different needs to launch parallel R workers locally, remotely, in Linux containers, over SSH, over qrsh (similar to srun), from and to different operating systems, etc. Have a look at https://parallelly.futureverse.org/reference/makeClusterPSOCK.html and the arguments rshcmd, rshopts, rscript, and rscript_args. Look also at the different examples. Note how both rshcmd and rscript are vectors, and how the first element is treated specially. FWIW, my constraint was to also stay backward compatible with the parallel package, so some solutions might not be the ones you would pick if starting from scratch. @shikokuchuo, I suspect you might have to do something similar in order to support the different types of uses that will be thrown at remote_config() and ssh_config().

PS. @michaelmayer2, the canonical way to get the location of the current Rscript is file.path(R.home("bin"), "Rscript").
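
Putting those pieces together, a rough (untested) sketch of the parallelly route on the same Slurm allocation could look like this, using the makeClusterPSOCK() arguments mentioned above:

library(parallelly)
library(future)

hostnames <- availableWorkers()   # hosts/slots granted by Slurm
cl <- makeClusterPSOCK(
  hostnames,
  rscript = file.path(R.home("bin"), "Rscript")
)
plan(cluster, workers = cl)       # resolve futures on that PSOCK cluster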

HenrikBengtsson commented 2 months ago

I just checked the parallelly code; it suffers from the same problem. I'll see if there's a workaround/hack or if I have to update the package.