HenrikBengtsson / future.batchtools

:rocket: R package future.batchtools: A Future API for Parallel and Distributed Processing using batchtools
https://future.batchtools.futureverse.org
84 stars 9 forks source link

Slurm readLog() Error - Option to change fs.latency & scheduler.latency from batchtools_slurm or future::tweak #73

Closed stuvet closed 1 year ago

stuvet commented 3 years ago

After a lot of troubleshooting I've submiited a related bug report/feature request here at batchtools.

Long story short, jobs submitted via future.batchtools are timing out in readLog, even though the jobs do exist. This is resolved by altering scheduler.latency & fs.latency.

I'm setting up my futures like this:

slurm<-future::tweak(batchtools_slurm, 
                                   template = 'batchtools.slurm.tmpl', 
                                    workers = 4,
                                    resources = list(
                                       walltime = 7200,
                                       memory = 2048,
                                       ncpus = 4,
                                       ntasks = 4,
                                       partition = 'ph8' # GCE 'n2d-highcpu-8'
                           )
future::plan(slurm)

(unless I've done something stupid...) for future.batchtools to work reliably in my environment I need to be able to set fs.latency & scheduler.latency from future::tweak, or somewhere else. As far as I can see, these don't currently get passed through to batchtools::makeClusterFunctionsSlurm.

I'm currently getting around this problem by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace. Setting 70 seconds for scheduler.latency and 10 seconds for fs.latency solves my problem and makes future.batchtools run jobs reliably desipite the provisioning of machines. Unfortunately this increases the delay for batchtools to recognise that the job has finished. No big deal for long-running jobs, but I've made a feature request at batchtools for the scheduler.latency option to be split, with a new one to cover the initial sleep.

Thanks

HenrikBengtsson commented 3 years ago

What is makeClusterFutureSlurm()?

stuvet commented 3 years ago

Sorry, batchtools::makeClusterFunctionsSlurm

My apologies - it's very late/early here!

Corrected in the original post too.

stuvet commented 3 years ago

After submitting a pull request for a bugfix in batchtools::waitForFile here the main issue mentioned above & here has been solved, while leaving the batchtools::makeClusterFunctionSlurm scheduler.latency at 1, but I definitely need to increasefs.latency (currently at 80) despite changing the nfs settings to sync in the slurm cluster as hinted at here.

I'm currently doing this by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace as above.

It'd be great if future.batchtools had a feature to allow me to pass an argument for fs.latency through. Thanks for considering it.

kkmann commented 2 years ago

Hey,

I think I just ran into exactly the same problem. Would be great to have that "tweakable"

Thx

stuvet commented 2 years ago

I had a terrible time trying to work this one out, but I summarised my problems & solutions here. I didn't end up needing the proposed fs.latency trick - it ended up being a workaround for this.

Though the solutions were posted in the ropensci/targets repo, all the errors were propagating from batchtools via future.batchtools.

Hopefully they'll apply to your situation too.

kkmann commented 2 years ago

thx, will look at it right away!

The problem seems to be that "..." cannot be used to pass additional arguments to the cluster function since it's called as

https://github.com/HenrikBengtsson/future.batchtools/blob/17adb2d383de2a909c4da2703b866f82d6b8f232/R/batchtools_template.R#L181

we would need an additional argument like .cluster_func_args or so...

kkmann commented 2 years ago

Uff, that must have been a journey... I am still struggling to see what the switch to basename did exactly. You just sped up the file lookup by 10x?

Would still be good to be able to adjust latency parameters to be on the safe side...

stuvet commented 2 years ago

The benchmark figures were just to make sure I'd chosen the right option.

There real problem was a mismatch between fn passed to batchtools::waitForFile and the list of files to check in path. At least during startup fn was a full file path, while list.files(path, ...) only gave basenames without the full.names=TRUE argument. Though less than pretty I aimed for the PR to be robust to either full path or basename being passed to waitForFile.

Either way this mismatch led to the function waitForFile always failing to find the log file (at least when used during startup). If I remember correctly, this induced a race condition between the log file location being registered in the registry & the batchtools::readLog function being called - if no log file was registered before readLog was called it'd trigger waitForFile, which would always fail with the 'Log file .. for job with id .. not available', even though it was present on disk.

Before tracking down this issue I spent a while tweaking slurm fllesystem mount latencies with no success. I suspect those people running on Slurm clusters with plenty of idle workers would never see the error. I was trying to do Slurm on the cheap - provisioning physical nodes as needed - the provisioning delay meant it was quite likely I hit waitForFile when submitting hundreds of jobs to a Slurm partition which had 0 provisioned workers.

That PR (particularly when run on an underprovisioned Slurm partition) revealed another issue with the Slurm status codes not all being mapped in the default makeClusterFunctionsSlurm, but after both had been updated my targets pipelines have been rock solid for ~6mo when running via future.batchtools.

HenrikBengtsson commented 1 year ago

Better late than never ... This has been implemented for the next release, e.g.

> library(future.batchtools)
> plan(batchtools_sge, fs.latency = 42, scheduler.latency = 3.141)

> f <- future(42)
> str(f$config$cluster.functions)
List of 11
 $ name                : chr "SGE"
 $ submitJob           :function (reg, jc)  
 $ killJob             :function (reg, batch.id)  
 $ listJobsQueued      :function (reg)  
 $ listJobsRunning     :function (reg)  
 $ array.var           : chr NA
 $ store.job.collection: logi TRUE
 $ store.job.files     : logi FALSE
 $ scheduler.latency   : num 3.14
 $ fs.latency          : num 42
 $ hooks               : list()
 - attr(*, "class")= chr "ClusterFunctions"
 - attr(*, "template")= 'fs_path' chr "~/.batchtools.sge.tmpl"
> 

Until future.batchtools 0.11.0 is on CRAN, use:

remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")
HenrikBengtsson commented 1 year ago

future.batchtools 0.11.0 fixing this is now on CRAN.