Closed stuvet closed 1 year ago
What is makeClusterFutureSlurm()?
Sorry, batchtools::makeClusterFunctionsSlurm
My apologies - it's very late/early here!
Corrected in the original post too.
After submitting a pull request for a bugfix in batchtools::waitForFile here, the main issue mentioned above & here has been solved while leaving the batchtools::makeClusterFunctionsSlurm scheduler.latency at 1, but I definitely need to increase fs.latency (currently at 80) despite changing the NFS settings to sync in the Slurm cluster, as hinted at here.
I'm currently doing this by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace, as above.
It'd be great if future.batchtools had a feature to allow me to pass an argument for fs.latency through. Thanks for considering it.
Hey,
I think I just ran into exactly the same problem. Would be great to have that "tweakable"
Thx
I had a terrible time trying to work this one out, but I summarised my problems & solutions here. I didn't end up needing the proposed fs.latency trick - it ended up being a workaround for this.
Though the solutions were posted in the ropensci/targets repo, all the errors were propagating from batchtools via future.batchtools.
Hopefully they'll apply to your situation too.
thx, will look at it right away!
The problem seems to be that "..." cannot be used to pass additional arguments to the cluster function, given how it's called; we would need an additional argument like .cluster_func_args or so...
Uff, that must have been a journey... I am still struggling to see what the switch to basename did exactly. You just sped up the file lookup by 10x?
Would still be good to be able to adjust latency parameters to be on the safe side...
The benchmark figures were just to make sure I'd chosen the right option.
The real problem was a mismatch between the fn passed to batchtools::waitForFile and the list of files to check in path. At least during startup fn was a full file path, while list.files(path, ...) only gave basenames without the full.names = TRUE argument. Though less than pretty, I aimed for the PR to be robust to either a full path or a basename being passed to waitForFile.
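The mismatch is easy to reproduce in a few lines. This is a standalone sketch, not the batchtools internals; the file and directory names are made up for illustration:

```r
# Standalone illustration of the path-vs-basename mismatch: the log file
# exists, but a full path is never found among the basenames that
# list.files() returns by default.
path <- file.path(tempdir(), "logs")
dir.create(path, showWarnings = FALSE)
fn <- file.path(path, "job1.log")
writeLines("ok", fn)

fn %in% list.files(path)                      # FALSE - full path vs basename
fn %in% list.files(path, full.names = TRUE)   # TRUE  - full paths on both sides
basename(fn) %in% list.files(path)            # TRUE  - robust either way
```

Comparing basenames on both sides works regardless of which form the caller passes in, which is why the PR normalised to that.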
Either way, this mismatch led to waitForFile always failing to find the log file (at least when used during startup). If I remember correctly, this induced a race condition between the log file location being registered in the registry & the batchtools::readLog function being called - if no log file was registered before readLog was called, it'd trigger waitForFile, which would always fail with 'Log file .. for job with id .. not available', even though the file was present on disk.
Before tracking down this issue I spent a while tweaking Slurm filesystem mount latencies with no success. I suspect people running on Slurm clusters with plenty of idle workers would never see the error. I was trying to do Slurm on the cheap - provisioning physical nodes as needed - and the provisioning delay meant it was quite likely I hit waitForFile when submitting hundreds of jobs to a Slurm partition which had 0 provisioned workers.
That PR (particularly when run on an underprovisioned Slurm partition) revealed another issue with the Slurm status codes not all being mapped in the default makeClusterFunctionsSlurm, but after both had been updated my targets pipelines have been rock solid for ~6 mo when running via future.batchtools.
Better late than never ... This has been implemented for the next release, e.g.
```r
> library(future.batchtools)
> plan(batchtools_sge, fs.latency = 42, scheduler.latency = 3.141)
> f <- future(42)
> str(f$config$cluster.functions)
List of 11
 $ name                : chr "SGE"
 $ submitJob           :function (reg, jc)
 $ killJob             :function (reg, batch.id)
 $ listJobsQueued      :function (reg)
 $ listJobsRunning     :function (reg)
 $ array.var           : chr NA
 $ store.job.collection: logi TRUE
 $ store.job.files     : logi FALSE
 $ scheduler.latency   : num 3.14
 $ fs.latency          : num 42
 $ hooks               : list()
 - attr(*, "class")= chr "ClusterFunctions"
 - attr(*, "template")= 'fs_path' chr "~/.batchtools.sge.tmpl"
>
```
Until future.batchtools 0.11.0 is on CRAN, use:
```r
remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")
```
future.batchtools 0.11.0 fixing this is now on CRAN.
After a lot of troubleshooting I've submitted a related bug report/feature request here at batchtools.
Long story short, jobs submitted via future.batchtools are timing out in readLog, even though the jobs do exist. This is resolved by altering scheduler.latency & fs.latency.
I'm setting up my futures like this:
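A minimal setup of this kind, assuming a Slurm-backed plan; the template filename and resource names here are placeholders, not the actual configuration:

```r
library(future.batchtools)

# Hypothetical example: a Slurm plan with a local template file.
# Template name and resources are placeholders for illustration only.
plan(batchtools_slurm,
     template = "batchtools.slurm.tmpl",
     resources = list(ncpus = 1, memory = "2G", walltime = 3600))

f <- future(Sys.info()[["nodename"]])
value(f)
```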
(unless I've done something stupid...) for future.batchtools to work reliably in my environment I need to be able to set fs.latency & scheduler.latency from future::tweak, or somewhere else. As far as I can see, these don't currently get passed through to batchtools::makeClusterFunctionsSlurm.
I'm currently getting around this problem by overwriting the default batchtools::makeClusterFunctionsSlurm with assignInNamespace. Setting 70 seconds for scheduler.latency and 10 seconds for fs.latency solves my problem and makes future.batchtools run jobs reliably despite machines being provisioned on demand. Unfortunately this increases the delay for batchtools to recognise that the job has finished. No big deal for long-running jobs, but I've made a feature request at batchtools for the scheduler.latency option to be split, with a new one to cover the initial sleep.
Thanks
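A sketch of that assignInNamespace workaround, assuming the standard makeClusterFunctionsSlurm signature (the wrapper name and the way the defaults are set are mine, not the exact code used):

```r
library(batchtools)

# Hypothetical wrapper: same interface as batchtools::makeClusterFunctionsSlurm(),
# but with larger latency defaults (70 s scheduler, 10 s filesystem).
makeClusterFunctionsSlurmPatched <- function(template = "slurm",
                                             array.jobs = TRUE,
                                             nodename = "localhost",
                                             scheduler.latency = 70,
                                             fs.latency = 10) {
  batchtools::makeClusterFunctionsSlurm(
    template = template, array.jobs = array.jobs, nodename = nodename,
    scheduler.latency = scheduler.latency, fs.latency = fs.latency
  )
}

# Replace the function inside the batchtools namespace so that
# future.batchtools picks up the patched defaults.
assignInNamespace("makeClusterFunctionsSlurm",
                  makeClusterFunctionsSlurmPatched, ns = "batchtools")
```

This is fragile (it patches another package's namespace at runtime), which is exactly why passing fs.latency and scheduler.latency through future.batchtools directly would be preferable.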