stuvet opened 3 years ago
This seems to be caused by `future.batchtools` calling `batchtools::getStatus`, which defaults to 'expired' when `is.na(status)`. It doesn't affect `batchtools` itself, since it waits for the log to appear in `waitForFile`.
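To illustrate what I mean (this is just a sketch of my understanding, not batchtools' actual code), the misclassification boils down to something like:

```r
# Illustration only -- not batchtools' actual implementation.
# A job that has been submitted but whose worker is still being provisioned:
job <- data.frame(
  job.id    = 1L,
  submitted = 1622885612,  # submission timestamp is set
  started   = NA_real_,    # the worker hasn't started the job yet
  done      = NA_real_
)
on.system <- integer(0)    # the scheduler doesn't report the job (yet)

# Roughly: "submitted, not done, and not visible on the system" => expired
expired <- !is.na(job$submitted) & is.na(job$done) & !(job$job.id %in% on.system)
sum(expired)  # 1 -> reported as "Expired: 1 (100.0%)"
```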
Here's an example output of `batchtools::getStatus` while the worker is provisioning (& batchtools is still in the `waitForFile` loop):
```
## result of batchtools::getStatus
Status for 1 jobs at 2021-06-05 09:20:42:
  Submitted    : 1 (100.0%)
  -- Queued    : 0 (  0.0%)
  -- Started   : 0 (  0.0%)
  ---- Running : 0 (  0.0%)
  ---- Done    : 0 (  0.0%)
  ---- Error   : 0 (  0.0%)
  ---- Expired : 1 (100.0%)
```
```
## The reg entry during provisioning
   batch.id job.id def.id  submitted started done error mem.used resource.id
1:     5955      1      1 1622885612      NA   NA  <NA>       NA           1
   log.file                            job.hash             job.name status
1:     <NA> jobc59eb7d6a4f24ec81611dd0a592cb9de todo_chains_2e8f2ae4   <NA>
```
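(For anyone wanting to reproduce this kind of snapshot, the future's registry can be loaded directly while the job is pending; the registry path below is just a placeholder for illustration.)

```r
# Placeholder path: the actual registry directory is created by
# future.batchtools under .future/ for each future.
reg <- batchtools::loadRegistry(".future/20210605_092000-xxxxx/registry")
batchtools::getStatus(reg = reg)  # reports Expired: 1 (100.0%) as above
reg$status                        # the raw registry entry shown above
```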
Perhaps an explicit call to `batchtools::waitForFile` in future.batchtools (linked to an option to set `fs.latency` from #73?) could be a workaround. I'll also take a closer look at batchtools to see if I can adapt the `getStatus` result for this case: it may be possible to differentiate 'provisioning' from 'expired' based on whether `log.file` is populated in the reg entry, though this strategy may cause jobs to run indefinitely if they expire before the log file is produced.
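For illustration, a hypothetical helper along those lines (not an existing batchtools or future.batchtools function) might look like:

```r
# Hypothetical sketch only: classify a submitted-but-not-started job as
# "provisioning" while its log.file is still unset, instead of "expired".
job_state <- function(reg, id) {
  jt <- batchtools::getJobTable(ids = id, reg = reg)
  if (!is.na(jt$done))    return("finished")
  if (!is.na(jt$started)) return("running")
  # Submitted but not started: no log.file yet suggests the worker is still
  # being provisioned; the risk is waiting forever if the job expired before
  # ever producing a log file.
  if (is.na(jt$log.file)) return("provisioning")
  "expired"
}
```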
I've taken a shot at this problem & have submitted a PR. With this fix there seems to be no need to change `fs.latency` from its default of 65, so #73 isn't necessary.
@stuvet I tried out the development version of `batchtools` with the proposed change. However, when I/O is lagged on a distributed system, the job still automatically fails once loaded with:

```
Error: Log file '%s' for job with id %i not available
```
I've also attempted a crude hack to set `fs.latency` by setting up a default configuration file at `~/.batchtools.conf.R` (per the `batchtools` registry vignette) with:

```r
cluster.functions = batchtools::makeClusterFunctionsSlurm(
  scheduler.latency = 60,  # default is 1
  fs.latency = 120)        # default is 65
```
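A quick sanity check (sketch) that the file is actually being picked up: a freshly created registry should carry the increased latencies on its cluster functions.

```r
# Throwaway registry in a temp dir; conf.file defaults to ~/.batchtools.conf.R
# if no project-level config is found.
reg <- batchtools::makeRegistry(file.dir = NA)
reg$cluster.functions$scheduler.latency  # expect 60
reg$cluster.functions$fs.latency         # expect 120
```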
However, even with a shim inserted directly into `batchtools`, jobs still fail.
This was my first attempt (before I understood the problem completely) & it didn't solve the problem entirely.
Perhaps this may help: it's obviously written for users of `targets`, but it was hitting the same problem you are via its dependencies on future.batchtools & batchtools. Hope it helps you.

https://github.com/ropensci/targets/discussions/570
Also, if I remember right, there are some tweaks to the mount flags in /etc/fstab that can help files appear faster in heavy I/O scenarios - if the bugfixes don't help, perhaps the workers genuinely can't see the log files before they time out. I'm no expert here, but I did take a long look at those flags & changed some. I'll try to find the Slurm-specific documentation & update with the flags that worked for me.
EDIT: On second thought, is it possible the spike in I/O you're seeing actually reflects (or is associated with) the recruitment of new worker machines? If so, that's exactly what I was seeing without the I/O spike - I was scaling up from 0 workers & hitting this reliably.
I have submitted a simple pull request for a bugfix in `batchtools` which fuelled some of the behaviour mentioned in #73. I'm writing here because this proposed bugfix reveals an error in future.batchtools, though based on the future.debug output I don't believe the two are related - just that the batchtools bug previously threw the error first.
**Describe the bug**
When a Slurm worker is available for the job (& when `batchtools::waitForFile` is not required by `batchtools::getLog` -> `batchtools::readLog`), everything functions correctly (note the status):
But when a Slurm worker needs to be provisioned to run the job (& so `batchtools::waitForFile` will also be called), the initial result of the call to `future.batchtools::status(future)` in `future.batchtools::await` is incorrect:

At this point, logging inserted into `batchtools:::waitForFile` begins to appear. No more future.debug messages appear until the log file has been detected (now, after the proposed bugfix) & `batchtools::waitForFile` exits.

To be clear, the logged output does exist, and continues to be written by the running job after `future.batchtools::await` flags the job as expired & exits.
Please let me know if there's anything else I can do to help resolve this issue.
**Expected behavior**
`future.batchtools::await` waits for the running job to exit, even when workers need to be provisioned or `batchtools::waitForFile` is triggered.
**Session information**