Open timothydmorton opened 4 years ago
It defaults to the debug partition, which has a 30-minute limit. On the normal partition it should be unlimited, according to https://developer.lsst.io/services/verification.html?highlight=slurm
Maybe we should ask? I think we can also explicitly set a wall time in dask-jobqueue.
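For reference, a minimal sketch of setting the wall time explicitly. `queue` and `walltime` are real `SLURMCluster` keyword arguments; the helper names `slurm_options` and `make_cluster`, and the `cores`, `memory`, and `48:00:00` values, are placeholders I made up for illustration, not settings taken from lsst-dev.

```python
def slurm_options(queue="normal", walltime="48:00:00"):
    """Keyword arguments for dask_jobqueue.SLURMCluster with an explicit time limit.

    All default values here are illustrative assumptions.
    """
    return {
        "queue": queue,        # Slurm partition the worker jobs are submitted to
        "walltime": walltime,  # Slurm time limit requested for each worker job
        "cores": 1,            # placeholder: cores per worker job
        "memory": "4GB",       # placeholder: memory per worker job
    }


def make_cluster(**overrides):
    """Create a SLURMCluster from the options above, with optional overrides."""
    # Imported lazily so this file can be loaded where dask-jobqueue is not installed.
    from dask_jobqueue import SLURMCluster

    return SLURMCluster(**{**slurm_options(), **overrides})
```

Calling `make_cluster(walltime="12:00:00")` and then printing `cluster.job_script()` should show which partition and time limit are actually requested before any workers are started with `cluster.scale(...)`.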
Yes, no time limit:
(lsst-scipipe) [dharhas@lsst-dev01 ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE   NODELIST
debug      up     30:00      3      idle    lsst-verify-worker[46-48]
normal*    up     infinite   12     drain*  lsst-verify-worker[37-40,43-44,55-60]
normal*    up     infinite   8      alloc   lsst-verify-worker[01-08]
normal*    up     infinite   37     idle    lsst-verify-worker[09-36,41-42,45,49-54]
I know the normal queue has default unlimited time; I was wondering whether dask-jobqueue creates the job with an implicit time limit if one is not specifically requested.
I don't know if this is possible at all, but I have several times run into an issue where the dask cluster on lsst-dev (even on the "normal" queue) dies for unknown reasons after some (somewhat long) amount of time. Is there some default implicit time limit in the dask-jobqueue slurm job?
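One way to check this is to look at the job script dask-jobqueue generates and see whether it carries a time directive. Below is a rough sketch of a helper that pulls the `-t` / `--time` value out of a Slurm batch script. The helper name `find_time_limit` and the sample script are hypothetical; in practice the text would come from `cluster.job_script()`, and any limit found there may come from the library's configuration defaults rather than from the partition.

```python
import re

# Matches "#SBATCH -t 00:30:00", "#SBATCH --time=00:30:00" and "#SBATCH --time 00:30:00".
_TIME_DIRECTIVE = re.compile(
    r"^#SBATCH[ \t]+(?:-t[ \t]*|--time(?:=|[ \t]+))(\S+)",
    re.MULTILINE,
)


def find_time_limit(job_script):
    """Return the time limit requested in a Slurm batch script, or None if absent."""
    match = _TIME_DIRECTIVE.search(job_script)
    return match.group(1) if match else None


# Hypothetical sample script, made up for illustration only.
example_script = """#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p normal
#SBATCH -t 00:30:00
"""

print(find_time_limit(example_script))  # prints 00:30:00
```

If this returns a value even though no wall time was passed explicitly, the job is being submitted with an implicit limit, which would explain workers dying on a partition whose own limit is infinite.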