dask cluster (sometimes?) dies after some time

Quansight / lsst_dashboard

LSST Dashboard https://quansight.github.io/lsst_dashboard/

BSD 3-Clause "New" or "Revised" License


dask cluster (sometimes?) dies after some time #168

Open timothydmorton opened 4 years ago

timothydmorton commented 4 years ago

I'm not sure this is even possible, but several times now the dask cluster on lsst-dev has died for no apparent reason after running for a fairly long time, even on the "normal" queue. Does the slurm job that dask-jobqueue submits carry some default, implicit time limit?

dharhas commented 4 years ago

It defaults to the debug queue, which has a 30-minute limit. On the normal queue it should be unlimited, according to https://developer.lsst.io/services/verification.html?highlight=slurm

Maybe we should ask? I think we can also set a wall time explicitly in dask-jobqueue.
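
For reference, a minimal sketch of what setting it explicitly could look like. `queue` and `walltime` are keyword arguments of dask-jobqueue's `SLURMCluster`; the helper names and the resource values (`cores`, `memory`, the 8-hour limit) are placeholders for illustration, not what the dashboard currently uses.

```python
def slurm_cluster_kwargs(queue="normal", walltime="08:00:00"):
    """Build keyword arguments for dask_jobqueue.SLURMCluster.

    The defaults here are illustrative assumptions, not project settings.
    """
    return {
        "queue": queue,        # slurm partition to submit worker jobs to
        "walltime": walltime,  # explicit time limit for each worker job
        "cores": 1,            # placeholder resource request
        "memory": "4GB",       # placeholder resource request
    }


def make_cluster(**overrides):
    """Create the cluster; dask_jobqueue is imported only when this is called."""
    from dask_jobqueue import SLURMCluster

    return SLURMCluster(**slurm_cluster_kwargs(**overrides))
```

Calling `make_cluster(walltime="1-00:00:00")` on lsst-dev would then request a one-day limit instead of whatever the configured default is.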

dharhas commented 4 years ago

Yes, no time limit:

(lsst-scipipe) [dharhas@lsst-dev01 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up      30:00      3   idle lsst-verify-worker[46-48]
normal*      up   infinite     12 drain* lsst-verify-worker[37-40,43-44,55-60]
normal*      up   infinite      8  alloc lsst-verify-worker[01-08]
normal*      up   infinite     37   idle lsst-verify-worker[09-36,41-42,45,49-54]
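
If it helps to check this from a script rather than by eye, here is a small sketch that pulls the per-partition time limit out of `sinfo` output like the above. The function name is made up for this example, and it assumes the default column order shown (PARTITION, AVAIL, TIMELIMIT, ...).

```python
SINFO_OUTPUT = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug        up      30:00      3   idle lsst-verify-worker[46-48]
normal*      up   infinite     12 drain* lsst-verify-worker[37-40,43-44,55-60]
normal*      up   infinite      8  alloc lsst-verify-worker[01-08]
normal*      up   infinite     37   idle lsst-verify-worker[09-36,41-42,45,49-54]
"""


def partition_time_limits(sinfo_text):
    """Map each partition name to its TIMELIMIT column."""
    limits = {}
    for line in sinfo_text.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name = fields[0].rstrip("*")  # sinfo marks the default partition with '*'
        limits[name] = fields[2]
    return limits


print(partition_time_limits(SINFO_OUTPUT))
# {'debug': '30:00', 'normal': 'infinite'}
```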

timothydmorton commented 4 years ago

I know the normal queue has default unlimited time; I was wondering whether dask-jobqueue creates the job with an implicit time limit if one is not specifically requested.
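
One way to check would be to look at the batch script dask-jobqueue generates (`cluster.job_script()`) and see whether it contains a time directive. Below is a rough sketch of that check. `sbatch_time_limit` is a hypothetical helper, and `EXAMPLE_SCRIPT` is a made-up sample for exercising it, not real output from lsst-dev.

```python
import re

# Matches "#SBATCH -t <value>" or "#SBATCH --time=<value>" / "--time <value>".
_TIME_DIRECTIVE = re.compile(
    r"^#SBATCH[ \t]+(?:-t[ \t]+|--time[= \t][ \t]*)(\S+)", re.MULTILINE
)


def sbatch_time_limit(job_script):
    """Return the first time limit requested in a slurm batch script, or None."""
    match = _TIME_DIRECTIVE.search(job_script)
    return match.group(1) if match else None


# Made-up sample script, for illustration only.
EXAMPLE_SCRIPT = """\
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p normal
#SBATCH -n 1
#SBATCH -t 01:00:00
"""

print(sbatch_time_limit(EXAMPLE_SCRIPT))

# On the cluster, with a real SLURMCluster instance named `cluster`:
#     print(sbatch_time_limit(cluster.job_script()))
# The configured default can also be read with
#     dask.config.get("jobqueue.slurm.walltime")
```

If the real job script does include a `-t` line when no `walltime` was passed, the limit is coming from the dask-jobqueue configuration rather than from the partition.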