Open zmbc opened 11 months ago
Hi @zmbc,
I've seen that on some HPC clusters. Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker start-up.
It is true that dask-jobqueue currently does not enforce the number of actually started Workers, because there is no way to tell whether Workers are missing due to queue congestion, a reached walltime, or, in your case, a start-up failure. But I might be missing something; what do you mean by:
> SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.
There is probably a way of increasing the Worker start-up timeout, though.
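For instance, something along these lines might help. This is only a minimal sketch: it assumes the timeout being hit is the worker's connect/death timeout, which may not be the exact one firing in your case, and the resource values are placeholders.

```python
# Minimal sketch: lengthen the time a worker has to reach the scheduler.
# Assumes the connect/death timeout is the one firing here; resource
# values are placeholders.
import dask
from dask_jobqueue import SLURMCluster

# Give workers more time to establish a connection to the scheduler.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = SLURMCluster(
    cores=4,
    memory="16GB",
    walltime="01:00:00",
    death_timeout=300,  # seconds a worker waits for the scheduler before giving up
)
cluster.scale(jobs=50)
```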
> Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker start-up.
Ah, that sounds like a good theory for why the timeouts occur: the software environment itself (the code files) is probably stored on a network file system and is being read concurrently on a large number of different nodes.
> because there is no way to tell whether Workers are missing due to queue congestion, a reached walltime, or, in your case, a start-up failure
Really? On Slurm, it is trivial to check this: use `squeue` or `sacct` to see whether the job is in a failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.
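A rough sketch of that kind of check (a hypothetical helper, not the exact workaround; it assumes `sacct` is on the path and the Slurm job IDs are known):

```python
# Rough sketch: classify Slurm jobs by state so failed ones can be resubmitted.
# Hypothetical helper, not the exact workaround mentioned above.
import subprocess

FAILED_STATES = {"FAILED", "TIMEOUT", "CANCELLED", "NODE_FAIL", "OUT_OF_MEMORY"}


def job_state(job_id: str) -> str:
    """Return the Slurm state of a job (e.g. PENDING, RUNNING, FAILED) via sacct."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return out.split()[0] if out else "UNKNOWN"


def failed_jobs(job_ids):
    """Jobs whose workers will never arrive and therefore need resubmission."""
    return [j for j in job_ids if job_state(j) in FAILED_STATES]
```

Pending jobs are simply left alone, while anything in a failed state gets resubmitted.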
> Really? On Slurm, it is trivial to check this: use `squeue` or `sacct` to see whether the job is in a failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.
Yes, I know there are means to do this in all the Scheduler abstractions. But it might not be so simple to implement in this package; there has already been discussion just about retrieving the real "HPC Scheduler" status: https://github.com/dask/dask-jobqueue/issues/11.
It would be really nice to have contributions here!
**Describe the issue**: I am using dask_jobqueue on a Slurm cluster. I noticed that, frequently, a cluster that I scaled up to 50 jobs would only actually scale to 45 or so. If I called `wait_for_workers`, it would hang indefinitely. The Slurm logs showed that the workers that never joined had failed with `Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>`. However, the SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

**Minimal Complete Verifiable Example**: I have yet to be able to consistently reproduce this, since the timeouts are intermittent. Any pointers on how to trigger this timeout on demand would be appreciated and would help me create an MCVE.
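For reference, the general shape of the setup is roughly the following (a sketch, not a reproducer; the resource arguments are placeholders):

```python
# General shape of the setup (sketch, not a reproducer; resources are placeholders).
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=4, memory="16GB", walltime="02:00:00")
cluster.scale(jobs=50)       # often only ~45 workers actually arrive

client = Client(cluster)
client.wait_for_workers(50)  # hangs indefinitely when some jobs fail to start
```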
**Anything else we need to know?**: Nope

**Environment**: