dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Worker startup timeout leads to inconsistent cluster state #620

Open zmbc opened 11 months ago

zmbc commented 11 months ago

Describe the issue: I am using dask_jobqueue on a Slurm cluster. I noticed that frequently, a cluster that I scaled up to 50 jobs would only actually scale to 45 or so. If I called wait_for_workers, it would hang indefinitely. The Slurm logs showed that the workers that never joined had failed with Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>. However, the SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

Minimal Complete Verifiable Example: I have not yet been able to reproduce this consistently, since the timeouts are intermittent. Any pointers on how to trigger this timeout on demand would be appreciated and would help me create an MCVE.
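In the meantime, here is a sketch of the usage pattern that hits this (the cluster resource values below are placeholders, not my actual configuration, and this won't reproduce the timeout reliably):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the real values depend on the cluster.
cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")
cluster.scale(jobs=50)  # request 50 Slurm jobs

client = Client(cluster)
# Hangs indefinitely when some jobs fail at startup but the cluster
# still lists them as running.
client.wait_for_workers(50)
```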

Anything else we need to know?: Nope

Environment:

guillaumeeb commented 10 months ago

Hi @zmbc,

I've seen that on some HPC clusters. Somehow, launching a lot of jobs that use the same software environment at the same time can slow down Worker startup.

It is true that currently, dask-jobqueue does not enforce the number of Workers that actually start, because there is no way to tell whether Workers are not there because of queue congestion, walltime reached, or, in your case, start-up failure. But I might be missing something; what do you mean by:

SlurmCluster object didn't seem to be picking up that these jobs had failed, and continued to list them as running.

There is probably a way to increase the Worker start-up timeout, though.
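I haven't verified which timeout is firing in your case, so treat the exact culprit as an assumption, but as a sketch, the two knobs I would try raising are the worker's death_timeout and distributed's connect timeout:

```python
import dask
from dask_jobqueue import SLURMCluster

# Assumption: the start-up TimeoutError is governed by the connect timeout
# and/or the worker death timeout; both can be raised.
dask.config.set({"distributed.comm.timeouts.connect": "120s"})

cluster = SLURMCluster(
    cores=4,           # placeholder resources
    memory="16GB",
    death_timeout=300,  # seconds a worker waits for the scheduler before giving up
)
```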

zmbc commented 10 months ago

Somehow, launching a lot of jobs using the same software environment at the same time can cause slow down of Workers starting up.

Ah, that sounds like a good theory for why the timeouts occur: the software environment itself (the code files) is probably stored on a network file system and is being read concurrently on a large number of different nodes.

because there is no way to tell whether Workers are not there because of queue congestion, walltime reached, or, in your case, start-up failure

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.
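My workaround isn't pasted here, but a check along these lines is what I mean; a rough sketch using sacct (the helper name and the source of the job IDs are made up for illustration):

```python
import subprocess

# Slurm states that mean the job will never produce a worker and must be resubmitted.
DEAD_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}

def job_needs_restart(job_id: str) -> bool:
    """Return True if the Slurm job has terminally failed (hypothetical helper)."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-n", "-X", "-o", "State"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # sacct may report states like "CANCELLED by 12345"; compare only the first word.
    return bool(out) and out.split()[0] in DEAD_STATES
```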

guillaumeeb commented 10 months ago

Really? On Slurm, it is trivial to check this: use squeue or sacct to see whether the job is in the failed state. A failed job always needs to be restarted; a pending job doesn't. In fact, I have automated this as a workaround.

Yes, I know that there are means to do this in all the scheduler abstractions. But it might not be so simple to implement in this package; there has already been discussion just about retrieving the real "HPC Scheduler" status: https://github.com/dask/dask-jobqueue/issues/11.

It would be really nice to have contributions here!