FlorianPommerening closed this issue 9 months ago.
Hi Florian, it is hard to recreate the issue on our setup. However, I might have a fix. You could try this branch https://github.com/automl/SMAC3/tree/fix/daskworker Here dask waits for at least one worker to be scheduled. Could you try this and report back if that worked for you?
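For reference, the idea of "wait for at least one worker" can be sketched in plain Python. This is a minimal illustration, not SMAC's actual implementation; the helper name `wait_for_workers` and the polling scheme are made up here:

```python
import time

def wait_for_workers(count_workers, n=1, patience_s=60.0, poll_s=0.5):
    """Poll count_workers() until it reports at least n workers or
    patience_s elapses; return True if enough workers showed up."""
    deadline = time.monotonic() + patience_s
    while time.monotonic() < deadline:
        if count_workers() >= n:
            return True
        time.sleep(poll_s)
    return count_workers() >= n
```

With a dask `Client` one could pass, e.g., `lambda: len(client.scheduler_info()["workers"])` as the counter; dask's own `Client.wait_for_workers(n)` provides similar blocking behavior out of the box.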
@FlorianPommerening Did you check on the branch @benjamc mentioned above?
On my cluster, the fix works as long as I don't pass `worker_extra_args=["--gpus-per-task=2"]`, which ends up in

/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://..166.214:38861 --nthreads 1 --memory-limit 0.93GiB --name dummy-name --nanny --death-timeout 60 --gpus-per-task=2
When I use `job_extra_directives=["--gres=gpu:2"]` instead, however, no GPUs are ever allotted as far as I can tell. There may be another argument through which a GPU request should be passed, but I also print `cluster.job_script()` and it is the same as the job script I write myself (with `#SBATCH --gres=gpu:2`). When I write the job script myself, the GPUs are allotted, but obviously I need to use SMAC.
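For comparison, a hand-written job script along the lines described above might look like the sketch below. The resource sizes, job name, and scheduler address are placeholders; the worker command mirrors the one dask generated earlier in this thread:

```shell
#!/bin/bash
# Hand-written job script sketch; resource sizes are placeholders.
#SBATCH --job-name=dask-worker
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
# The GPU request goes into an #SBATCH directive, not the worker command line:
#SBATCH --gres=gpu:2

# The scheduler address is a placeholder; use the one your dask client prints.
/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://SCHEDULER:PORT \
    --nthreads 1 --memory-limit 0.93GiB --nanny --death-timeout 60
```

This mirrors the distinction discussed above: `job_extra_directives` entries become `#SBATCH` lines in the generated script, while `worker_extra_args` are appended to the worker command itself, which is why a SLURM flag like `--gpus-per-task` does not belong there.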
Hey folks, I started an example in #1064 in which workers are started manually. There's something not yet working in there, but you may use it as a starting point to achieve what you want. In case you get the example working, please consider updating the PR I made.
Sorry for being quiet for so long. Some deadlines and holidays got in the way. I have now tried to reproduce the problem again on the current dev branch but couldn't. The behavior was always somewhat difficult to reproduce because it depends heavily on timing. The patch in https://github.com/automl/SMAC3/tree/fix/daskworker makes sense to me, but since I couldn't reproduce the error, I'm also fine with just closing this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This came up in #998, I'll repeat the relevant parts here for easier reference.
Description
I want to parallelize SMAC on a SLURM cluster. The cluster only schedules new jobs once every 15 seconds.
Steps/Code to Reproduce
An example is available on: https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/
- `benchmarks.py` contains the list of instances and their features.
- `gurobi.py` contains the model (configuration space and trial evaluation function).
- `run_smac.py` contains the actual call to SMAC, the dask client, and so on.
- `setup.sh` shows what software I installed: `gurobipy`, `dask_jobqueue`, `swig`, and SMAC on the development branch as of last week (the code needs #997, which I had merged locally for earlier tests, but it is now on the dev branch). All of this is now in release 2.0.1.

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the `time.sleep(10)` in line 61, I get the following output:
Expected Results
I would expect the runner to wait until the workers are fully scheduled on the grid before giving up on them.
Actual Results
The runner waits for some time (`_patience` in `DaskParallelRunner`) and then counts the worker as failed. If that happens to all workers, the optimization doesn't start and produces the error below. I suspect that if it happens to some but not all workers, the optimization will start but only use the workers that were ready in time.
Either adding a `time.sleep(10)` or setting `my_facade._runner._patience` to 15 before the optimization seemed to fix the issue for me. It is somewhat hard to verify because the bug is not perfectly reproducible. I assume this has to do with the 15-second scheduling frequency of our SLURM cluster: if dask submits the workers just before the next "tick" of SLURM, they are scheduled quickly, but if the submission happens just after a tick, scheduling takes at least 15 seconds. All of this assumes grid resources are available at all; I have not tried this on a busy grid.
Versions
I ran this on a version of the development branch with some feature branches merged. It should be equivalent to what is now release 2.0.1.
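The timing race described under "Actual Results" can be sketched numerically. The 15-second tick comes from the issue; the helper name below is made up for illustration:

```python
def worst_case_wait(tick_s, submit_offset_s):
    """Seconds until the next scheduler tick when a worker is submitted
    submit_offset_s seconds after the previous tick."""
    return (tick_s - submit_offset_s % tick_s) % tick_s

# Submitting right after a tick means waiting almost a full period,
# so a _patience shorter than tick_s can give up on workers that are
# merely still queued.
print(worst_case_wait(15, 14.0))  # 1.0 -- submitted just before a tick
```

This is why bumping `_patience` to at least the cluster's scheduling period (15 seconds here) makes the problem disappear.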