FlorianPommerening closed this issue 9 months ago.
Hi Florian, it is hard to recreate the issue on our setup. However, I might have a fix. You could try this branch https://github.com/automl/SMAC3/tree/fix/daskworker Here dask waits for at least one worker to be scheduled. Could you try this and report back if that worked for you?
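For reference, the idea of "wait for at least one worker" can be sketched in plain Python. This is a minimal illustration, not SMAC's actual implementation; the helper name `wait_for_workers` and the polling scheme are made up here:

```python
import time

def wait_for_workers(count_workers, n=1, patience_s=60.0, poll_s=0.5):
    """Poll count_workers() until it reports at least n workers or
    patience_s elapses; return True if enough workers showed up."""
    deadline = time.monotonic() + patience_s
    while time.monotonic() < deadline:
        if count_workers() >= n:
            return True
        time.sleep(poll_s)
    return count_workers() >= n
```

With a dask `Client` one could pass, e.g., `lambda: len(client.scheduler_info()["workers"])` as the counter; dask's own `Client.wait_for_workers(n)` provides similar blocking behavior out of the box.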
@FlorianPommerening Did you check on the branch @benjamc mentioned above?
On my cluster, the fix works as long as I don't pass `worker_extra_args=["--gpus-per-task=2"]`, which ends up in

/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://..166.214:38861 --nthreads 1 --memory-limit 0.93GiB --name dummy-name --nanny --death-timeout 60 --gpus-per-task=2
When I use `job_extra_directives=["--gres=gpu:2"]` instead, however, no GPUs are ever allotted as far as I can tell. There may be another argument through which a GPU request should be passed, but I also print `cluster.job_script()` and it is the same as the job script I write myself (with `#SBATCH --gres=gpu:2`). When I write the job script myself, the GPUs are allotted, but obviously I need to use SMAC.
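For comparison, a hand-written job script along the lines described above might look like the sketch below. The resource sizes, job name, and scheduler address are placeholders; the worker command mirrors the one dask generated earlier in this thread:

```shell
#!/bin/bash
# Hand-written job script sketch; resource sizes are placeholders.
#SBATCH --job-name=dask-worker
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
# The GPU request goes into an #SBATCH directive, not the worker command line:
#SBATCH --gres=gpu:2

# The scheduler address is a placeholder; use the one your dask client prints.
/usr/bin/python3.10 -m distributed.cli.dask_worker tcp://SCHEDULER:PORT \
    --nthreads 1 --memory-limit 0.93GiB --nanny --death-timeout 60
```

This mirrors the distinction discussed above: `job_extra_directives` entries become `#SBATCH` lines in the generated script, while `worker_extra_args` are appended to the worker command itself, which is why a SLURM flag like `--gpus-per-task` does not belong there.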
Hey folks, I started an example in #1064 in which workers are started manually. There's something not yet working in there, but you may use it as a starting point to achieve what you want. In case you get the example working, please consider updating the PR I made.
Sorry for being quiet for so long. Some deadlines and holidays got in the way. I have now tried to reproduce the problem again on the current dev branch but couldn't. The behavior was always somewhat difficult to reproduce because it depends heavily on timing. The patch in https://github.com/automl/SMAC3/tree/fix/daskworker makes sense to me, but since I couldn't reproduce the error, I'm also fine with just closing this issue.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This came up in #998, I'll repeat the relevant parts here for easier reference.
Description
I want to parallelize SMAC on a SLURM cluster. The cluster only schedules new jobs once every 15 seconds.
Steps/Code to Reproduce
An example is available on: https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/
- `benchmarks.py` contains the list of instances and their features.
- `gurobi.py` contains the model (configuration space and trial evaluation function).
- `run_smac.py` contains the actual call to SMAC, the dask client, and so on.
- `setup.sh` shows what software I installed: `gurobipy`, `dask_jobqueue`, `swig`, and SMAC on the development branch as of last week (the code needs #997, which I had merged locally for earlier tests, but it is now on the dev branch). All of this is now in release 2.0.1.

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the `time.sleep(10)` in line 61, I get the following output:
Expected Results
I would expect the runner to wait until the workers are fully scheduled on the grid before giving up on them.
Actual Results
The runner waits for some time (`_patience` in `DaskParallelRunner`) and then counts the worker as failed. If that happens to all workers, the optimization doesn't start and produces the error below. I suspect that if it happens to some but not all workers, the optimization will start but only use the workers that were ready in time.
Either adding a `time.sleep(10)` or setting `my_facade._runner._patience` to 15 before the optimization seemed to fix the issue for me. It is somewhat hard to verify because the bug is not perfectly reproducible. I assume this has to do with the 15-second scheduling frequency of our SLURM cluster: if dask submits the workers just before the next "tick" of SLURM, they are scheduled quickly, but if the submission happens just after a tick, scheduling takes at least 15 seconds. All of this assumes grid resources are available at all; I have not tried this on a busy grid.
Versions
I ran this on a version of the development branch with some feature branches merged. It should be equivalent to what is now release 2.0.1.
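The timing race described under "Actual Results" can be sketched numerically. The 15-second tick comes from the issue; the helper name below is made up for illustration:

```python
def worst_case_wait(tick_s, submit_offset_s):
    """Seconds until the next scheduler tick when a worker is submitted
    submit_offset_s seconds after the previous tick."""
    return (tick_s - submit_offset_s % tick_s) % tick_s

# Submitting right after a tick means waiting almost a full period,
# so a _patience shorter than tick_s can give up on workers that are
# merely still queued.
print(worst_case_wait(15, 14.0))  # 1.0 -- submitted just before a tick
```

This is why bumping `_patience` to at least the cluster's scheduling period (15 seconds here) makes the problem disappear.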