Hi @FlorianPommerening ,
I created a dask client example for a SLURM cluster; you can find it in PR #1001 under `examples/1_basics/7_parallelization_cluster.py`.
Hey @benjamc, it's great to see progress in this direction.
I would suggest also adding an example that does not require a custom client but rather a standard client, and that shows how to connect manually spawned workers (in case someone doesn't have a SLURM cluster but still wants to do similar things). As a starting point, one could have a look at this example in Auto-sklearn, which can easily be adapted for SMAC.
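For illustration, such a manual setup could look roughly like this (my own sketch, not the Auto-sklearn example; the scheduler-file path and commands are placeholders):

```python
# Spawn the scheduler and workers by hand, e.g. on different machines
# (older distributed versions use the hyphenated dask-scheduler / dask-worker commands):
#
#   dask scheduler --scheduler-file /shared/scheduler.json
#   dask worker    --scheduler-file /shared/scheduler.json
#
from dask.distributed import Client

# A standard client that attaches to the manually spawned scheduler and workers
client = Client(scheduler_file="/shared/scheduler.json")

# ... then hand this client to SMAC via the facade's dask_client argument.
```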
Thanks a lot @benjamc, that was super quick.
Unfortunately, the example doesn't work on our cluster. I changed the name of the queue, increased the number of trials to 1000, and then ran the process on the login node of our cluster. I can see worker jobs spawning on the cluster, but they don't seem to be doing anything. The work is all done on the login node instead (`htop` on the login node shows it under full load; `htop` on the node where the workers are running shows some activity initially as they start up, then nothing). After a while (the main thread on the login node is still running trials at this point) the workers stop.
When I look into the logs in `tmp/smac_dask_slurm/*.err`, I see the following error. Any idea what I'm doing wrong?
2023-05-11 16:00:40,070 - distributed.nanny - INFO - Closing Nanny at 'tcp://[private IP removed]:37187'. Reason: nanny-close
2023-05-11 16:00:40,072 - distributed.dask_worker - INFO - End worker
...
OSError: Timed out trying to connect to tcp://[public IP removed]:41950 after 30 s
...
RuntimeError: Nanny failed to start.
(edit: simplified long log since it is no longer relevant, see below.)
Ok, I figured out that the nanny was not connecting to the workers because I had to specify the "interface" parameter. Otherwise, the public IP of the login node was used, which does not accept connections.
I now no longer see the error, but the work still seems to be done exclusively on the login node.
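For reference, the relevant change on my side was roughly this (a sketch assuming `dask_jobqueue.SLURMCluster` as in the PR's example; queue name and interface are specific to our cluster):

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="our_queue",      # cluster-specific partition/queue name
    cores=1,
    memory="4 GB",
    walltime="00:30:00",
    # Bind workers and the scheduler to the internal network interface instead
    # of the login node's public IP, which does not accept connections.
    interface="ib0",
    scheduler_options={"interface": "ib0"},
)
```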
I managed to get it to work, but I had to make a few additional changes (a rough sketch combining them is below):

- In the `Scenario`, I had to set `n_workers` to the number of workers spawned by Dask. I didn't see this in the example, and without it, SMAC used a `TargetFunctionRunner` instead of a `DaskParallelRunner`, so it ran locally.
- I had to add a `time.sleep(10)` before the call to `optimize()`. Without it, the workers were not ready when the optimization started and the whole process failed with something like "no worker was ever available".
- In the `Intensifier`, I had to increase `retries` a lot. Without this, I often got "Intensifier could not find any new trials." shortly after the optimization started. I don't understand exactly what happened, but it seems like the intensifier schedules some trials, and there are so many workers that all of them can be scheduled in parallel. While they are running, no new trials are scheduled and the queue runs empty.

Maybe some of those points are worth adding to the example.
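Roughly, the working setup looks like this (only a sketch assuming the SMAC 2.x API from the example; `configspace`, `target_function` and `client` come from the rest of my script, and the concrete numbers are placeholders):

```python
import time

from smac import HyperparameterOptimizationFacade, Scenario
from smac.intensifier.intensifier import Intensifier

# 1) Match n_workers to the number of Dask workers that get spawned.
scenario = Scenario(configspace, n_trials=1000, n_workers=4)

# 3) Give the intensifier many more retries so the trial queue does not run
#    empty while all currently scheduled trials are running in parallel.
intensifier = Intensifier(scenario, retries=100)

smac = HyperparameterOptimizationFacade(
    scenario,
    target_function,
    intensifier=intensifier,
    dask_client=client,
)

# 2) Give the Dask workers time to register before optimization starts.
time.sleep(10)
incumbent = smac.optimize()
```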
Hi,
thank you for pointing out the issue with `scenario.n_workers`. We updated the PR to wrap the runner in a dask runner when either `scenario.n_workers > 1` or a dask client is passed. This should be fine now.
For the rest, however, the parallelization example runs fine on our machine.
If you still have trouble, could you provide a minimal working example (assuming you use a SLURM cluster)?
I could not reproduce the problem in the third point (the one about retries of the intensifier), but the second one (sleep before optimize) is reproducible for me. The script I used is available here:
https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

- `benchmarks.py` contains the list of instances and their features.
- `gurobi.py` contains the model (configuration space and trial evaluation function).
- `run_smac.py` contains the actual call to SMAC, the dask client, and so on.
- `setup.sh` shows what software I installed: `gurobipy`, `dask_jobqueue`, `swig`, and SMAC on the development branch as of yesterday (the code needs #997, which I merged locally for previous tests, but it is now already on the dev branch).

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the `time.sleep(10)` in line 61, I get the following output:
[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
incumbent = smac.optimize()
File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
incumbents = self._optimizer.optimize()
File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
self._runner.submit_trial(trial_info=trial_info)
File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.
The first warning about `scenario.n_workers` always shows up when using a Dask client, even when not specifying `n_workers`, but this shouldn't matter, right?
Hi Florian, it might be that the patience is too low. Currently this parameter is not exposed, but as a quick fix you can try setting it here. Maybe adding 10s already suffices.
The warning is just informational: we use the number of workers specified in the dask client, and `scenario.n_workers` is ignored.
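As a user-side alternative to a fixed `time.sleep` before `optimize()`, one could also block until the workers have actually registered. This is plain dask.distributed, not something SMAC does internally (worker count and timeout are placeholders):

```python
# Wait until at least 4 workers have connected (fail after 10 minutes),
# instead of sleeping for a fixed amount of time.
client.wait_for_workers(n_workers=4, timeout=600)
```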
Thanks. This seems to help but it is somewhat complicated to test.
I'll open new issues for the two problems as you suggested by email.
For future reference: the new issues are #1016 and #1017.
Thanks for the issues, I will close this one then. :)
I stumbled on this issue and saw that it was opened just shortly after I started looking for this, what a lucky coincidence. I would be particularly interested in an example that uses dask to run SMAC on a SLURM-based cluster. From looking at dask, this seems like an option, but I don't know how to use it: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html
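For what it's worth, the basic wiring is roughly this (a sketch along the lines of the example in #1001; the cluster settings are placeholders, and `configspace`/`target_function` are whatever you want to optimize):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
from smac import HyperparameterOptimizationFacade, Scenario

# SLURMCluster describes what a single worker job requests from SLURM.
cluster = SLURMCluster(queue="normal", cores=1, memory="4 GB", walltime="00:30:00")
cluster.scale(jobs=4)        # submit 4 worker jobs to SLURM
client = Client(cluster)

scenario = Scenario(configspace, n_trials=100, n_workers=4)
smac = HyperparameterOptimizationFacade(scenario, target_function, dask_client=client)
incumbent = smac.optimize()
```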