automl / SMAC3

SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization
https://automl.github.io/SMAC3/v2.1.0/

Create example for custom dask client #998

Closed · benjamc closed this 1 year ago

FlorianPommerening commented 1 year ago

I stumbled on this issue and saw that it was opened shortly after I went looking for exactly this; what a lucky coincidence. I would be particularly interested in an example that uses dask to run SMAC on a SLURM-based cluster. From looking at dask, this seems like an option, but I don't know how to use it: https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SLURMCluster.html

benjamc commented 1 year ago

Hi @FlorianPommerening, I created a dask client example for a SLURM cluster; you can find it in PR #1001 under examples/1_basics/7_parallelization_cluster.py.
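The gist of it: build a `dask_jobqueue.SLURMCluster`, wrap it in a `dask.distributed.Client`, and pass that client to the facade. A minimal sketch (partition name, resources, and the toy target function are placeholders, not the exact code from the example):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

from ConfigSpace import ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario


def train(config, seed: int = 0) -> float:
    # Toy objective; replace with your real target function.
    return (config["x"] - 2) ** 2


if __name__ == "__main__":
    cs = ConfigurationSpace({"x": (-5.0, 5.0)})
    scenario = Scenario(cs, deterministic=True, n_trials=100)

    # One Dask worker per SLURM job; adjust partition and resources to your cluster.
    cluster = SLURMCluster(
        queue="partition-name",  # placeholder
        cores=1,
        memory="2 GB",
        walltime="00:30:00",
    )
    cluster.scale(jobs=4)  # request four worker jobs

    client = Client(cluster)

    smac = HyperparameterOptimizationFacade(
        scenario,
        train,
        dask_client=client,
        overwrite=True,
    )
    incumbent = smac.optimize()
```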

mfeurer commented 1 year ago

Hey @benjamc, it's great to see progress in this direction.

I would suggest also adding an example that does not require a custom client, but rather a standard client, and shows how to connect manually spawned workers (in case someone doesn't have a SLURM cluster but still wants to do similar things). As a starting point, one could have a look at this example in Auto-sklearn, which can easily be adapted for SMAC.
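Something along these lines would do, I think (a rough sketch, assuming a scheduler started by hand with `dask scheduler` and workers attached with `dask worker`; the address is a placeholder):

```python
from dask.distributed import Client

from ConfigSpace import ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario


def train(config, seed: int = 0) -> float:
    # Toy objective; replace with your real target function.
    return (config["x"] - 2) ** 2


if __name__ == "__main__":
    cs = ConfigurationSpace({"x": (-5.0, 5.0)})
    scenario = Scenario(cs, deterministic=True, n_trials=100)

    # Connect to a scheduler that was started manually, e.g. with `dask scheduler`.
    # Workers can then be attached from any machine via
    #   dask worker tcp://<scheduler-host>:8786
    client = Client("tcp://<scheduler-host>:8786")  # placeholder address
    client.wait_for_workers(n_workers=1)  # block until at least one worker has joined

    smac = HyperparameterOptimizationFacade(scenario, train, dask_client=client)
    incumbent = smac.optimize()
```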

FlorianPommerening commented 1 year ago

Thanks a lot @benjamc, that was super quick.

FlorianPommerening commented 1 year ago

Unfortunately, the example doesn't work on our cluster. I changed the name of the queue, increased the number of trials to 1000, and then ran the process on the login node of our cluster. I can see worker jobs spawning on the cluster, but they don't seem to be doing anything. The work is all done on the login node instead (htop on the login node shows it under full load; htop on the node where the workers are running shows some activity initially as they start up, then nothing). After a while (the main thread on the login node is still running trials at this point), the workers stop.

When I look into the logs in tmp/smac_dask_slurm/*.err I see the following error. Any idea what I'm doing wrong?

2023-05-11 16:00:40,070 - distributed.nanny - INFO - Closing Nanny at 'tcp://[private IP removed]:37187'. Reason: nanny-close
2023-05-11 16:00:40,072 - distributed.dask_worker - INFO - End worker
...
OSError: Timed out trying to connect to tcp://[public IP removed]:41950 after 30 s
...
RuntimeError: Nanny failed to start.

(edit: simplified long log since it is no longer relevant, see below.)

FlorianPommerening commented 1 year ago

Ok, I figured out why the workers were not connecting: I had to specify the "interface" parameter. Otherwise, the public IP of the login node is used, which does not accept connections.
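Concretely, the fix was passing the cluster-internal network interface to `SLURMCluster` (sketch; "ib0" is just a placeholder for whatever interface your nodes actually use):

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="partition-name",  # placeholder
    cores=1,
    memory="2 GB",
    walltime="00:30:00",
    # Bind scheduler/worker communication to the internal network instead of
    # the login node's public IP; "ib0" is a placeholder for your interface.
    interface="ib0",
)
```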

I now no longer see the error, but the work still seems to be done exclusively on the login node.

FlorianPommerening commented 1 year ago

I managed to get it to work, but I had to make additional changes:

Maybe some of those points are worth adding to the example.

benjamc commented 1 year ago

Hi, thank you for pointing out the issue with scenario.n_workers. We updated the PR to wrap the runner in a dask runner whenever scenario.n_workers > 1 or a dask client is passed, so this should be fine now. As for the rest, the parallelization example runs fine on our machine. If you still have trouble, could you provide a minimal working example (if you use a SLURM cluster)?
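From the user side, either of the following should now result in parallel execution (a sketch with a toy target function, not code from the PR):

```python
from ConfigSpace import ConfigurationSpace
from smac import HyperparameterOptimizationFacade, Scenario


def train(config, seed: int = 0) -> float:
    # Toy objective; replace with your real target function.
    return (config["x"] - 2) ** 2


cs = ConfigurationSpace({"x": (-5.0, 5.0)})

# Option 1: no custom client; n_workers > 1 lets SMAC spawn local Dask
# workers and wrap the runner itself.
scenario = Scenario(cs, deterministic=True, n_trials=50, n_workers=4)
smac = HyperparameterOptimizationFacade(scenario, train, overwrite=True)

# Option 2: pass a dask_client explicitly (e.g. backed by a SLURMCluster);
# the runner is wrapped as well and scenario.n_workers is ignored.
# smac = HyperparameterOptimizationFacade(scenario, train, dask_client=client, overwrite=True)
```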

FlorianPommerening commented 1 year ago

I could not reproduce the problem from the third point (the one about retries of the intensifier), but the second one (the sleep before optimize) is reproducible for me. The script I used is available here:

https://ai.dmi.unibas.ch/_experiments/pommeren/innosuisse/mwe/

I tried cutting the example down to the essentials, but it still optimizes a model that relies on Gurobi and some instances that I unfortunately cannot share. If you have a simpler model you want me to try instead, I can check whether the error still occurs there. So far, if I execute the code as is, everything works fine, but if I remove the time.sleep(10) in line 61, I get the following output:

[WARNING][abstract_facade.py:192] Provided `dask_client`. Ignore `scenario.n_workers`, directly set `n_workers` in `dask_client`.
[INFO][abstract_initial_design.py:147] Using 0 initial design configurations and 1 additional configurations.
[INFO][abstract_intensifier.py:305] Using only one seed for deterministic scenario.
[WARNING][dask_runner.py:127] No workers are available. This could mean workers crashed. Waiting for new workers...
Traceback (most recent call last):
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/./run_smac.py", line 62, in <module>
    incumbent = smac.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/facade/abstract_facade.py", line 303, in optimize
    incumbents = self._optimizer.optimize()
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/main/smbo.py", line 284, in optimize
    self._runner.submit_trial(trial_info=trial_info)
  File "/infai/pommeren/experiments/pommeren/innosuisse/mwe/SMAC3/smac/runner/dask_runner.py", line 130, in submit_trial
    raise RuntimeError(
RuntimeError: Tried to execute a job, but no worker was ever available.This likely means that a worker crashed or no workers were properly configured.

The first warning about scenario.n_workers always shows up when using a Dask client, even when not specifying n_workers, but this shouldn't matter, right?

benjamc commented 1 year ago

Hi Florian, it might be that the patience is too low. Currently we do not have this parameter accessible but as a quick fix you can try to set it up in here. Maybe adding 10s already suffices.
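Alternatively, on the user side you could block until at least one worker has registered before starting the optimization, instead of the fixed sleep (untested sketch; client and smac as in your script):

```python
# Right before the optimize() call: wait until at least one Dask worker has
# registered with the scheduler, rather than sleeping a fixed 10 seconds.
client.wait_for_workers(n_workers=1, timeout=120)
incumbent = smac.optimize()
```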

The warning is just informational: we use the number of workers specified in the dask client, and scenario.n_workers is ignored.

FlorianPommerening commented 1 year ago

Thanks. This seems to help but it is somewhat complicated to test.

I'll open new issues for the two problems as you suggested by email.

FlorianPommerening commented 1 year ago

For future reference: the new issues are #1016 and #1017.

benjamc commented 1 year ago

Thanks for the issues, I will close this one then. :)