#### Description

Running a `SLURMCluster` with example 7 as a basis works, but when I add `job_extra_directives=["--gres=gpu:2"]` and send a torch tensor `.to('cuda:0')`, it crashes. It may be related to this warning from the dask-jobqueue documentation:

> On some clusters you cannot spawn new jobs when running a SLURMCluster inside a job instead of on the login node. No obvious errors might be raised but it can hang silently.

However, I used `.to('cuda:0')` precisely to make the failure less silent.
#### Steps/Code to Reproduce
"""
Parallelization-on-Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example of applying SMAC to optimize Branin using parallelization via Dask client on a
SLURM cluster. If you do not want to use a cluster but your local machine, set dask_client
to `None` and pass `n_workers` to the `Scenario`.
:warning: On some clusters you cannot spawn new jobs when running a SLURMCluster inside a
job instead of on the login node. No obvious errors might be raised but it can hang silently.
Sometimes you need to modify your launch command which can be done with
`SLURMCluster.job_class.submit_command`.
```python
cluster.job_cls.submit_command = submit_command
cluster.job_cls.cancel_command = cancel_command
Here we optimize the synthetic 2d function Branin.
We use the black-box facade because it is designed for black-box function optimization.
The black-box facade uses a :term:Gaussian Process<GP> as its surrogate model.
The facade works best on a numerical hyperparameter configuration space and should not
be applied to problems with large evaluation budgets (up to 1000 evaluations).
"""
import numpy as np
from ConfigSpace import Configuration, ConfigurationSpace, Float
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
return cs
def train(self, config: Configuration, seed: int = 0) -> float:
#def gpu_checks(self, seed: int = 0, budget: int = 25):
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device) # prints Using device: cpu
b = torch.randn((4, 5))
b.to('cuda:0')
x0 = config["x0"]
x1 = config["x1"]
a = 1.0
b = 5.1 / (4.0 * np.pi**2)
c = 5.0 / np.pi
r = 6.0
s = 10.0
t = 1.0 / (8.0 * np.pi)
ret = a * (x1 - b * x0**2 + c * x0 - r) ** 2 + s * (1 - t) * np.cos(x0) + s
return ret
if name == "main":
model = Branin()
# Scenario object specifying the optimization "environment"
scenario = Scenario(model.configspace, deterministic=True, n_trials=100)
n_workers = 2 # Use 4 workers on the cluster
cluster = SLURMCluster(
# This is the partition of our slurm cluster.
queue="..."
cores=1,
memory="1 GB",
walltime="00:10:00",
processes=1,
log_directory="tmp/smac_dask_slurm",
#worker_extra_args=["--gpus-per-task=2"],
job_extra_directives=["--gres=gpu:2"]
)
cluster.scale(jobs=n_workers)
print(cluster.job_script())
# Dask will create n_workers jobs on the cluster which stay open.
# Then, SMAC/Dask will schedule individual runs
# on the workers like on your local machine.
#client = Client(
# address=cluster,
#)
# Instead, you can also do
client = cluster.get_client()
# Now we use SMAC to find the best hyperparameters
smac = BlackBoxFacade(
scenario,
model.train, # We pass the target function here
overwrite=True, # Overrides any previous results that are found that are inconsistent with the meta-data
dask_client=client,
)
incumbent = smac.optimize()
# Get cost of default configuration
default_cost = smac.validate(model.configspace.get_default_configuration())
print(f"Default cost: {default_cost}")
# Let's calculate the cost of the incumbent
incumbent_cost = smac.validate(incumbent)
print(f"Incumbent cost: {incumbent_cost}")
#### Expected Results
`Using device: cuda`
#### Actual Results
`Using device: cpu`, and then the run crashes:

```
Traceback (most recent call last):
  File "home/example70.py", line 135, in <module>
    incumbent = smac.optimize()
  File "/home/venv/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
    self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 141, in submit_trial
    self._process_pending_trials()
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 208, in _process_pending_trials
    self._results_queue.append(trial.result())
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 320, in result
    return self.client.sync(self._result, callback_timeout=timeout)
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 328, in _result
    raise exc.with_traceback(tb)
distributed.scheduler.KilledWorker: Attempted to run task run_wrapper-59c7552af3ee317d9a0d09c069418d5b on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://10.5.166.193:37123. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
```
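The worker logs end up under the `log_directory` passed to `SLURMCluster` (`tmp/smac_dask_slurm` here). To narrow things down, a quick visibility check run on every worker might also help (a sketch; `check_gpu_visibility` is an illustrative helper, executed with the `client` from the script above):

```python
def check_gpu_visibility():
    # Runs on each worker process; reports whether the SLURM GPU allocation is visible there.
    import os
    import torch

    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(client.run(check_gpu_visibility))  # one entry per worker address
```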
#### Versions
SMAC 2.02 (installed from pip). Note that `smac.__version__` is not available: `AttributeError: module 'smac' has no attribute '__version__'`.