#### Description

Running a `SLURMCluster` with example 7 as a basis works, but when I add `job_extra_directives=["--gres=gpu:2"]` and send a torch tensor `.to('cuda:0')`, it crashes. It may be related to this warning from the dask-jobqueue documentation:

> On some clusters you cannot spawn new jobs when running a SLURMCluster inside a job instead of on the login node. No obvious errors might be raised but it can hang silently.

However, I used `.to('cuda:0')` precisely to make the failure less silent.
#### Steps/Code to Reproduce
"""
Parallelization-on-Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
An example of applying SMAC to optimize Branin using parallelization via Dask client on a
SLURM cluster. If you do not want to use a cluster but your local machine, set dask_client
to `None` and pass `n_workers` to the `Scenario`.
:warning: On some clusters you cannot spawn new jobs when running a SLURMCluster inside a
job instead of on the login node. No obvious errors might be raised but it can hang silently.
Sometimes you need to modify your launch command which can be done with
`SLURMCluster.job_class.submit_command`.
```python
cluster.job_cls.submit_command = submit_command
cluster.job_cls.cancel_command = cancel_command
Here we optimize the synthetic 2d function Branin.
We use the black-box facade because it is designed for black-box function optimization.
The black-box facade uses a :term:Gaussian Process<GP> as its surrogate model.
The facade works best on a numerical hyperparameter configuration space and should not
be applied to problems with large evaluation budgets (up to 1000 evaluations).
"""
import numpy as np
from ConfigSpace import Configuration, ConfigurationSpace, Float
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
return cs
def train(self, config: Configuration, seed: int = 0) -> float:
#def gpu_checks(self, seed: int = 0, budget: int = 25):
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device) # prints Using device: cpu
b = torch.randn((4, 5))
b.to('cuda:0')
x0 = config["x0"]
x1 = config["x1"]
a = 1.0
b = 5.1 / (4.0 * np.pi**2)
c = 5.0 / np.pi
r = 6.0
s = 10.0
t = 1.0 / (8.0 * np.pi)
ret = a * (x1 - b * x0**2 + c * x0 - r) ** 2 + s * (1 - t) * np.cos(x0) + s
return ret
if name == "main":
model = Branin()
# Scenario object specifying the optimization "environment"
scenario = Scenario(model.configspace, deterministic=True, n_trials=100)
n_workers = 2 # Use 4 workers on the cluster
cluster = SLURMCluster(
# This is the partition of our slurm cluster.
queue="..."
cores=1,
memory="1 GB",
walltime="00:10:00",
processes=1,
log_directory="tmp/smac_dask_slurm",
#worker_extra_args=["--gpus-per-task=2"],
job_extra_directives=["--gres=gpu:2"]
)
cluster.scale(jobs=n_workers)
print(cluster.job_script())
# Dask will create n_workers jobs on the cluster which stay open.
# Then, SMAC/Dask will schedule individual runs
# on the workers like on your local machine.
#client = Client(
# address=cluster,
#)
# Instead, you can also do
client = cluster.get_client()
# Now we use SMAC to find the best hyperparameters
smac = BlackBoxFacade(
scenario,
model.train, # We pass the target function here
overwrite=True, # Overrides any previous results that are found that are inconsistent with the meta-data
dask_client=client,
)
incumbent = smac.optimize()
# Get cost of default configuration
default_cost = smac.validate(model.configspace.get_default_configuration())
print(f"Default cost: {default_cost}")
# Let's calculate the cost of the incumbent
incumbent_cost = smac.validate(incumbent)
print(f"Incumbent cost: {incumbent_cost}")
#### Expected Results
`Using device: cuda`
#### Actual Results
`Using device: cpu`, and then the run crashes:

```
Traceback (most recent call last):
  File "home/example70.py", line 135, in <module>
    incumbent = smac.optimize()
  File "/home/venv/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
    self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 141, in submit_trial
    self._process_pending_trials()
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 208, in _process_pending_trials
    self._results_queue.append(trial.result())
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 320, in result
    return self.client.sync(self._result, callback_timeout=timeout)
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 328, in _result
    raise exc.with_traceback(tb)
distributed.scheduler.KilledWorker: Attempted to run task run_wrapper-59c7552af3ee317d9a0d09c069418d5b on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://10.5.166.193:37123. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
```
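The worker logs end up under the `log_directory` passed to `SLURMCluster` (`tmp/smac_dask_slurm` here). To narrow things down, a quick visibility check run on every worker might also help (a sketch; `check_gpu_visibility` is an illustrative helper, executed with the `client` from the script above):

```python
def check_gpu_visibility():
    # Runs on each worker process; reports whether the SLURM GPU allocation is visible there.
    import os
    import torch

    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(client.run(check_gpu_visibility))  # one entry per worker address
```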
#### Versions
SMAC 2.02 (installed from pip). Note that `smac.__version__` is not available: `AttributeError: module 'smac' has no attribute '__version__'`.