dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License
1.58k stars 719 forks source link

DASK Deployment using SLURM with GPUs #8857

Closed AquifersBSIM closed 2 months ago

AquifersBSIM commented 2 months ago

Describe the issue: I am running into an issue with deploying dask using LocalCUDACluster() on an HPC. I am trying to do RandomForest, and the amount of data I am inputting exits the limit of a single GPU. Hence, I am trying to utilize several GPUs to split the datasets. To start with I did, the following is just an example script (from DASK GitHub front page) which is shown in the code:

Minimal Complete Verifiable Example:

import glob

def main():

    # Read CSV file in parallel across workers
    import dask_cudf
    df = dask_cudf.read_csv(glob.glob("*.csv"))

    # Fit a NearestNeighbors model and query it
    from cuml.dask.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors = 10, client=client)
    nn.fit(df)
    neighbors = nn.kneighbors(df)

if __name__ == "__main__":

    # Initialize UCX for high-speed transport of CUDA arrays
    from dask_cuda import LocalCUDACluster

    # Create a Dask single-node CUDA cluster w/ one worker per device
    cluster = LocalCUDACluster()

    from dask.distributed import Client
    client = Client(cluster)

    main()

In addition to that, I have this submission script

#!/bin/bash
#
#SBATCH --job-name=dask_examples
#SBATCH --output=dask_examples.txt
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --ntasks-per-node=1
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=5G
#SBATCH --gres=gpu:4
ml conda
conda activate /fred/oz241/BSIM/conda_SVM/SVM
/usr/bin/time -v python 1.py

Error Message

Task exception was never retrieved
future: <Task finished name='Task-543' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/depl
oy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.
Task exception was never retrieved
future: <Task finished name='Task-541' coro=<_wrap_awaitable() done, defined at /fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/depl
oy/spec.py:124> exception=RuntimeError('Worker failed to start.')>
Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1474, in start_unsafe
    raise plugins_exceptions[0]
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 837, in wrapper
    return await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/worker.py", line 1876, in plugin_add
    result = plugin.setup(worker=self)
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/dask_cuda/plugins.py", line 14, in setup
    os.sched_setaffinity(0, self.cores)
    ^^^^^^^^^^^^^^^^^
OSError: [Errno 22] Invalid argument

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 512, in start
    raise self.__startup_exc
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 523, in start
    await wait_for(self.start_unsafe(), timeout=timeout)
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/utils.py", line 1952, in wait_for
    return await fut
           ^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 369, in start_unsafe
    response = await self.instantiate()
               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 452, in instantiate
    result = await self.process.start()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 759, in start
    msg = await self._wait_until_connected(uid)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 901, in _wait_until_connected
    raise msg["exception"]
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/nanny.py", line 965, in run
    async with worker:
    ^^^^^^^^^^^^^^^^^
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 537, in __aenter__
    await self
  File "/fred/oz241/BSIM/conda_SVM/SVM/lib/python3.11/site-packages/distributed/core.py", line 531, in start
    raise RuntimeError(f"{type(self).__name__} failed to start.") from exc
    ^^^^^^^^^^^^^^^^^
RuntimeError: Worker failed to start.

Anything else we need to know?: The traceback was pretty long, I gave only a snippet of it

Environment:

jacobtomlinson commented 2 months ago

Given that this is related to LocalCUDACluster I would recommend opening this issue on https://github.com/rapidsai/dask-cuda instead. Unfortunately GitHub doesn't allow me to transfer this between orgs.

AquifersBSIM commented 2 months ago

Thank you so much @jacobtomlinson for pointing me.

jrbourbeau commented 2 months ago

Thanks @AquifersBSIM @jacobtomlinson. Closing this in favor of https://github.com/rapidsai/dask-cuda/issues/1381