ComparativeGenomicsToolkit / cactus

Official home of genome aligner based upon notion of Cactus graphs

SlurmBatchSystem "does not support any accelerators" when running on a Slurm GPU cluster #887

Open oneillkza opened 1 year ago

oneillkza commented 1 year ago

@glennhickey I've been trying out the latest code in #884 to enable requesting of accelerators from Toil, but am now getting the following error:

  File "/projects/koneill_prj/conda/envs/cactus/lib/python3.10/site-packages/toil/batchSystems/abstractBatchSystem.py", line 347, in _check_accelerator_request
    raise InsufficientSystemResources(requirer, 'accelerators', [], details=[
toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job LastzRepeatMaskJob is requesting [{'count': 1, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [] accelerators that SlurmBatchSystem was configured with. The batch system does not support any accelerators.

I'm not sure whether this is an upstream issue -- i.e. Toil just hasn't implemented support for GPU resources on Slurm yet. @adamnovak is that the case? Or is this something I need to set somewhere?

(Note that we run Nextflow on this cluster pretty regularly, and it has no trouble requesting GPUs from the scheduler and then having individual jobs use the right ones based on $CUDA_VISIBLE_DEVICES.)

adamnovak commented 1 year ago

Sorry, I haven't implemented support for GPUs in the Toil SlurmBatchSystem yet. We don't have a Slurm GPU setup at UCSC yet to try it with, although we should be getting one soon.

How does your Slurm cluster do GPUs @oneillkza? It looks like some (all?) clusters use a generic resource (GRES) of gpu to represent Nvidia CUDA-capable GPUs, so Toil could use --gres=gpu:1 to ask for one of those. But there's also a --gpus option that Slurm can make available under some circumstances; should we pass that one instead?

It seems like Slurm also has AMD ROCm support, but it doesn't really give you a way (beyond the "type", which can be exact model numbers like "a100") to say whether you want a CUDA API or a ROCm API.
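For reference, the two request styles look something like this on the command line (the job script name is a placeholder, and whether --gpus is available depends on how the cluster's select plugin is configured):

```bash
# GRES style: request one GPU through the generic resource interface
sbatch --gres=gpu:1 my_job.sh

# GRES with a type constraint (type names are site-specific, e.g. a100)
sbatch --gres=gpu:a100:1 my_job.sh

# newer --gpus style, where the cluster exposes it
sbatch --gpus=1 my_job.sh
```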

oneillkza commented 1 year ago

Thanks @adamnovak -- yep we use --gres=gpu:1, and I believe the Slurm scheduler sets $CUDA_VISIBLE_DEVICES with the allocated GPUs, which the GPU-enabled software is expected to respect.

(Our cluster is a bunch of servers running NVidia CUDA-capable cards, mainly 3090s, with eight GPUs per node.)
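For concreteness, a minimal sketch of the kind of job script that works here (the resource numbers are just placeholders):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1        # one GPU via GRES
#SBATCH --cpus-per-task=4   # placeholder CPU value
#SBATCH --mem=16G           # placeholder memory value

# Slurm exports the allocated device(s) here; CUDA-aware tools respect it
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi
```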

oneillkza commented 1 year ago

@thiagogenez just noting that to run Cactus on a local Slurm cluster, this is also necessary (i.e. using the latest Cactus code, incorporating https://github.com/ComparativeGenomicsToolkit/cactus/pull/844, as well as waiting for https://github.com/DataBiosphere/toil/issues/4308).
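For context, the kind of invocation this is building towards looks roughly like the following sketch (the jobstore path and example inputs are placeholders, --batchSystem slurm is the Toil option Cactus passes through, and the exact GPU flag may differ between Cactus versions -- check cactus --help):

```bash
# hand job scheduling to Slurm via Toil; jobstore and inputs are placeholders
cactus ./jobstore ./examples/evolverMammals.txt ./evolverMammals.hal \
    --batchSystem slurm \
    --gpu    # GPU-accelerated lastz; flag semantics vary by Cactus release
```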

HFzzzzzzz commented 1 year ago

@thiagogenez just noting that to run Cactus on a local slurm cluster, this is also necessary (ie using the latest code for Cactus, incorporating #844 as well as waiting for DataBiosphere/toil#4308).

Hello, I am using a Slurm cluster to run Cactus, but after I module load cactus I get a toil_worker: command not found error. I don't know if you have encountered this -- how did you run it? Thank you so much for your guidance; I'm a newbie and this has been bugging me for ages.

thiagogenez commented 1 year ago

Hi @790634750, can you share the details of how you are calling Cactus in your Slurm environment and the errors you get, please? Then I can provide better answers. Cheers

HFzzzzzzz commented 1 year ago

Hi @790634750, can you share the details of how you are calling Cactus in your Slurm environment and the errors you get, please? Then I can provide better answers. Cheers

Hi @thiagogenez,
You can take a look at the question I raised, #894. First I tried module load cactus to use the platform's Cactus, and then got the toil_worker error. Then I tried compiling Cactus locally on the cluster, without success. Could it be that I don't have Slurm permissions and can't install some dependencies? I followed the steps below:

```bash
cd cactus
virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:\$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:\$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U -r ./toil-requirement.txt
python3 -m pip install -U .
make
```

An error occurred:

```
/home/apps/soft/anaconda3/2019.10/bin/h5c++: line 304: x86_64-conda_cos6-linux-gnu-c++: command not found
make[3]: *** [../objs/api/impl/halAlignmentInstance.o] Error 127
make[3]: Leaving directory `/home/zhouhf/cactus/submodules/hal/api'
make[2]: *** [api.libs] Error 2
make[2]: Leaving directory `/home/zhouhf/cactus/submodules/hal'
make[1]: *** [suball.hal] Error 2
make[1]: Leaving directory `/home/zhouhf/cactus'
make: *** [all] Error 2
```

I tried conda install -c anaconda gcc_linux-64, but the download fails. I also tried conda install -c bioconda cactus, but that download fails too. How should I run Cactus on a Slurm cluster?

thiagogenez commented 1 year ago

You can take a look at the question I raised, #894. First I tried module load cactus to use the platform's Cactus, and then got the toil_worker error. Then I tried compiling Cactus locally on the cluster, without success. Could it be that I don't have Slurm permissions and can't install some dependencies? I followed the steps below

Hi @790634750, it seems you are getting errors during the Cactus compilation. The error you described is because g++ can't be found in your $PATH. I don't use conda, but I believe the solution in your case is to load the hdf5 and gcc modules provided in your cluster environment before running make.
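Something along these lines, with the caveat that the module names are cluster-specific assumptions:

```bash
# see what compiler/HDF5 modules your cluster actually provides
module avail gcc hdf5

# load a compiler toolchain and HDF5 before building (names may differ)
module load gcc
module load hdf5

make
```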

The easiest way to run Cactus is to use containers. I strongly recommend using the Docker image provided.

If you have Singularity in your cluster (which I believe you might have), you can use it to run the container. Example:

- Step 1: load the Singularity module provided by your cluster

- Step 2: download the container

```bash
# if you don't have a GPU available
singularity pull --name cactus.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0

# if you have a GPU available
singularity pull --name cactus-gpu.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu
```

- Step 3: run the container

```bash
# if you don't have a GPU available
singularity run cactus.sif cactus --help

# if you have a GPU available
singularity run --nv cactus-gpu.sif cactus --help
```
HFzzzzzzz commented 1 year ago

You can take a look at the question I raised, #894. First I tried module load cactus to use the platform's Cactus, and then got the toil_worker error. Then I tried compiling Cactus locally on the cluster, without success. Could it be that I don't have Slurm permissions and can't install some dependencies? I followed the steps below

Hi @790634750, it seems you are getting errors during the Cactus compilation. The error you described is because g++ can't be found in your $PATH. I don't use conda, but I believe the solution in your case is to load the hdf5 and gcc modules provided in your cluster environment before running make.

The easiest way to run Cactus is to use containers. I strongly recommend using the Docker image provided.

If you have Singularity in your cluster (which I believe you might have), you can use it to run the container. Example:

- Step 1: load the Singularity module provided by your cluster

- Step 2: download the container

```bash
# if you don't have a GPU available
singularity pull --name cactus.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0

# if you have a GPU available
singularity pull --name cactus-gpu.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.4.0-gpu
```

- Step 3: run the container

```bash
# if you don't have a GPU available
singularity run cactus.sif cactus --help

# if you have a GPU available
singularity run --nv cactus-gpu.sif cactus --help
```

Hi @thiagogenez,
Thank you very much for your answer, but our Slurm cluster currently does not have Singularity. It does provide different versions of Cactus, and I can module load cactus, but when I run it this way I get a toil_worker: command not found error. How should I solve this?

thiagogenez commented 1 year ago

Hi @thiagogenez, thank you very much for your answer, but our Slurm cluster currently does not have Singularity. It does provide different versions of Cactus, and I can module load cactus, but when I run it this way I get a toil_worker: command not found error. How should I solve this?

I believe there is a misconfiguration of the cactus module in your cluster. The toil_worker binary should be found inside the Cactus python environment.
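A quick sanity check after loading the module might look like this (the binary name is taken from the error above; depending on the Toil version the worker entry point may be spelled with a leading underscore):

```bash
module load cactus

# these should all resolve to paths inside the module's Python environment;
# if nothing prints for the worker, the module's PATH setup is incomplete
command -v cactus
command -v toil_worker || command -v _toil_worker || echo "worker entry point not on PATH"
```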

glennhickey commented 1 year ago

To install the Cactus Python module, download the Cactus binaries here: https://github.com/ComparativeGenomicsToolkit/cactus/releases and install using the linked instructions: BIN_INSTALL.md

You should not need to apt install anything except maybe python3-dev (if one of the pip install commands gives an error). You definitely do not want to be following the "Installing Manually From Source" instructions unless you have a really good reason to be doing so.
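For reference, the binary-release route looks roughly like the sketch below; the version number and download URL are only examples, and BIN_INSTALL.md remains the authoritative set of steps:

```bash
# download and unpack a binary release (version/URL are examples --
# take the real link from the releases page)
wget https://github.com/ComparativeGenomicsToolkit/cactus/releases/download/v2.4.0/cactus-bin-v2.4.0.tar.gz
tar -xzf cactus-bin-v2.4.0.tar.gz
cd cactus-bin-v2.4.0

# create an isolated Python environment and install Cactus into it
virtualenv -p python3 venv-cactus
source venv-cactus/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U .
```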