Closed: msis closed this issue 3 months ago
Hi,
Can you try specifying it with --gpus=X or --gpus-per-node=Y on the srun command when you start the a2 instance? You can find the reference here: https://slurm.schedmd.com/srun.html#OPT_gpus
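For reference, a minimal sketch of the suggested invocation; the partition name a10040g1gpu and the GPU count of 1 are assumptions taken from this issue, not a verified configuration:

```shell
# Request one GPU explicitly when allocating the interactive session;
# without a GPU request, Slurm may not expose any GPUs to the job.
srun --partition a10040g1gpu --gpus=1 --pty bash -i

# Alternatively, request a fixed number of GPUs on each allocated node:
srun --partition a10040g1gpu --gpus-per-node=1 --pty bash -i

# Inside the allocation, the devices should now be visible:
nvidia-smi
```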
That solves it. I thought that, because of the instance type, there was no need to set the GPU flag.
I can confirm that setting --gpus (or --gres) does the job and GPUs are visible.
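The gres variant mentioned above can be sketched as follows; this assumes gres/gpu is configured in slurm.conf for these nodes, and reuses the partition name from this issue:

```shell
# Equivalent GPU request via generic resources (gres) instead of --gpus:
srun --partition a10040g1gpu --gres=gpu:1 --pty bash -i

# The allocated GPU should now be listed inside the session:
nvidia-smi
```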
Describe the bug
Nodes launched with a modified version of ./examples/ml_slurm.yaml do not seem to see GPUs with CUDA.
Steps to reproduce
Steps to reproduce the behavior:
1. Launch the cluster with ml_slurm_a100.yaml (below).
2. srun --partition a10040g1gpu --pty bash -i
3. conda activate pytorch
4. nvidia-smi, or in a Python console: import torch; torch.cuda.is_available()
Expected behavior
nvidia-smi should list the available GPUs. torch.cuda.is_available() should return True.
Actual behavior
Version (ghpc --version)
Blueprint
If applicable, attach or paste the blueprint YAML used to produce the bug.
Additional context
Add any other context about the problem here.