crinavar closed this issue 3 years ago
Hello,
Are you using ConstrainDevices=yes in cgroup.conf?
Is the environment variable CUDA_VISIBLE_DEVICES set inside the container?
Also, in both cases (1 GPU and 3 GPUs), could you share the output of ls -l /dev/nvidia{0..7} inside the container?
Also, let's see the output of nvidia-smi -q | grep Minor (I want to see whether GPU 0 is /dev/nvidia0 or /dev/nvidia7).
Can you also confirm that you get the same output with srun -p gpu --gres=gpu:A100:1 nvidia-smi -L and srun -p gpu --gres=gpu:A100:3 nvidia-smi -L as when using pyxis/enroot?
Hi flx42, I will reply to both of your messages in this one.
Hello,
Are you using ConstrainDevices=yes in cgroup.conf?
Actually no. Let me know if we need to define it; here is our cgroup.conf (very minimal):
➜ ~ cat /etc/slurm/cgroup.conf
###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
ConstrainCores=yes
#TaskAffinity=yes
Is the environment variable CUDA_VISIBLE_DEVICES set inside the container?
I have not set it manually, at least. If I run a job like this, it is not defined:
➜ ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 echo $CUDA_VISIBLE_DEVICES
➜ ~
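One caveat about that check (a generic shell sketch, not tied to our cluster): in the unquoted form srun ... echo $CUDA_VISIBLE_DEVICES, the variable is expanded by the submitting shell before srun ever runs, so an empty result does not by itself prove the variable is unset inside the job. Single quotes defer the expansion to the job step. The demo below uses a made-up variable MYVAR to illustrate the same quoting behavior locally:

```shell
# MYVAR is a hypothetical stand-in for CUDA_VISIBLE_DEVICES.
unset MYVAR
# Double quotes: $MYVAR is expanded by the *current* shell (where it is unset),
# so the child prints an empty value even though MYVAR=2 is set for the child:
MYVAR=2 sh -c "echo outer:$MYVAR"
# Single quotes: expansion happens inside the child shell, where MYVAR=2 is set:
MYVAR=2 sh -c 'echo inner:$MYVAR'
```

So to inspect the in-job value reliably, something like srun ... sh -c 'echo $CUDA_VISIBLE_DEVICES' would be needed.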
But if one opens a bash session in the container, it does get defined, by Slurm I believe:
cnavarro@nodeGPU01:~$ echo $CUDA_VISIBLE_DEVICES
2
Also, in both cases (1 GPU and 3 GPUs), could you share the output of ls -l /dev/nvidia{0..7} inside the container?
Here is the output for 1 GPU and for 3 GPUs:
➜ ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ls -l /dev/nvidia{0..7}
/usr/bin/ls: cannot access '/dev/nvidia0': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia1': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia3': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia4': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia5': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia6': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia7': No such file or directory
crw-rw-rw- 1 nobody nogroup 195, 2 Apr 18 00:15 /dev/nvidia2
srun: error: nodeGPU01: task 0: Exited with exit code 2
➜ ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ls -l /dev/nvidia{0..7}
/usr/bin/ls: cannot access '/dev/nvidia1': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia4': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia5': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia6': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia7': No such file or directory
crw-rw-rw- 1 nobody nogroup 195, 0 Apr 18 00:15 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 2 Apr 18 00:15 /dev/nvidia2
crw-rw-rw- 1 nobody nogroup 195, 3 Apr 18 00:15 /dev/nvidia3
srun: error: nodeGPU01: task 0: Exited with exit code 2
I also wanted to share the output of nvidia-smi topo -m run natively on the DGX A100 node. We can see that GPUs 2 and 3 come first in order of CPU affinity, followed by GPU0 and GPU1, which seems to be the order Slurm is exploring. By the way, gres.conf is using the AutoDetect=nvml option and detects the same CPU affinities.
➜ ~ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63 3
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63 3
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31 1
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31 1
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95 5
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95 5
mlx5_0 PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_2 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS
mlx5_3 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS
mlx5_4 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS
mlx5_5 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS
mlx5_6 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS
mlx5_7 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
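For reference, the NVML autodetection mentioned above needs only a minimal gres.conf; a sketch of what such a file looks like (our actual file may contain additional entries):

```
# /etc/slurm/gres.conf (sketch; with AutoDetect=nvml, Slurm queries NVML
# at slurmd startup for GPU device files, types, and CPU affinities)
AutoDetect=nvml
```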
Also, let's see the output of nvidia-smi -q | grep Minor (I want to see whether GPU 0 is /dev/nvidia0 or /dev/nvidia7).
With srun + container requesting 1 GPU.
➜ ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --gres=gpu:A100:1 --pty nvidia-smi -q | grep Minor
srun: error: ioctl(TIOCGWINSZ): Inappropriate ioctl for device
srun: error: Not using a pseudo-terminal, disregarding --pty option
Minor Number : 2
➜ ~
With just srun (it was slow; it took a couple of seconds to query each GPU). By the way, this is another minor issue: all GPUs are listed in no-container jobs, but at the time of running GPU code, only the requested number of GPUs will actually run, not all of the listed ones.
➜ ~ srun -p gpu --gres=gpu:A100:1 nvidia-smi -q | grep Minor
Minor Number : 0
Minor Number : 1
Minor Number : 2
Minor Number : 3
Minor Number : 4
Minor Number : 5
Minor Number : 6
Minor Number : 7
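The Minor Number field is what links each nvidia-smi GPU entry to its /dev/nvidiaN node. A small self-contained sketch of extracting the device paths (with a hypothetical two-GPU excerpt hardcoded as sample text, since the real output is much longer):

```shell
# Hypothetical excerpt of `nvidia-smi -q` output, hardcoded for illustration:
sample='GPU 00000000:07:00.0
    Minor Number                          : 2
GPU 00000000:0F:00.0
    Minor Number                          : 0'
# Each "Minor Number : N" line maps that GPU to the /dev/nvidiaN device node:
printf '%s\n' "$sample" | awk -F': ' '/Minor Number/ {print "/dev/nvidia" $2}'
```

In a live job one would pipe the real nvidia-smi -q output instead of the sample text.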
Can you also confirm that you get the same output with srun -p gpu --gres=gpu:A100:1 nvidia-smi -L and srun -p gpu --gres=gpu:A100:3 nvidia-smi -L as when using pyxis/enroot?
I don't get the same output; this is actually what I was mentioning in the parenthesis above. For some reason all GPUs get listed, even though in reality only the requested number will work, with the GPU ID mapping handled transparently.
➜ ~ srun -p gpu --gres=gpu:A100:1 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
➜ ~ srun -p gpu --gres=gpu:A100:3 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
I have added ConstrainDevices=yes in cgroup.conf and it works! I already tested several combinations of requested GPUs and multiple jobs at the same time; it seems to be fixed. Many thanks for the help.
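For anyone hitting the same issue, the resulting cgroup.conf (the minimal file shown earlier, with the fix applied) looks like:

```
###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
#TaskAffinity=yes
```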
That's great to hear! Closing this bug for now.
However, there seems to be a weird interaction between CUDA_VISIBLE_DEVICES and enroot (or libnvidia-container). Given that Slurm was using the CUDA_VISIBLE_DEVICES approach, I was not expecting to see a subset of the /dev/nvidia{0..7} files inside the container. @3XX0 any idea what happened here?
Hello NVIDIA team and community, I first posted this problem on the SLURM user mailing list, but given that I had no reply and it is specific to pyxis, it might be better to report it here.
I am having a strange problem when trying to launch GPU SLURM jobs with the pyxis+enroot plugin. There seems to be a problem with how GPU IDs referenced from inside the container map to the physical GPUs, giving a runtime error in CUDA. When not using containers, the GPU ID mapping works well: we can have multiple Slurm jobs, and each one will see its own GPU0, GPU1, ..., which map to specific physical GPUs that are not necessarily the same indices (the correct transparency we should expect from SLURM).
The system is a DGX A100 with the following GPU UUIDs
In the following lines I will try to explain the problem as clearly as possible. Running nvidia-smi gives:
As we can see, physical GPU2 is allocated (by UUID). From what I understand of SLURM's design, the programmer should not need to know that this GPU has physical ID 2; he/she can just develop the program thinking of GPU 0 in this specific case.
Now, if I launch a containerized job, for example a simple CUDA matrix multiplication using the CUDA container, we get the following error:
The "index=.." value is the GPU index given by NVML. If we do --gres=gpu:A100:3 (while still using just one GPU), the real first GPU gets allocated and the program works, but we know this is not how it should work.
I find it very strange that, when using containers, GPU0 from inside the job seems to be trying to access the machine's real physical GPU0, and not the GPU0 provided by SLURM.
Many thanks in advance for any advice on this issue -- Cristobal