NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

SLURM + pyxis-enroot, "no CUDA-capable device is detected" #47

Closed crinavar closed 3 years ago

crinavar commented 3 years ago

Hello NVIDIA team and community. I first posted this problem on the Slurm users mailing list, but since I got no reply there and the problem is specific to pyxis, I think it is better to report it here.

I am having a strange problem when launching GPU Slurm jobs with the pyxis+enroot plugin. There seems to be a problem with how GPU IDs inside the container map to the physical GPUs, which results in a CUDA runtime error. When not using containers, the GPU ID mapping works well: multiple Slurm jobs can each see their own GPU 0, GPU 1, ..., which map to physical GPUs that are not necessarily the same indices, which is exactly the transparency we should expect from Slurm.

The system is a DGX A100 with the following GPU UUIDs:

➜  ~ nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)

In the following lines I will try to explain the problem as clearly as possible. Running nvidia-smi -L through a containerized job gives:

srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 nvidia-smi -L          
GPU 0: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)

As we can see, physical GPU 2 is allocated (note the UUID). As I understand Slurm's model, the programmer should not need to know that this GPU has physical ID 2; they can simply write the program as if it were GPU 0 in this specific case.

Now, if I launch a containerized job, for example a simple CUDA matrix multiplication using the CUDA container, I get the following error:

srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 1
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
Choosing GPU 0
GPUassert: no CUDA-capable device is detected main.cu 112
srun: error: nodeGPU01: task 0: Exited with exit code 100

The "index=.." value is the GPU index given by nvml. If we do --gres=gpu:A100:3 (and still using just one GPU), the real first GPU gets allocated, and the program works, but we know this is not the way it should work.

srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ./prog 0 $((1024*40)) 1
Driver version: 450.102.04
NUM GPUS = 3
Listing devices:
GPU0 A100-SXM4-40GB, index=0, UUID=GPU-baa4736e-088f-77ce-0290-ba745327ca95  -> util = 0%
GPU1 A100-SXM4-40GB, index=1, UUID=GPU-35a012ac-2b34-b68f-d922-24aa07af1be6  -> util = 0%
GPU2 A100-SXM4-40GB, index=2, UUID=GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20  -> util = 0%
Choosing GPU 0
initializing A and B.......done
matmul shared mem..........done: time: 26.546274 secs
copying result to host.....done
verifying result...........done

I find it very strange that, when using containers, GPU 0 inside the job seems to access the machine's real physical GPU 0 rather than the GPU 0 provided by Slurm.
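
For reference, the listing and selection path of ./prog is roughly along the following lines. This is only a simplified sketch based on the output above, not the actual main.cu (the gpuErrchk/gpuAssert names and the argument handling are placeholders):

// sketch.cu: simplified listing/selection path, in the spirit of ./prog
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <nvml.h>

// Error-check macro matching the "GPUassert: ..." message above
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

int main(int argc, char **argv) {
    int dev = (argc > 1) ? atoi(argv[1]) : 0;   // "./prog 0 ..." selects GPU 0

    // List devices through NVML; this is where the "index=.." values come from.
    nvmlInit();
    unsigned int n = 0;
    nvmlDeviceGetCount(&n);
    printf("NUM GPUS = %u\nListing devices:\n", n);
    for (unsigned int i = 0; i < n; ++i) {
        nvmlDevice_t h;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE], uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
        nvmlUtilization_t util;
        nvmlDeviceGetHandleByIndex(i, &h);
        nvmlDeviceGetName(h, name, sizeof(name));
        nvmlDeviceGetUUID(h, uuid, sizeof(uuid));
        nvmlDeviceGetUtilizationRates(h, &util);
        printf("GPU%u %s, index=%u, UUID=%s  -> util = %u%%\n", i, name, i, uuid, util.gpu);
    }
    nvmlShutdown();

    // Select the device through the CUDA runtime. In the failing case above,
    // the first runtime call after this point is what reports
    // "no CUDA-capable device is detected".
    printf("Choosing GPU %d\n", dev);
    gpuErrchk(cudaSetDevice(dev));
    gpuErrchk(cudaFree(0));   // force context creation on the chosen device
    return 0;
}

(Compiled with something like nvcc sketch.cu -o sketch -lnvidia-ml; the matrix multiplication itself is omitted here.)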

Many thanks in advance for any advice on this issue -- Cristobal

flx42 commented 3 years ago

Hello,

Are you using ConstrainDevices=yes in cgroup.conf? Is the environment variable CUDA_VISIBLE_DEVICES set inside the container?

Also, in both cases (1 GPU and 3 GPUs), could you share the output of ls -l /dev/nvidia{0..7} inside the container?

flx42 commented 3 years ago

Also, let's see the output of nvidia-smi -q | grep Minor (want to see if GPU 0 is /dev/nvidia0 or /dev/nvidia7).

Can you also confirm that you get the same output with srun -p gpu --gres=gpu:A100:1 nvidia-smi -L and srun -p gpu --gres=gpu:A100:3 nvidia-smi -L as when using pyxis/enroot?

crinavar commented 3 years ago

Hi flx42, I will reply to your two messages in this one.

Hello,

Are you using ConstrainDevices=yes in cgroup.conf?

Actually not; let me know if we need to set it. Here is our cgroup.conf (it is very minimal):

➜  ~ cat /etc/slurm/cgroup.conf
###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
ConstrainCores=yes
#TaskAffinity=yes

Is the environment variable CUDA_VISIBLE_DEVICES set inside the container?

I have not set it manually, at least. If I run a job like this, nothing is printed (although $CUDA_VISIBLE_DEVICES here is expanded by my local shell before srun runs, so this is not a conclusive test):

➜  ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 echo $CUDA_VISIBLE_DEVICES

➜  ~ 

But if I open a bash session inside the container, it does get defined, by Slurm I believe:

cnavarro@nodeGPU01:~$ echo $CUDA_VISIBLE_DEVICES 
2

Also, in both cases (1 GPU and 3 GPUs), could you share the output of ls -l /dev/nvidia{0..7} inside the container?

Here is the output for 1 GPU and for 3 GPUs:

➜  ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:1 ls -l /dev/nvidia{0..7}  
/usr/bin/ls: cannot access '/dev/nvidia0': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia1': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia3': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia4': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia5': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia6': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia7': No such file or directory
crw-rw-rw- 1 nobody nogroup 195, 2 Apr 18 00:15 /dev/nvidia2
srun: error: nodeGPU01: task 0: Exited with exit code 2
➜  ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --pty --gres=gpu:A100:3 ls -l /dev/nvidia{0..7}
/usr/bin/ls: cannot access '/dev/nvidia1': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia4': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia5': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia6': No such file or directory
/usr/bin/ls: cannot access '/dev/nvidia7': No such file or directory
crw-rw-rw- 1 nobody nogroup 195, 0 Apr 18 00:15 /dev/nvidia0
crw-rw-rw- 1 nobody nogroup 195, 2 Apr 18 00:15 /dev/nvidia2
crw-rw-rw- 1 nobody nogroup 195, 3 Apr 18 00:15 /dev/nvidia3
srun: error: nodeGPU01: task 0: Exited with exit code 2

I also wanted to share the output of nvidia-smi topo -m from the DGX A100 node itself (no container). We can see that GPUs 2 and 3 come first in order of CPU affinity, followed by GPU 0 and GPU 1, which seems to be the order Slurm explores. By the way, gres.conf uses the AutoDetect=nvml option and detects the same CPU affinities (a sketch of that file follows the topology output below).

➜  ~ nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  mlx5_2  mlx5_3  mlx5_4  mlx5_5  mlx5_6  mlx5_7  mlx5_8  mlx5_9  CPU Affinity    NUMA Affinity
GPU0     X  NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
GPU1    NV12     X  NV12    NV12    NV12    NV12    NV12    NV12    PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
GPU2    NV12    NV12     X  NV12    NV12    NV12    NV12    NV12    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
GPU3    NV12    NV12    NV12     X  NV12    NV12    NV12    NV12    SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
GPU4    NV12    NV12    NV12    NV12     X  NV12    NV12    NV12    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
GPU5    NV12    NV12    NV12    NV12    NV12     X  NV12    NV12    SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X  NV12    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X  SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
mlx5_0  PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS SYS SYS SYS SYS     
mlx5_1  PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS SYS SYS SYS SYS     
mlx5_2  SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS SYS SYS     
mlx5_3  SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS SYS SYS     
mlx5_4  SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS SYS SYS     
mlx5_5  SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS SYS SYS     
mlx5_6  SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS  X  PXB SYS SYS     
mlx5_7  SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS PXB  X  SYS SYS     
mlx5_8  SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS  X  PIX     
mlx5_9  SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX  X      

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
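
As mentioned above, gres.conf relies on NVML autodetection; it is essentially just the following (a minimal sketch, the actual file may contain additional entries):

# /etc/slurm/gres.conf (sketch)
AutoDetect=nvml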

Also, let's see the output of nvidia-smi -q | grep Minor (want to see if GPU 0 is /dev/nvidia0 or /dev/nvidia7).

With srun + container, requesting 1 GPU:

➜  ~ srun -p gpu --container-name=cuda-11.2.2 --container-image=cuda-11.2.2 --gres=gpu:A100:1 --pty nvidia-smi -q | grep Minor 
srun: error: ioctl(TIOCGWINSZ): Inappropriate ioctl for device
srun: error: Not using a pseudo-terminal, disregarding --pty option
    Minor Number                          : 2
➜  ~ 

With just srun (it was slow, taking a couple of seconds to query each GPU). By the way, this is another minor issue: all GPUs are listed in non-container jobs, but when actually running GPU code, only the requested number of GPUs can be used, not all of those in the list.

➜  ~ srun -p gpu --gres=gpu:A100:1 nvidia-smi -q | grep Minor
    Minor Number                          : 0
    Minor Number                          : 1
    Minor Number                          : 2
    Minor Number                          : 3
    Minor Number                          : 4
    Minor Number                          : 5
    Minor Number                          : 6
    Minor Number                          : 7

Can you also confirm that you get the same output with srun -p gpu --gres=gpu:A100:1 nvidia-smi -L and srun -p gpu --gres=gpu:A100:3 nvidia-smi -L than when using pyxis/enroot?

I don't get the same output; it is actually what I was mentioning in the parenthesis above. For some reason, all GPUs get listed, even though in reality only the requested number can be used, with the GPU ID mapping handled transparently.

➜  ~ srun -p gpu --gres=gpu:A100:1 nvidia-smi -L             
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
➜  ~ srun -p gpu --gres=gpu:A100:3 nvidia-smi -L
GPU 0: A100-SXM4-40GB (UUID: GPU-baa4736e-088f-77ce-0290-ba745327ca95)
GPU 1: A100-SXM4-40GB (UUID: GPU-d40a3b1b-006b-37de-8b72-669c59d14954)
GPU 2: A100-SXM4-40GB (UUID: GPU-35a012ac-2b34-b68f-d922-24aa07af1be6)
GPU 3: A100-SXM4-40GB (UUID: GPU-b75a4bf8-123b-a8c0-dc75-7709626ead20)
GPU 4: A100-SXM4-40GB (UUID: GPU-9366ff9f-a20a-004e-36eb-8376655b1419)
GPU 5: A100-SXM4-40GB (UUID: GPU-75da7cd5-daf3-10fd-2c3f-56259c1dc777)
GPU 6: A100-SXM4-40GB (UUID: GPU-f999e415-54e5-9d7f-0c4b-1d4d98a1dbfc)
GPU 7: A100-SXM4-40GB (UUID: GPU-cce4a787-1b22-bed7-1e93-612906567a0e)
crinavar commented 3 years ago

I have added ConstrainDevices=yes to cgroup.conf and it works! I have already tested several combinations of requested GPUs and multiple simultaneous jobs, and the problem seems to be fixed. Many thanks for the help.
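
For reference, the working cgroup.conf is now essentially the file shown above plus the new line:

###
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
ConstrainCores=yes
ConstrainDevices=yes
#TaskAffinity=yes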

flx42 commented 3 years ago

That's great to hear! Closing this bug for now.

However, there seems to be a weird interaction between CUDA_VISIBLE_DEVICES and enroot (or libnvidia-container). Given that Slurm was using the CUDA_VISIBLE_DEVICES approach, I was not expecting to see a subset of the /dev/nvidia{0..7} files inside the container. @3XX0 any idea what happened here?
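
A small container-side diagnostic along these lines (a sketch; this hypothetical helper is not part of pyxis or enroot) can help correlate what the CUDA runtime enumerates with CUDA_VISIBLE_DEVICES and the mounted /dev/nvidia* nodes:

// visdev.cu: print CUDA_VISIBLE_DEVICES and the devices the CUDA runtime enumerates
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const char *vis = getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", vis ? vis : "(unset)");

    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        char bus[32];
        cudaGetDeviceProperties(&prop, i);
        cudaDeviceGetPCIBusId(bus, sizeof(bus), i);
        printf("runtime device %d: %s (PCI %s)\n", i, prop.name, bus);
    }
    return 0;
}

Running it under srun with and without the container (together with ls -l /dev/nvidia*) should make any mismatch between the two mechanisms visible.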