NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

rootless podman sees all GPUs despite cgroups setup #585

Open arogozhnikov opened 2 months ago

arogozhnikov commented 2 months ago

There is a related discussion for docker in #211, but for docker it is expected that the root daemon has access to all GPUs.

In my case, I run podman within SLURM, which uses cgroups to control access to devices. CPU isolation works correctly, but GPU isolation does not; compare:

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 bash -c \
  'nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv'
0: name, utilization.gpu [%], memory.used [MiB]
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: name, utilization.gpu [%], memory.used [MiB]
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB

vs

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
  'nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv'
1: name, utilization.gpu [%], memory.used [MiB]
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
1: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: name, utilization.gpu [%], memory.used [MiB]
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
0: NVIDIA A100-SXM4-80GB, 0 %, 0 MiB
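
For comparison, here is a sketch of checking what SLURM itself assigned to each task; which variables are set depends on the cluster's gres/cgroup configuration, so treat the variable names as an assumption:

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 bash -c \
  'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES SLURM_STEP_GPUS=$SLURM_STEP_GPUS"'

I would expect each task to report only the two GPU indices it was allocated, matching the first nvidia-smi output above.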
elezar commented 2 months ago

@arogozhnikov it seems as if specifying --device nvidia.com/gpu=all is overriding the devices that are available to the container.

It would be interesting to see whether running podman in general allows you to modify the cgroup permissions.

Does running:

podman run --rm --device-cgroup-rule="c 195:* rwm" docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash

allow you to access unexpected devices in the container?

arogozhnikov commented 2 months ago
srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device-cgroup-rule="c 195:* rwm" --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
  'nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv'

output:

1: Error: device cgroup rules are not supported in rootless mode or in a user namespace
srun: error: compute-permanent-node-652: task 1: Exited with exit code 125
0: Error: device cgroup rules are not supported in rootless mode or in a user namespace
srun: error: compute-permanent-node-345: task 0: Exited with exit code 125
arogozhnikov commented 2 months ago

(I am not sure what conclusion I should draw, so let me know if that's helpful or if you want me to run something else.)

Running the verbatim command you posted on a worker node gives the same result:

podman run --rm --device-cgroup-rule="c 195:* rwm" docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash
Error: device cgroup rules are not supported in rootless mode or in a user namespace
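
For completeness, a sketch of checking what cgroup/rootless setup podman itself reports (nothing cluster-specific assumed, just grepping podman info):

podman info | grep -i -e cgroup -e rootless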
elezar commented 1 month ago

@arogozhnikov it is my understanding that rootless podman (with runc/crun as the low-level runtime) uses bind mounts for device nodes in the rootless case.
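
One way to confirm this would be to look at the mount table inside the container, e.g. (a sketch; it assumes the Ubuntu-based CUDA image has grep available and that bind-mounted device nodes show up in /proc/self/mountinfo):

podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
  'grep /dev/nvidia /proc/self/mountinfo'

If the nodes are bind mounts, each /dev/nvidia* device should appear as a separate mount entry.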

What does something like:

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device nvidia.com/gpu=0 docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
  'nvidia-smi -L'

yield?

Then what about:

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device nvidia.com/gpu=0 -v /dev/nvidia3:/dev/nvidia3 docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
  'nvidia-smi -L'
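
If the per-GPU requests behave as expected, a possible follow-up sketch would be to derive the CDI device names from the step's allocation (this assumes CUDA_VISIBLE_DEVICES in the step contains host GPU indices that match the CDI device names, which is configuration-dependent):

srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 bash -c \
  'devs=""; for i in ${CUDA_VISIBLE_DEVICES//,/ }; do devs="$devs --device nvidia.com/gpu=$i"; done; podman run --rm $devs docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi -L'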