@arogozhnikov it seems as if specifying --device nvidia.com/gpu=all
is overriding the devices that are available to the container.
Here it would be interesting to see whether podman in general allows you to modify the cgroup permissions.
Does running:
podman run --rm --device-cgroup-rule="c 195:* rwm" docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash
allow you to access unexpected devices in the container?
srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device-cgroup-rule="c 195:* rwm" --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
'nvidia-smi --query-gpu=name,utilization.gpu,memory.used --format=csv'
output:
1: Error: device cgroup rules are not supported in rootless mode or in a user namespace
srun: error: compute-permanent-node-652: task 1: Exited with exit code 125
0: Error: device cgroup rules are not supported in rootless mode or in a user namespace
srun: error: compute-permanent-node-345: task 0: Exited with exit code 125
(I am not sure what conclusion I should draw from this, so let me know if that's helpful or if you want me to run something else.)
Running the verbatim command you posted on a worker node gives the same result:
podman run --rm --device-cgroup-rule="c 195:* rwm" docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash
Error: device cgroup rules are not supported in rootless mode or in a user namespace
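(For completeness, a quick sketch of one way to confirm that podman really is running rootless on the worker node; this assumes a recent podman, and the Go-template field name may differ between versions:)
# Should print "true" in a rootless setup.
podman info --format '{{.Host.Security.Rootless}}'
# Or, without depending on the exact template field name:
podman info | grep -i rootless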
@arogozhnikov it is my understanding that podman (with runc or crun as the low-level runtime) uses bind mounts for device nodes in the rootless case.
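One way to check this from inside a container (a sketch, assuming the nvidia.com/gpu CDI spec is already generated, as in the commands above):
# In rootless mode the injected /dev/nvidia* nodes should show up as bind mounts
# in the container's mount table rather than as freshly created device nodes.
podman run --rm --device nvidia.com/gpu=0 docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 \
  sh -c 'grep /dev/nvidia /proc/self/mountinfo'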
What does something like:
srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device nvidia.com/gpu=0 docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
'nvidia-smi -L'
yield?
Then what about:
srun --label --nodes=2 --ntasks-per-node=1 --gpus-per-task=2 podman run --rm --device nvidia.com/gpu=0 -v /dev/nvidia3:/dev/nvidia3 docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 bash -c \
'nvidia-smi -L'
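(To pick the right /dev/nvidiaN for the bind mount in the second command, a rough sketch; it assumes SLURM's gres/gpu setup exports the usual environment variables, which may differ per site configuration:)
# Show which GPU indices SLURM granted to the step, to map them to /dev/nvidiaN on the host.
srun --nodes=1 --ntasks-per-node=1 --gpus-per-task=2 \
  bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES SLURM_JOB_GPUS=$SLURM_JOB_GPUS"'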
There is a related discussion for docker in #211, but for docker it is expected that the root daemon has access to all GPUs.
In my case, I run podman within SLURM, which uses cgroups to control access to devices. CPU virtualization works correctly, but GPU virtualization does not, e.g. compare:
vs
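As an illustration of the kind of comparison meant (a sketch with hypothetical resource requests, not the actual commands or outputs from my cluster):
# CPU request is honoured inside the rootless container:
srun --nodes=1 --ntasks-per-node=1 --cpus-per-task=4 \
  podman run --rm docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 nproc
# -> prints 4, matching --cpus-per-task

# GPU request is not: with --device nvidia.com/gpu=all every GPU on the node becomes visible:
srun --nodes=1 --ntasks-per-node=1 --gpus-per-task=2 \
  podman run --rm --device nvidia.com/gpu=all docker.io/nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi -L
# -> lists all GPUs on the node, not only the 2 allocated by SLURM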