Open raldone01 opened 1 year ago
As a maintainer of the NVIDIA Container Toolkit, which provides the functionality that Docker leverages to support the --gpus flag in their CLI, I would prefer that the existing CDI device support in the --device flag be used instead of --gpus.
I would therefore recommend one of the following options:

- The --gpus flag in Podman issues a clear error or warning.
- The --gpus flag is mapped to an equivalent --device flag. For example, --gpus all is mapped to --device=nvidia.com/gpu=all.

Interested in opening a PR?
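The suggested mapping could be sketched as a small shell wrapper (hypothetical: the function name and the value handling are assumptions for illustration, not Podman code):

```shell
# Hypothetical wrapper: translate a Docker-style "--gpus <value>" argument
# into the CDI-style "--device" flag that Podman already understands.
gpus_to_device() {
  case "$1" in
    all) echo "--device=nvidia.com/gpu=all" ;;   # all GPUs
    *)   echo "--device=nvidia.com/gpu=$1" ;;    # a single GPU index
  esac
}

# Example use (expands to: podman run --rm --device=nvidia.com/gpu=all ...):
# podman run --rm $(gpus_to_device all) ubuntu nvidia-smi -L
```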
I am not familiar with the code base and currently have no time available. 😐
I am not sure what Podman is supposed to do with this information.
@elezar would love to meet with you and discuss how we could better integrate NVIDIA into Podman. We have lots of HPC customers and partners who are using NVIDIA devices with Podman (I believe without requiring the hook).
I may have some cycles to look into this starting next week.
@rhatdan we can try to set something up if you like. As a summary, we're pushing CDI as the mechanism for interacting with NVIDIA GPUs going forward. This allows us to focus on generating CDI specifications for supported platforms with the generated specs consumable by all CDI-enabled clients.
We have been working with the HPC community on some of the features that they would like to see to make running containers with a GPU easier. Have a look at https://github.com/containers/podman/pull/19309
Does this help your situation out?
A friendly reminder that this issue had no activity for 30 days.
I see this is still a bug today, but since this uses NVIDIA's container toolkit package and applies to NVIDIA GPUs anyway, I managed to get what I need with the CUDA_VISIBLE_DEVICES
env var. If anybody needs a workaround for NVIDIA:
podman run -e CUDA_VISIBLE_DEVICES=1 ghcr.io/ggerganov/llama.cpp:server-cuda etc...
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
@rhatdan apologies if this conflicts with another fix but it seems to work for me and I'm not sure if it's an acceptable workaround.
@jboero your workaround already assumes that the NVIDIA devices are made available in the container, since setting CUDA_VISIBLE_DEVICES will only affect the selection of devices that are already present. This is most likely because the nvidia-container-runtime-hook was installed and configured at some point.
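Assuming a CDI spec is in place, the two mechanisms can be combined: the --device flag injects the GPUs into the container, and the env var then narrows which ones CUDA uses. A sketch based on the commands in this thread (the image and trailing arguments are placeholders):

```shell
# Inject all GPUs via CDI, then restrict CUDA applications to GPU 1.
podman run --rm \
  --device nvidia.com/gpu=all \
  -e CUDA_VISIBLE_DEVICES=1 \
  ghcr.io/ggerganov/llama.cpp:server-cuda etc...
```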
Note that as mentioned above, we currently recommend using CDI in Podman since this is supported natively.
Please see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html and https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman for links to Podman-specific instructions.
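For reference, the native CDI route from the linked guide comes down to generating a spec and then referencing the devices it defines (commands per the NVIDIA Container Toolkit documentation; the output path is the conventional default and may differ on your system):

```shell
# Generate a CDI specification describing the installed NVIDIA devices.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the spec provides (e.g. nvidia.com/gpu=0).
nvidia-ctk cdi list

# Run a container against one of the listed devices.
sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```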
@elezar Thanks for the tip! You're right, this box has been upgraded in place all the way from F36. I never set up the hooks manually myself, but they were added by NVIDIA's older official nvidia-container-toolkit package (still recommended and supported by NVIDIA on Fedora). It looks like they've updated their empty F39 repo with a few more packages, but still no nvidia-container-toolkit, which is unfortunate. The only way I can get any of this close to working in F39 is to hardcode the NVIDIA repo for F37. I would love to fix NVIDIA's repos, but I think they're still catching up to GCC 13. Is there an official guide for this on Fedora 39? For all practical purposes CUDA_VISIBLE_DEVICES worked fine for me because the old hook automatically included all GPUs. Personally I would love to package (or see packaged) a Fedora RPM including a standard post-script for /etc/cdi/nvidia.yaml. I've been knocking on NVIDIA's door every few years trying to fix packaging from the inside. They're missing Fedora 38 entirely and have a partially complete F39, with F40 just around the corner in April.
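A post-script like the one described might run something along these lines (a hypothetical sketch of what an RPM %post scriptlet could contain, not an existing package):

```shell
# Sketch of an install-time step: regenerate the CDI spec so that
# /etc/cdi/nvidia.yaml matches the locally installed driver.
# No-op if the toolkit is not present; never fail the install.
if command -v nvidia-ctk >/dev/null 2>&1; then
    nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml || :
fi
```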
https://developer.download.nvidia.com/compute/cuda/repos/
@jboero for the NVIDIA Container Toolkit, it is not required to use the CUDA Download repositories. We've recently revamped our packaging to produce a set of deb and rpm packages that are compatible with any platform where the driver can be installed (or should be). This includes all modern Fedora distributions.
You can follow the updated instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf to install the latest version of the NVIDIA Container Toolkit (v1.14.5). Note that this does not install the OCI runtime hook.
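On an rpm-based distribution, the linked instructions amount to adding the toolkit repository and installing from it (commands as documented at the time of writing; check the linked guide for the current repo URL):

```shell
# Add the NVIDIA Container Toolkit repository.
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo \
  | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install the toolkit (does not install the OCI runtime hook).
sudo dnf install -y nvidia-container-toolkit
```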
If you do come across any inconsistencies, please feel free to open an issue against https://github.com/NVIDIA/nvidia-container-toolkit or https://github.com/NVIDIA/cloud-native-docs.
Oh thanks, when did that repo emerge? Is that the favoured repo going forward, or will the standard CUDA repos be updated as well? In my case I also need the cuBLAS packages from the main CUDA repos. Is there any conflict in enabling both at the same time?
The switch to this repo coincided with the v1.14.0 release of the NVIDIA Container Toolkit. We will continue to publish packages there as well as to the CUDA Download repos ... although, as you point out, there may be some delay in getting repos for specific distributions. There should be no problem with having both repos enabled, although the priority of the CUDA repos may mean that the latest versions of the packages are only available if explicitly requested.
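With both repos enabled, one way to see what each offers and to request a newer build explicitly is sketched below (standard dnf usage; the version string matches the v1.14.5 release mentioned above):

```shell
# Show every version of the package available across enabled repos.
dnf --showduplicates list nvidia-container-toolkit

# Request a specific version explicitly if repo priority would
# otherwise select an older build.
sudo dnf install -y nvidia-container-toolkit-1.14.5
```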
Issue Description
Note: docker connects to the podman-docker-emulation-daemon. See also: https://github.com/NVIDIA/nvidia-container-toolkit/issues/126

Steps to reproduce the issue

Works:
sudo docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
Does not work:
sudo docker run --rm --gpus all ubuntu nvidia-smi -L
sudo podman run --rm --gpus all ubuntu nvidia-smi -L
Describe the results you received
The --gpus option is silently ignored.

Describe the results you expected
The --gpus option should work or issue a warning that it has been ignored.

podman info output
Podman in a container
No
Privileged Or Rootless
Privileged
Upstream Latest Release
Yes
Additional environment details
No response
Additional information
No response