raldone01 opened this issue 1 year ago
One note: `sudo podman run --rm --gpus all ubuntu nvidia-smi -L` would not be expected to work. The `--gpus` flag in podman is a no-op argument that is only provided for compatibility with the Docker CLI.

Which version(s) of the NVIDIA Container Toolkit packages are installed on the system? The `--gpus` flag in Docker requires that the `nvidia-container-runtime-hook` be installed when the docker daemon is started. Does restarting the daemon help in this case?
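As a quick sanity check (a sketch assuming a systemd-managed Docker install; the unit name may differ on your distribution):

```sh
# Confirm the hook shipped with the NVIDIA Container Toolkit is installed
which nvidia-container-runtime-hook
nvidia-container-runtime-hook --version

# Restart the Docker daemon so --gpus can pick the hook up
# (unit name assumes a systemd-based system)
sudo systemctl restart docker
```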
Thanks for your reply.

```console
❯ nvidia-container-runtime-hook --version
NVIDIA Container Runtime Hook version 1.13.5
❯ nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.13.5
❯ ls /usr/share/containers/oci/hooks.d/
00-oci-nvidia-hook.json
```
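For reference, the hook definition that podman reads from that directory is typically shaped roughly like the following (a sketch only; exact paths and arguments depend on the toolkit version and distribution). The `when` clause is what tells the runtime when to inject the hook:

```json
{
  "version": "1.0.0",
  "hook": {
    "path": "/usr/bin/nvidia-container-toolkit",
    "args": ["nvidia-container-toolkit", "prestart"]
  },
  "when": {
    "always": true,
    "commands": [".*"]
  },
  "stages": ["prestart"]
}
```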
If `--gpus` is provided for compatibility and is not properly implemented, it should at least issue a warning that it has been ignored. I could not figure out why it did nothing.

I do not have a docker daemon running, as I use podman (`daemon.json` does not exist). I have a podman daemon running that emulates the docker API.

Does `sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L` load the hooks? How does the container runtime know when to load which hook?

I do not know how docker-compose works internally. If it calls the docker CLI with the `--gpus` option, it is clear why it doesn't work, as podman just ignores it. So far I have not had compatibility issues with docker-compose.

Should I also take this issue to podman, docker-compose, or somewhere else? Do I have any options other than switching back to docker?
> If `--gpus` is provided for compatibility and is not properly implemented it should at least issue a warning that it has been ignored. I could not figure out why it did nothing.
We do not maintain the podman implementation of `--gpus`; you will need to open an issue with podman directly.

As a side note, the addition of `--gpus` (even in docker) should never have happened. It was not a very well thought-out API and has caused us (NVIDIA) many problems since being introduced. The future for GPU support in containerized environments is CDI, which is the API provided by podman with e.g. `--device nvidia.com/gpu=all`. Similar support will be available in docker (as an experimental feature) starting with Docker 25.
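For completeness, the CDI flow with podman looks roughly like this (commands taken from elsewhere in this thread; the spec output path is the conventional one and may differ on your system):

```sh
# Generate a CDI specification describing the GPUs on this host
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the fully-qualified CDI device names that were generated
nvidia-ctk cdi list

# Request a device by its CDI name
sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```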
> We do not maintain the podman implementation of `--gpus`; you will need to open an issue with podman directly.
Done (https://github.com/containers/podman/issues/19330)
Ok, so podman is actually ahead of docker here? Interesting.

The following syntax does not work with my current setup:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          # device_ids: ["GPU-XXX"]
          count: 1
          capabilities: [gpu, utility, compute]
```
I was under the impression that this is the new and recommended approach, even supporting device IDs. I am pretty sure that this is an issue with the docker->podman emulation.

Can you link me to the proper syntax of the `--device` option? I only found information about the mapping syntax: `--device=/dev/sdc:/dev/xvdc:rwm`. If I want to pass through two specific GPUs, can I just write `--device nvidia.com/gpu=GPU-XXX-XXX-1 --device nvidia.com/gpu=GPU-XXX-XXX-2`?

Do you know how/if CDI support is planned in docker-compose? Can you link me to resources about CDI? Can you link me to the documentation listing all valid capability strings? I found conflicting information.

Is there a way to get docker-compose to work?
CDI support in Docker will be an experimental feature in the Docker 25 release, which has not yet gone out. With this in place, CDI should already be supported in Docker Compose with the following `services` section:
```yaml
services:
  test:
    image: application-image
    command: do-the-thing
    deploy:
      resources:
        reservations:
          devices:
            # NVIDIA Devices
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=0
                - nvidia.com/gds=all
                - nvidia.com/mofed=all
            # Intel Devices
            - driver: cdi
              device_ids:
                - intel.com/vf=0
                - intel.com/fpga=fist
```
Note the `cdi` driver and the device IDs specifying fully-qualified CDI device names.

This means that if you run:

`sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml --device-name-strategy=uuid`

then you can specify the devices through `--device=nvidia.com/gpu={{GPU_UUID}}` on the command line or in the Docker Compose specification.
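For example, combining the UUID naming strategy with the compose snippet above would look something like this ({{GPU_UUID}} is a placeholder for a name reported by the generated CDI spec, not a real device):

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            # placeholder; use a name from the generated CDI spec
            - nvidia.com/gpu={{GPU_UUID}}
```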
We are still in the process of improving our documentation around CDI and its use in the various runtimes that support it.
(updated `deviceIDs` to `device_ids`)
Note it may be that Podman does not yet support CDI devices in Docker Compose. If you find that this is the case, please create an issue in Podman and CC me.
Thanks @elezar, I was looking for a way to do this and your comment helped - but I had to change `deviceIDs` to `device_ids`. This is with Docker Compose version `v2.24.7`.
> Thanks @elezar, I was looking for a way to do this and your comment helped - but I had to change `deviceIDs` to `device_ids`. This is with Docker Compose version `v2.24.7`.
Yes, there was a typo in my comment. I have updated it.
Running `docker` and `podman` directly

Works:

```sh
sudo docker run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
sudo podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L
```

Does not work:

```sh
sudo docker run --rm --gpus all ubuntu nvidia-smi -L
sudo podman run --rm --gpus all ubuntu nvidia-smi -L
```

The `--gpus all` commands fail with the following output:

The nvidia device files are also not present.

I have installed `nvidia-container-toolkit` and ran `sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml`.

Running emulated `docker-compose`

`sudo docker-compose up`

I have no idea what is wrong and appreciate any advice.