NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

NVIDIA NVS 5400M GPU support #63

Open nandlab opened 1 year ago

nandlab commented 1 year ago

Is it possible to use nvidia-container-toolkit with the notebook NVIDIA NVS 5400M GPU on Linux? The latest compatible driver for it is 390.157. It supports up to CUDA 9.1.

If not, is there an older version of nvidia-container-toolkit that will work with this driver?

P.S.: I would like to use a Docker container with a Gazebo installation with hardware-accelerated graphics.

elezar commented 1 year ago

@nandlab I don't recall which version of the toolkit supports the 390.157 driver. With that said, you may be able to generate a CDI specification on your system using nvidia-ctk cdi generate and then use the generated spec.

Podman (>= 4.1.0) natively supports CDI, and it is also possible to configure the nvidia-container-runtime to perform the device injection when using the Docker CLI.
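
For reference, a minimal sketch of that flow, assuming the spec is written to the default /etc/cdi location and a CDI-capable Podman is available (the image name is only an example):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi -L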

Does the nvidia-ctk cdi generate command generate a spec on your system?

nandlab commented 1 year ago

@elezar Thank you for the fast reply!

sudo nvidia-ctk cdi generate outputs:

INFO[0000] Auto-detected mode as "nvml"                 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
INFO[0000] Selecting /dev/dri/card0 as /dev/dri/card0   
INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128 
nvidia-ctk: symbol lookup error: nvidia-ctk: undefined symbol: nvmlDeviceGetMaxMigDeviceCount

It exits with an error code of 127.

Btw, here is the output of nvidia-smi:

Mon May 22 12:57:20 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.157                Driver Version: 390.157                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 5400M           Off  | 00000000:01:00.0 N/A |                  N/A |
| N/A   42C    P8    N/A /  N/A |     52MiB /   959MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

The output of nvidia-smi is the same in the container, but I still get no hardware-accelerated graphics. Here is how I start an Ubuntu container for testing:

sudo docker run -it -h "$HOSTNAME" -e "DISPLAY=$DISPLAY" -v '/tmp/.X11-unix:/tmp/.X11-unix' -v "$HOME/.Xauthority:/root/.Xauthority" --runtime nvidia --gpus 'all,capabilities=utility' --rm ubuntu

Is there anything else I can try?

elezar commented 1 year ago

OK, that should not fail in this mode, since we don't expect to generate specs for MIG devices in any case. I will create a ticket to track this.

For now, you could use:

nvidia-ctk cdi generate --mode=management

to generate a basic spec with a single device (nvidia.com/gpu=all). Does that produce output?

nandlab commented 1 year ago

nvidia-ctk cdi generate --mode=management also fails:

INFO[0000] Selecting /dev/nvidia-modeset as /dev/nvidia-modeset 
INFO[0000] Selecting /dev/nvidia-uvm as /dev/nvidia-uvm 
INFO[0000] Selecting /dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
INFO[0000] Selecting /dev/nvidiactl as /dev/nvidiactl   
WARN[0000] Could not locate /dev/nvidia-caps/nvidia-cap*: pattern /dev/nvidia-caps/nvidia-cap* not found 
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to get CUDA version: failed to locate libcuda.so: pattern libcuda.so.*.*.* not found

elezar commented 1 year ago

@nandlab which version of the toolkit is this? The final error you're seeing should be addressed in the latest version (v1.13.1), but maybe something was missed in the fix for that.

Note that I have created https://gitlab.com/nvidia/cloud-native/go-nvlib/-/merge_requests/40 to start working on the initial error you're seeing, and will update the NVIDIA Container Toolkit once that is merged.

nandlab commented 1 year ago

My installed version of nvidia-container-toolkit-base is 1.13.1-1 (buster).

elezar commented 1 year ago

Actually, looking at your nvidia-smi output, I would assume that your libcuda library is libcuda.so.390.157 and not libcuda.so.390.157.x, which is the pattern that we're trying to match. I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/397 which should allow this to proceed. Would you be able to test with a build of this executable?
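
As a quick check on the host, something like the following should confirm which libcuda filenames are actually present (the legacy-390xx path shown is just an example location; adjust it to wherever your driver libraries live):

ls -l /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so*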

You should be able to run make docker-cmd-nvidia-ctk to generate a local nvidia-ctk binary with the changes for testing purposes.
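
For example, something along these lines should work, assuming the repository is cloned from the GitLab project linked above and the MR's use-major-minor-for-cuda-version branch is checked out:

git clone https://gitlab.com/nvidia/container-toolkit/container-toolkit.git
cd container-toolkit
git checkout use-major-minor-for-cuda-version
make docker-cmd-nvidia-ctk
# the locally built nvidia-ctk binary can then be used to retry the cdi generate commands above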

nandlab commented 1 year ago

I tried running make docker-cmd-nvidia-ctk on the use-major-minor-for-cuda-version branch, but it aborts with:

if [ x"" = x"" ]; then \
    docker build \
        --progress=plain \
        --build-arg GOLANG_VERSION="1.20.3" \
        --tag nvidia/container-toolkit-build:golang1.20.3 \
        -f docker/Dockerfile.devel \
        docker; \
fi
Sending build context to Docker daemon  20.48kB
Step 1/7 : ARG GOLANG_VERSION=x.x.x
Step 2/7 : FROM golang:${GOLANG_VERSION}
 ---> 4237fa9a9df4
Step 3/7 : RUN go install golang.org/x/lint/golint@6edffad5e6160f5949cdefc81710b2706fbcd4f6
 ---> Using cache
 ---> ac387ef1abdf
Step 4/7 : RUN go install github.com/matryer/moq@latest
 ---> Using cache
 ---> 1b8cb9c74df0
Step 5/7 : RUN go install github.com/gordonklaus/ineffassign@d2c82e48359b033cde9cf1307f6d5550b8d61321
 ---> Using cache
 ---> 60ba1079891b
Step 6/7 : RUN go install github.com/client9/misspell/cmd/misspell@latest
 ---> Using cache
 ---> 3c825ab8aa3d
Step 7/7 : RUN go install github.com/google/go-licenses@latest
 ---> Using cache
 ---> 96a57dc20a94
Successfully built 96a57dc20a94
Successfully tagged nvidia/container-toolkit-build:golang1.20.3
Running 'make cmd-nvidia-ctk' in docker container nvidia/container-toolkit-build:golang1.20.3
docker run \
    --rm \
    -e GOCACHE=/tmp/.cache \
    -v : \
    -w  \
    --user $(id -u):$(id -g) \
    nvidia/container-toolkit-build:golang1.20.3 \
        make cmd-nvidia-ctk
docker: Error response from daemon: the working directory '--user' is invalid, it needs to be an absolute path.
See 'docker run --help'.
make: *** [Makefile:141: docker-cmd-nvidia-ctk] Error 125

It looks like the value passed to -w is empty, so Docker interprets the following --user flag as the working directory.

elezar commented 1 year ago

The make target is:

$(DOCKER_TARGETS): docker-%: .build-image
    @echo "Running 'make $(*)' in docker container $(BUILDIMAGE)"
    $(DOCKER) run \
        --rm \
        -e GOCACHE=/tmp/.cache \
        -v $(PWD):$(PWD) \
        -w $(PWD) \
        --user $$(id -u):$$(id -g) \
        $(BUILDIMAGE) \
            make $(*)

meaning that in your case the PWD envvar / make variable is not set. Could you repeat with:

PWD=$(pwd) make docker-cmd-nvidia-ctk

nandlab commented 1 year ago

PWD=$(pwd) make docker-cmd-nvidia-ctk

With this, the compilation worked fine.

Output of ./nvidia-ctk cdi generate (did not change):

INFO[0000] Auto-detected mode as "nvml"                 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
INFO[0000] Selecting /dev/dri/card0 as /dev/dri/card0   
INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128 
./nvidia-ctk: symbol lookup error: ./nvidia-ctk: undefined symbol: nvmlDeviceGetMaxMigDeviceCount

Output of ./nvidia-ctk cdi generate --mode=management (looks good but there are a few warnings):

INFO[0000] Selecting /dev/nvidia-modeset as /dev/nvidia-modeset 
INFO[0000] Selecting /dev/nvidia-uvm as /dev/nvidia-uvm 
INFO[0000] Selecting /dev/nvidia-uvm-tools as /dev/nvidia-uvm-tools 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
INFO[0000] Selecting /dev/nvidiactl as /dev/nvidiactl   
WARN[0000] Could not locate /dev/nvidia-caps/nvidia-cap*: pattern /dev/nvidia-caps/nvidia-cap* not found 
INFO[0000] Using driver version 390.157                 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157 
WARN[0000] Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found 
WARN[0000] Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found 
WARN[0000] Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found 
WARN[0000] Could not locate /lib/firmware/nvidia/390.157/gsp*.bin: pattern /lib/firmware/nvidia/390.157/gsp*.bin not found 
INFO[0000] Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi 
INFO[0000] Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump 
INFO[0000] Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced 
WARN[0000] Could not locate nvidia-cuda-mps-control: pattern nvidia-cuda-mps-control not found 
WARN[0000] Could not locate nvidia-cuda-mps-server: pattern nvidia-cuda-mps-server not found 
INFO[0000] Generated CDI spec with version 0.3.0        
cdiVersion: 0.3.0
containerEdits:
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  mounts:
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-smi
    hostPath: /usr/bin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-debugdump
    hostPath: /usr/bin/nvidia-debugdump
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-persistenced
    hostPath: /usr/bin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/nvidiactl
    - path: /dev/nvidia-modeset
    - path: /dev/nvidia-uvm
    - path: /dev/nvidia-uvm-tools
  name: all
kind: nvidia.com/gpu

Can the warnings be ignored?

elezar commented 1 year ago

Thanks for the update. Those warnings are expected in this case.

Note that I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/398 which should fix the nvidia-ctk cdi generate (with default mode) command. Would you also be able to test that build?

nandlab commented 1 year ago

Thank you for the support!

sudo podman run -ti --rm --device=nvidia.com/gpu=0 ubuntu:18.04 nvidia-smi -L says:

Error: stat nvidia.com/gpu=0: no such file or directory

My Podman version is 3.0.1. Is my Podman too old for CDI? Can the CDI YAML be passed with a different option?

elezar commented 1 year ago

A Podman version of at least 4.1.0 would be required for native CDI support. If this cannot be installed or built from source, an alternative is:

  1. Generate the CDI specification at /etc/cdi/nvidia.yaml by running sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
  2. Update the permissions of the /etc/cdi/nvidia.yaml file to be world-readable (this will be addressed in the next release): sudo chmod 644 /etc/cdi/nvidia.yaml
  3. Configure the nvidia-container-runtime to use CDI: change the mode = "auto" setting in /etc/nvidia-container-runtime/config.toml to mode = "cdi"
  4. If you're using docker:
    1. Ensure that docker is configured to use the NVIDIA container runtime: sudo nvidia-ctk runtime configure and restart the docker daemon: sudo systemctl restart docker
    2. Run your container using the nvidia runtime. For Docker this would be: docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 {{IMAGE}}
  5. If you're using podman, specify the full path to the NVIDIA Container Runtime as the --runtime: podman run --rm -ti --runtime=/usr/bin/nvidia-container-runtime -e NVIDIA_VISIBLE_DEVICES=0 {{IMAGE}}

Note that NVIDIA_VISIBLE_DEVICES=0 can also be replaced with NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=0 as the NVIDIA Container Runtime in CDI mode will assume the nvidia.com/gpu CDI device class by default.
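
For instance, once the runtime is in CDI mode, these two invocations should be equivalent (a sketch; the image and the device index are only examples):

docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ubuntu nvidia-smi -L
docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=0 ubuntu nvidia-smi -L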

We do need to update our documentation to better describe this process, so please let us know if this is unclear.

nandlab commented 1 year ago

Hi, sorry for the late response.

I followed your steps, but it still does not work.

sudo podman run --rm -ti --runtime=/usr/bin/nvidia-container-runtime NVIDIA_VISIBLE_DEVICES=0 prints

Error: invalid reference format

and exits with code 125.

sudo docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ubuntu prints

docker: Error response from daemon: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices nvidia.com/gpu=0: unknown.

and exits with code 125.

Instead, sudo docker run --rm -ti --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all ubuntu prints

docker: Error response from daemon: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: failed to inject devices: failed to stat CDI host device "/dev/nvidia-uvm": no such file or directory: unknown.

and exits with code 127.

The device /dev/nvidia-uvm indeed does not exist on my machine.

nandlab commented 1 year ago

I recreated /etc/cdi/nvidia.yaml with the newest nvidia-ctk built from the container-toolkit main branch:

INFO[0000] Auto-detected mode as "nvml"                 
INFO[0000] Selecting /dev/nvidia0 as /dev/nvidia0       
INFO[0000] Selecting /dev/dri/card0 as /dev/dri/card0   
INFO[0000] Selecting /dev/dri/renderD128 as /dev/dri/renderD128 
INFO[0000] Using driver version 390.157                 
INFO[0000] Selecting /dev/nvidia-modeset as /dev/nvidia-modeset 
WARN[0000] Could not locate /dev/nvidia-uvm-tools: pattern /dev/nvidia-uvm-tools not found 
WARN[0000] Could not locate /dev/nvidia-uvm: pattern /dev/nvidia-uvm not found 
INFO[0000] Selecting /dev/nvidiactl as /dev/nvidiactl   
WARN[0000] Could not locate libnvidia-egl-gbm.so: 64-bit library libnvidia-egl-gbm.so not found 
INFO[0000] Selecting /usr/share/glvnd/egl_vendor.d/10_nvidia.json as /usr/share/glvnd/egl_vendor.d/10_nvidia.json 
INFO[0000] Selecting /usr/share/vulkan/icd.d/nvidia_icd.json as /usr/share/vulkan/icd.d/nvidia_icd.json 
INFO[0000] Selecting /usr/share/vulkan/implicit_layer.d/nvidia_layers.json as /usr/share/vulkan/implicit_layer.d/nvidia_layers.json 
WARN[0000] Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found 
WARN[0000] Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found 
WARN[0000] Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found 
WARN[0000] Could not locate nvidia/xorg/libglxserver_nvidia.so.390.157: pattern nvidia/xorg/libglxserver_nvidia.so.390.157 not found 
WARN[0000] Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157 
INFO[0000] Selecting /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157 as /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157 
INFO[0000] Selecting /run/nvidia-persistenced/socket as /run/nvidia-persistenced/socket 
WARN[0000] Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found 
WARN[0000] Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found 
WARN[0000] Could not locate /lib/firmware/nvidia/390.157/gsp*.bin: pattern /lib/firmware/nvidia/390.157/gsp*.bin not found 
INFO[0000] Selecting /usr/bin/nvidia-smi as /usr/bin/nvidia-smi 
INFO[0000] Selecting /usr/bin/nvidia-debugdump as /usr/bin/nvidia-debugdump 
INFO[0000] Selecting /usr/bin/nvidia-persistenced as /usr/bin/nvidia-persistenced 
WARN[0000] Could not locate nvidia-cuda-mps-control: pattern nvidia-cuda-mps-control not found 
WARN[0000] Could not locate nvidia-cuda-mps-server: pattern nvidia-cuda-mps-server not found 
WARN[0000] Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found 
WARN[0000] Could not locate nvidia/xorg/libglxserver_nvidia.so.390.157: pattern nvidia/xorg/libglxserver_nvidia.so.390.157 not found 
INFO[0000] Generated CDI spec with version 0.5.0        
cdiVersion: 0.5.0
containerEdits:
  deviceNodes:
  - path: /dev/nvidia-modeset
  - path: /dev/nvidiactl
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  mounts:
  - containerPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    hostPath: /usr/share/glvnd/egl_vendor.d/10_nvidia.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    hostPath: /usr/share/vulkan/icd.d/nvidia_icd.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    hostPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv1_CM_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLESv2_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libcuda.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-cfg.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ml.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libvdpau_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libEGL_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libGLX_nvidia.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvcuvid.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-encode.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157
    hostPath: /usr/lib/x86_64-linux-gnu/nvidia/legacy-390xx/libnvidia-ptxjitcompiler.so.390.157
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /run/nvidia-persistenced/socket
    hostPath: /run/nvidia-persistenced/socket
    options:
    - ro
    - nosuid
    - nodev
    - bind
    - noexec
  - containerPath: /usr/bin/nvidia-smi
    hostPath: /usr/bin/nvidia-smi
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-debugdump
    hostPath: /usr/bin/nvidia-debugdump
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/bin/nvidia-persistenced
    hostPath: /usr/bin/nvidia-persistenced
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card0
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card0::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: "0"
- containerEdits:
    deviceNodes:
    - path: /dev/nvidia0
    - path: /dev/dri/card0
    - path: /dev/dri/renderD128
    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card0::/dev/dri/by-path/pci-0000:01:00.0-card
      - --link
      - ../renderD128::/dev/dri/by-path/pci-0000:01:00.0-render
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
    - args:
      - nvidia-ctk
      - hook
      - chmod
      - --mode
      - "755"
      - --path
      - /dev/dri
      hookName: createContainer
      path: /usr/bin/nvidia-ctk
  name: all
kind: nvidia.com/gpu

Now I can start a Docker container with the nvidia runtime without problems, for example:

sudo docker run -it -h "$HOSTNAME" --ipc=host -e "DISPLAY=$DISPLAY" -v '/tmp/.X11-unix:/tmp/.X11-unix' -v "$HOME/.Xauthority:/root/.Xauthority" --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu

But GUI programs are apparently still software-rendered, e.g. glxgears from mesa-utils uses 50% of the CPU.

elezar commented 1 year ago

@nandlab great news that you were able to get CDI injection working. The reason the hardware renderer is not being used is most likely the missing X libraries listed in the log:

WARN[0000] Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found 
WARN[0000] Could not locate nvidia/xorg/libglxserver_nvidia.so.390.157: pattern nvidia/xorg/libglxserver_nvidia.so.390.157 not found 
WARN[0000] Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found 

Where are these located on your system?
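
Something like the following could help track them down (just a sketch; searching the whole filesystem may take a while):

sudo find / -xdev \( -name 'nvidia_drv.so' -o -name 'libglxserver_nvidia.so*' -o -name '10-nvidia.conf' \) 2>/dev/null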

nandlab commented 1 year ago

Where should I look for these patterns? There are symlinks in many places. In /usr/lib/nvidia there are the symlinks nvidia_drv.so and libglx.so.

I could not find X11/xorg.conf.d/10-nvidia.conf.