NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes
Apache License 2.0

Could not load NVML library: libnvidia-ml.so.1 in K3S cluster #1011

Closed. santurini closed this issue 1 week ago.

santurini commented 1 month ago

I am following @elezar's guide on how to enable MPS in a Kubernetes cluster (I am using k3s). After deploying the gpu-operator, the nvidia-device-plugin-ctr container fails to start.

Similar Issues

This is similar to #478, so I would ask @klueska to take a look as well. I also checked that libnvidia-ml.so.1 is present on the machine, and it is, located at: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
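
For anyone reproducing this check, something like the following locates the library on the host (paths differ per distribution and driver package):

# List NVML entries known to the dynamic linker, then confirm the file itself
ldconfig -p | grep libnvidia-ml
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1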

Executed commands

helm install \
      -n gpu-operator \
      --generate-name \
      --create-namespace \
      --set devicePlugin.enabled=false \
      --set gfd.enabled=false \
      nvidia/gpu-operator
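
Before installing the plugin, the operator components can be checked with something like:

# Confirm the gpu-operator pods are up in the namespace used above
kubectl get pods -n gpu-operator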

cat << EOF > /tmp/dp-mps-10.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
EOF

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.15.0 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set config.default=mps10 \
    --set-file config.map.mps10=/tmp/dp-mps-10.yaml
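
The plugin pod status and the logs below can then be pulled with something along these lines (the label and container names follow the 0.15.0 chart defaults as far as I can tell, so adjust if yours differ):

# Find the device-plugin DaemonSet pod and read its container logs
kubectl get pods -n nvidia-device-plugin
kubectl logs -n nvidia-device-plugin \
    -l app.kubernetes.io/name=nvidia-device-plugin \
    -c nvidia-device-plugin-ctr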

Failing pod logs

I0117 15:43:15.906553 1 main.go:256] Retrieving plugins.
W0117 15:43:15.907261 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0117 15:43:15.907336 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0117 15:43:15.907377 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0117 15:43:15.907388 1 factory.go:115] Incompatible platform detected
E0117 15:43:15.907392 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0117 15:43:15.907396 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0117 15:43:15.907401 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0117 15:43:15.907405 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
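
The "could not load NVML library" line means the plugin container itself cannot see the driver libraries, which usually points at the container runtime wiring rather than the plugin. On a k3s node this can be tested outside Kubernetes with something like the following (the image tag is illustrative; any CUDA base image works):

# Run nvidia-smi through the NVIDIA runtime against k3s's embedded containerd;
# if this also fails, the NVIDIA Container Toolkit setup is the problem,
# not the device plugin.
sudo k3s ctr image pull docker.io/nvidia/cuda:12.2.2-base-ubuntu20.04
sudo k3s ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:12.2.2-base-ubuntu20.04 \
    nvml-test nvidia-smi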

NVIDIA libraries

||/ Name                                  Version                     Architecture Description
+++-=====================================-===========================-============-=====================================================
un  libgldispatch0-nvidia                 <none>                      <none>       (no description available)
ii  libnvidia-cfg1-535-server:amd64       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                    <none>                      <none>       (no description available)
un  libnvidia-compute                     <none>                      <none>       (no description available)
ii  libnvidia-compute-535-server:amd64    535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA libcompute package
ii  libnvidia-container-tools             1.16.2-1                    amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.16.2-1                    amd64        NVIDIA container runtime library
un  libnvidia-decode                      <none>                      <none>       (no description available)
ii  libnvidia-decode-535-server:amd64     535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                      <none>                      <none>       (no description available)
ii  libnvidia-encode-535-server:amd64     535.183.06-0ubuntu0.20.04.1 amd64        NVENC Video Encoding runtime library
un  libnvidia-ml1                         <none>                      <none>       (no description available)
un  nvidia-384                            <none>                      <none>       (no description available)
un  nvidia-390                            <none>                      <none>       (no description available)
un  nvidia-compute-utils                  <none>                      <none>       (no description available)
ii  nvidia-compute-utils-535-server       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA compute utilities
un  nvidia-container-runtime              <none>                      <none>       (no description available)
un  nvidia-container-runtime-hook         <none>                      <none>       (no description available)
ii  nvidia-container-toolkit              1.15.0-1                    amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.15.0-1                    amd64        NVIDIA Container Toolkit Base
ii  nvidia-dkms-535-server                535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA DKMS package
un  nvidia-dkms-kernel                    <none>                      <none>       (no description available)
un  nvidia-driver-535-server              <none>                      <none>       (no description available)
un  nvidia-firmware-535-535.183.06        <none>                      <none>       (no description available)
ii  nvidia-firmware-535-server-535.183.06 535.183.06-0ubuntu0.20.04.1 amd64        Firmware files used by the kernel module
un  nvidia-headless                       <none>                      <none>       (no description available)
ii  nvidia-headless-535-server            535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA headless metapackage
ii  nvidia-headless-no-dkms-535-server    535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA headless metapackage - no DKMS
un  nvidia-kernel-common                  <none>                      <none>       (no description available)
ii  nvidia-kernel-common-535-server       535.183.06-0ubuntu0.20.04.1 amd64        Shared files used with the kernel module
un  nvidia-kernel-source                  <none>                      <none>       (no description available)
ii  nvidia-kernel-source-535-server       535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA kernel source package
un  nvidia-opencl-icd                     <none>                      <none>       (no description available)
un  nvidia-persistenced                   <none>                      <none>       (no description available)
un  nvidia-smi                            <none>                      <none>       (no description available)
un  nvidia-utils                          <none>                      <none>       (no description available)
ii  nvidia-utils-535-server               535.183.06-0ubuntu0.20.04.1 amd64        NVIDIA Server Driver support binaries 

NVIDIA-SMI output

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:07:00.0 Off |                  N/A |
| 30%   36C    P0             114W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:08:00.0 Off |                  N/A |
| 30%   29C    P0             108W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090        Off | 00000000:45:00.0 Off |                  N/A |
| 30%   34C    P0             116W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090        Off | 00000000:46:00.0 Off |                  N/A |
| 30%   39C    P0             110W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 3090        Off | 00000000:89:00.0 Off |                  N/A |
| 30%   36C    P0             111W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 3090        Off | 00000000:8A:00.0 Off |                  N/A |
| 30%   34C    P0             111W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 3090        Off | 00000000:C5:00.0 Off |                  N/A |
| 30%   42C    P0             110W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 3090        Off | 00000000:C6:00.0 Off |                  N/A |
| 30%   35C    P0             108W / 350W |      0MiB / 24576MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Docker config

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Containerd config

version = 2

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-cdi.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-experimental.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia-legacy.options]
            BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy"
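
Note that k3s does not read this file directly: it runs its own embedded containerd and regenerates that config on every start, so the effective runtime setup should be checked there (persistent customizations belong in config.toml.tmpl next to it):

# The containerd config k3s actually uses (rewritten on each k3s restart)
cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml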
santurini commented 1 week ago

@elezar @klueska I re-tried the experiment on an H100 and got the same error when deploying the nvidia-device-plugin: the pod terminated with exactly the same logs. However, if I deploy only the gpu-operator, it deploys successfully and is able to find the NVML library. Could you please help me?

santurini commented 1 week ago

Solved by following #816, but I do not know why it worked.
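
A plausible explanation, assuming the fix in #816 is the usual k3s RuntimeClass approach: k3s registers an "nvidia" containerd runtime when it detects the NVIDIA Container Toolkit, but it does not make it the node default, so without a RuntimeClass the plugin pod runs under plain runc and cannot see libnvidia-ml.so.1. A minimal sketch of that fix:

# RuntimeClass mapping to the "nvidia" runtime k3s registers
# (sketch based on a reading of the linked issue, not a confirmed
# reproduction of what happened in this cluster)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

With a chart version that exposes the value, the plugin can then be scheduled onto that runtime by adding --set runtimeClassName=nvidia to the helm upgrade command above.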