NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes
Apache License 2.0

k8s-dra-driver-kubelet-plugin pod failed to run #65

Closed: cyclinder closed this issue 6 months ago

cyclinder commented 6 months ago

Hi team, I followed the demo in the README, but the k8s-dra-driver-kubelet-plugin pod fails to run with Error: failed to create device library: failed to locate driver libraries: error locating "libnvidia-ml.so.1"

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# kubectl get po -n nvidia-dra-driver
NAME                                               READY   STATUS             RESTARTS        AGE
nvidia-k8s-dra-driver-controller-6d6b45756-js9st   1/1     Running            0               9m12s
nvidia-k8s-dra-driver-kubelet-plugin-xckkd         0/1     CrashLoopBackOff   6 (3m29s ago)   9m12s
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# kubectl logs -f -n nvidia-dra-driver nvidia-k8s-dra-driver-kubelet-plugin-xckkd
Defaulted container "plugin" out of: plugin, init (init)
Error: failed to create device library: failed to locate driver libraries: error locating "libnvidia-ml.so.1"

Inside the plugin pod, only the fully versioned library is visible under /driver-root, with no libnvidia-ml.so.1 symlink:

root@nvidia-k8s-dra-driver-kubelet-plugin-s9fb7:/driver-root/usr/lib/x86_64-linux-gnu# ll libnvidia-ml.so.545.23.08
-rw-r--r-- 1 root root 1992128 Nov  6 23:23 libnvidia-ml.so.545.23.08
cyclinder commented 6 months ago

On the host, only the fully versioned library exists, so I created the .so.1 symlink manually:

root@10-20-1-20:/# find / -type f -name "libnvidia-ml.*"
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
root@10-20-1-20:/# ll /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
-rw-r--r-- 1 root root 1992128 Nov  6 23:23 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
root@10-20-1-20:/# ln -sf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
root@10-20-1-20:/usr/lib/x86_64-linux-gnu# ll libnvidia-ml.so.1
lrwxrwxrwx 1 root root 51 Jan 29 20:40 libnvidia-ml.so.1 -> /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
klueska commented 6 months ago

Are you able to run nvidia-smi on the host? Are you also able to docker exec into the docker container representing the k8s worker node and run nvidia-smi?

cyclinder commented 6 months ago

When I tried to create the symbolic link from inside the pod, it failed:

root@nvidia-k8s-dra-driver-kubelet-plugin-s9fb7:/# ln -sf /driver-root/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /driver-root/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
ln: failed to create symbolic link '/driver-root/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08': Read-only file system
cyclinder commented 6 months ago

Thanks @klueska

nvidia-smi on the host:

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-smi
Mon Jan 29 22:51:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       On  | 00000000:2F:00.0 Off |                  Off |
| N/A   29C    P8               6W /  75W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS          PORTS                       NAMES
c032bf344b55   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   22 minutes ago   Up 22 minutes                               k8s-dra-driver-cluster-worker
f30e7648a752   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   22 minutes ago   Up 22 minutes   127.0.0.1:16681->6443/tcp   k8s-dra-driver-cluster-control-plane
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker exec -it c032bf344b55 bash
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
cyclinder commented 6 months ago

On the kind worker node:

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker exec -it c032bf344b55 bash
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/# ls /usr/
bin/     games/   include/ lib/     libexec/ local/   sbin/    share/   src/
root@k8s-dra-driver-cluster-worker:/# ls /usr/lib
lib/     libexec/
root@k8s-dra-driver-cluster-worker:/# ll /usr/lib/x86_64-linux-gnu/libn
libnetfilter_conntrack.so.3             libnftnl.so.11                          libnvidia-allocator.so.545.23.08        libnvidia-pkcs11-openssl3.so.545.23.08
libnetfilter_conntrack.so.3.7.0         libnftnl.so.11.5.0                      libnvidia-cfg.so.545.23.08              libnvidia-pkcs11.so.545.23.08
libnettle.so.8                          libnghttp2.so.14                        libnvidia-gpucomp.so.545.23.08          libnvidia-ptxjitcompiler.so.545.23.08
libnettle.so.8.4                        libnghttp2.so.14.20.1                   libnvidia-ml.so.545.23.08
libnfnetlink.so.0                       libnsl.so.2                             libnvidia-nvvm.so.545.23.08
libnfnetlink.so.0.2.0                   libnsl.so.2.0.1                         libnvidia-opencl.so.545.23.08
root@k8s-dra-driver-cluster-worker:/# ll /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
bash: ll: command not found
root@k8s-dra-driver-cluster-worker:/# ls /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.23.08
klueska commented 6 months ago

OK, so it seems the node that kind starts up does not have things set up properly (which is odd, since the libraries are clearly being injected into it). @elezar any idea what could be going on?
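
(A diagnostic not run in the thread, but consistent with the listing above: the versioned driver libraries are present in the worker node, yet there are no .so.1 symlinks, which is the pattern you would expect if ldconfig never ran inside the node during injection. A quick check from the host, assuming the worker container is still named k8s-dra-driver-cluster-worker:)

# Ask the node's dynamic linker whether it knows about libnvidia-ml at all,
# and list whatever libnvidia-ml files exist on disk.
docker exec -it k8s-dra-driver-cluster-worker bash -c \
  'ldconfig -p | grep libnvidia-ml; ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*'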

elezar commented 6 months ago

There has been at least one user reporting issues with the v1.14.4 toolkit. Which version of the NVIDIA Container Toolkit is installed on the host?

cyclinder commented 6 months ago

Hi @elezar, do you mean nvidia-container-toolkit?

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1
elezar commented 6 months ago

The behaviour is similar to that described in https://github.com/NVIDIA/nvidia-container-toolkit/issues/305.

Please downgrade the NVIDIA Container Toolkit by running:

sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
        nvidia-container-toolkit-base=1.14.3-1 \
        libnvidia-container-tools=1.14.3-1 \
        libnvidia-container1=1.14.3-1

for the time being. I will report back with some findings.
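
(Not part of the original suggestion: if you also want apt to keep these packages at 1.14.3 until a fixed release is available, they can be put on hold:)

sudo apt-mark hold nvidia-container-toolkit \
        nvidia-container-toolkit-base \
        libnvidia-container-tools \
        libnvidia-container1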

cyclinder commented 6 months ago

@elezar I see the same issue as described in that report.

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS          PORTS                       NAMES
282fd27f597b   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   51 seconds ago   Up 48 seconds   127.0.0.1:12887->6443/tcp   k8s-dra-driver-cluster-control-plane
17851e9aeeb4   kindest/node:v1.27.1-v20230515-01914134-containerd_v1.7.1   "/usr/local/bin/entr…"   51 seconds ago   Up 49 seconds                               k8s-dra-driver-cluster-worker
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker exec -it 17851e9aeeb4 bash
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/# nvidia-smi -l
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/#
root@k8s-dra-driver-cluster-worker:/# exit
exit
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-container-
nvidia-container-cli           nvidia-container-runtime       nvidia-container-runtime-hook  nvidia-container-toolkit
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.14.3
commit: 53b24618a542025b108239fe602e66e912b7d6e2
elezar commented 6 months ago

OK. I have just created a kind cluster using toolkit v1.14.4, and it does have access to the devices.

Can we confirm that the toolkit is configured correctly:

  1. Please upgrade to v1.14.4 to ensure that we're running the same version.
  2. Provide the output of nvidia-ctk config
  3. Provide the output of nvidia-ctk runtime configure --dry-run
  4. Provide the output of docker info | grep Runtime

Also for completeness, please confirm that you're running:

./demo/clusters/kind/create-cluster.sh

in an unmodified local copy of this repository.

cyclinder commented 6 months ago

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.14.4
commit: d167812ce3a55ec04ae2582eff1654ec812f42e1
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# ls
api  cmd  common.mk  CONTRIBUTING.md  demo  deployments  go.mod  go.sum  hack  internal  LICENSE  Makefile  pkg  README.md  templates  vendor  versions.mk

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-ctk config
accept-nvidia-visible-devices-as-volume-mounts = true
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = false
#path = "/usr/bin/nvidia-container-cli"
#root = "/run/nvidia/driver"
#user = "root:video"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
mode = "auto"
runtimes = ["docker-runc", "runc", "crun"]

[nvidia-container-runtime.modes]

[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

[nvidia-container-runtime-hook]
path = "nvidia-container-runtime-hook"
skip-mode-detection = false

[nvidia-ctk]
path = "nvidia-ctk"

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# nvidia-ctk runtime configure --dry-run
INFO[0000] Loading config from /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
root@10-20-1-20:/home/cyclinder/k8s-dra-driver# docker info | grep Runtime
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
cyclinder commented 6 months ago

Yeah, I did indeed modify some local files:

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# git diff demo/clusters/kind/scripts/common.sh
diff --git a/demo/clusters/kind/scripts/common.sh b/demo/clusters/kind/scripts/common.sh
index 2398ed4..6ed1320 100644
--- a/demo/clusters/kind/scripts/common.sh
+++ b/demo/clusters/kind/scripts/common.sh
@@ -30,7 +30,7 @@ DRIVER_IMAGE_REGISTRY=$(from_versions_mk "REGISTRY")
 DRIVER_IMAGE_VERSION=$(from_versions_mk "VERSION")

 : ${DRIVER_IMAGE_NAME:=${DRIVER_NAME}}
-: ${DRIVER_IMAGE_PLATFORM:="ubuntu20.04"}
+: ${DRIVER_IMAGE_PLATFORM:="ubuntu22.04"}
 : ${DRIVER_IMAGE_TAG:=${DRIVER_IMAGE_VERSION}}
 # The derived name of the driver image to build
 : ${DRIVER_IMAGE:="${DRIVER_IMAGE_REGISTRY}/${DRIVER_IMAGE_NAME}:${DRIVER_IMAGE_TAG}"}
@@ -47,6 +47,6 @@ DRIVER_IMAGE_VERSION=$(from_versions_mk "VERSION")

 # The derived name of the kind image to build
 : ${KIND_IMAGE_BASE_TAG:="v20230515-01914134-containerd_v1.7.1"}
-: ${KIND_IMAGE_BASE:="gcr.io/k8s-staging-kind/base:${KIND_IMAGE_BASE_TAG}"}
+: ${KIND_IMAGE_BASE:="gcr.m.daocloud.io/k8s-staging-kind/base:${KIND_IMAGE_BASE_TAG}"}
 : ${KIND_IMAGE:="kindest/node:${KIND_K8S_TAG}-${KIND_IMAGE_BASE_TAG}"}

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# git diff demo/clusters/kind/scripts/kind-cluster-config.yaml
diff --git a/demo/clusters/kind/scripts/kind-cluster-config.yaml b/demo/clusters/kind/scripts/kind-cluster-config.yaml
index d2f03bc..e2211f3 100644
--- a/demo/clusters/kind/scripts/kind-cluster-config.yaml
+++ b/demo/clusters/kind/scripts/kind-cluster-config.yaml
@@ -1,33 +1,17 @@
-# Copyright 2023 The Kubernetes Authors.
-# Copyright 2023 NVIDIA CORPORATION.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-kind: Cluster
 apiVersion: kind.x-k8s.io/v1alpha4
-featureGates:
-  DynamicResourceAllocation: true
 containerdConfigPatches:
-# Enable CDI as described in
-# https://tags.cncf.io/container-device-interface#containerd-configuration
 - |-
   [plugins."io.containerd.grpc.v1.cri"]
     enable_cdi = true
+    sandbox_image = "k8s.m.daocloud.io/pause:3.7"
+featureGates:
+  DynamicResourceAllocation: true
+kind: Cluster
 nodes:
-- role: control-plane
-  kubeadmConfigPatches:
+- kubeadmConfigPatches:
   - |
     kind: ClusterConfiguration
+    imageRepository: k8s.m.daocloud.io
     apiServer:
         extraArgs:
           runtime-config: "resource.k8s.io/v1alpha2=true"
@@ -42,6 +26,7 @@ nodes:
     nodeRegistration:
       kubeletExtraArgs:
         v: "1"
+  role: control-plane
 - role: worker
   kubeadmConfigPatches:
   - |
@@ -62,3 +47,4 @@ nodes:
elezar commented 6 months ago

The key to ensuring injection of the driver into the worker nodes is adding:

  extraMounts:
  # We inject all NVIDIA GPUs using the nvidia-container-runtime.
  # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
  # in `/etc/nvidia-container-runtime/config.toml`
  - hostPath: /dev/null
    containerPath: /var/run/nvidia-container-devices/all

to the node config. Note that at the moment we also mount nvidia-ctk from the host, but this should ideally be installed in the kind worker node.
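
(For context, and not something run in this thread: the same volume-mount trick can be exercised directly with docker, independent of kind. A minimal sketch, assuming accept-nvidia-visible-devices-as-volume-mounts = true is set and the nvidia runtime is configured as above:)

# Mounting any file at /var/run/nvidia-container-devices/all asks the nvidia
# runtime to inject all GPUs, instead of relying on NVIDIA_VISIBLE_DEVICES.
docker run --rm --runtime=nvidia \
  -v /dev/null:/var/run/nvidia-container-devices/all \
  ubuntu:22.04 nvidia-smi -L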

Do you have an extraMounts section in your kind config for your worker node?

cyclinder commented 6 months ago

Yeah, I made a point of making sure it was added:

root@10-20-1-20:/home/cyclinder/k8s-dra-driver# cat demo/clusters/kind/scripts/kind-cluster-config.yaml
apiVersion: kind.x-k8s.io/v1alpha4
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri"]
    enable_cdi = true
    sandbox_image = "k8s.m.daocloud.io/pause:3.7"
featureGates:
  DynamicResourceAllocation: true
kind: Cluster
nodes:
- kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    imageRepository: k8s.m.daocloud.io
    apiServer:
        extraArgs:
          runtime-config: "resource.k8s.io/v1alpha2=true"
    scheduler:
        extraArgs:
          v: "1"
    controllerManager:
        extraArgs:
          v: "1"
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "1"
  role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        v: "1"
  extraMounts:
  # We inject all NVIDIA GPUs using the nvidia-container-runtime.
  # This requires `accept-nvidia-visible-devices-as-volume-mounts = true` be set
  # in `/etc/nvidia-container-runtime/config.toml`
  - hostPath: /dev/null
    containerPath: /var/run/nvidia-container-devices/all
  # The generated CDI specification assumes that `nvidia-ctk` is available on a
  # node -- specifically for the `nvidia-ctk hook` subcommand. As a workaround,
  # we mount it from the host.
  # TODO: Remove this once we have a more stable solution to make `nvidia-ctk`
  # on the kind nodes.
  - hostPath: /usr/bin/nvidia-ctk
    containerPath: /usr/bin/nvidia-ctk
cyclinder commented 6 months ago

It works now, after I updated the configuration in /etc/nvidia-container-runtime/config.toml from ldconfig = "/sbin/ldconfig.real" to ldconfig = "@/sbin/ldconfig.real".
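
(For reference, a minimal way to apply and verify that one-line change, assuming the stock config shown earlier in this thread:)

# The leading '@' makes the toolkit run the host's ldconfig binary instead of
# looking for one inside the container; that step is what (re)creates the
# .so.1 symlinks and linker-cache entries the plugin was failing to find.
sudo sed -i 's|^ldconfig = "/sbin/ldconfig.real"|ldconfig = "@/sbin/ldconfig.real"|' \
    /etc/nvidia-container-runtime/config.toml
grep '^ldconfig' /etc/nvidia-container-runtime/config.toml

(The kind cluster presumably needs to be recreated afterwards so that the injection runs again with the new setting.)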

Let me close this now. Thanks for the help @elezar @klueska

elezar commented 6 months ago

It works now, after I updated the configuration in /etc/nvidia-container-runtime/config.toml from ldconfig = "/sbin/ldconfig.real" to ldconfig = "@/sbin/ldconfig.real".

Let me close this now. Thanks for the help @elezar @klueska

Thanks for the update @cyclinder. Do you know why your config file was referring to /sbin/ldconfig.real and not @/sbin/ldconfig.real? (Which distribution are you using?)

cyclinder commented 6 months ago

I'm using Ubuntu 22.04, but I don't know why it changed. The default value should be @/sbin/ldconfig.real, right?

elezar commented 6 months ago

Yes, the value should be @/sbin/ldconfig.real on Ubuntu-based systems. Some logic was added to better detect this value when installing the packages, and it could be that an edge case caused this to fail. I will look at our changes and assess whether there is a bug that needs addressing.

cyclinder commented 6 months ago

Thanks for the work! How are these files generated, and who reads them? Where can I find those details? I'm trying to understand it :)

elezar commented 6 months ago

@cyclinder the config file is installed as part of the nvidia-container-toolkit-base package. Starting with the 1.14.0 release, it is generated for the target distribution instead of being maintained separately for each supported distribution.

For Debian / Ubuntu packages, that is defined here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/2f3600af9aa46afc84f0e422cba75f9f6e884c21/packaging/debian/nvidia-container-toolkit-base.postinst#L7
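
(A side note, not from the thread: the .real suffix exists because Ubuntu ships /sbin/ldconfig as a wrapper script around the real binary, which is why the expected value points at ldconfig.real. A quick way to see this on an Ubuntu host:)

# On Ubuntu, /sbin/ldconfig is a shell wrapper and ldconfig.real is the actual binary.
file /sbin/ldconfig /sbin/ldconfig.real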

cyclinder commented 6 months ago

Thanks for the explanation @elezar! I'm going to take the time to learn this.