NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

"nvidia-smi": executable file not found in $PATH: unknown #346

Open devriesewouter89 opened 1 year ago

devriesewouter89 commented 1 year ago

1. Issue or feature description

When booting a container on k8s (via k3s) I notice my container doesn't contain "nvidia-smi" in /usr/bin or elsewhere. When I launch the same image/container not via k8s I do get the command.

vex@vex-slave4:~$ kubectl exec -it gpu -- nvidia-smi
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "447b3dd0509b66403603e0c66fa7c524259d111afc3db4c41ce59498d58bb8c6": OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown

2. Steps to reproduce the issue

my deployment yaml:

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1

Using different base images doesn't change the issue. Yet the weird thing is, if I run the same base image directly via ctr, the nvidia-smi command is recognized: sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.4.1-base-ubuntu18.04 cuda-11.4.1-base-ubuntu18.04 nvidia-smi returns

Tue Nov 22 15:10:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K2200        On   | 00000000:01:00.0 Off |                  N/A |
| 42%   42C    P8     1W /  39W |      1MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+                                                                         
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3. Information to attach (optional if deemed irrelevant)

Common error checking:


 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)

{ "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } } }

 - [ ] The k8s-device-plugin container logs

 - [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)
 - [ ] pod description

Name:             gpu
Namespace:        default
Priority:         0
Service Account:  default
Node:
Start Time:       Tue, 22 Nov 2022 15:28:31 +0100
Labels:
Annotations:
Status:           Running
IP:
IPs:
  IP:
Containers:
  gpu:
    Container ID:   containerd://68707cec263eb1bfaec27357d9f6c07b2545278183fe875dd5f43ea5de77c1b3
    Image:          nvidia/cuda:11.4.1-base-ubuntu20.04
    Image ID:       docker.io/nvidia/cuda@sha256:a838c93bcb191de297b04a04b6dc8a7c50983243562201a8d057f3ccdb1e7276
    Port:
    Host Port:
    Command:
      /bin/bash
      -c
    Args:
      while true; do sleep 30; done;
    State:          Running
      Started:      Tue, 22 Nov 2022 15:28:35 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wtv6z (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wtv6z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  Normal  Scheduled  47m   default-scheduler  Successfully assigned default/gpu to vex-slave5
  Normal  Pulling    47m   kubelet            Pulling image "nvidia/cuda:11.4.1-base-ubuntu20.04"
  Normal  Pulled     47m   kubelet            Successfully pulled image "nvidia/cuda:11.4.1-base-ubuntu20.04" in 3.297786146s
  Normal  Created    47m   kubelet            Created container gpu
  Normal  Started    47m   kubelet            Started container gpu

Additional information that might help better understand your environment and reproduce the bug:
 - [ ] Docker version from `docker version`

Client:
 Version:           20.10.5+dfsg1
 API version:       1.41
 Go version:        go1.15.15
 Git commit:        55c4c88
 Built:             Mon May 30 18:34:49 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?


 - [ ] Docker command, image and tag used
 - [ ] Kernel version from `uname -a`
 `Linux vex-slave4 5.10.0-19-amd64 #1 SMP Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux`
 - [ ] Any relevant kernel output lines from `dmesg`

[ 1954.607181] cni0: port 1(vethe2dc367a) entered disabled state
[ 1954.608114] device vethe2dc367a left promiscuous mode
[ 1954.608118] cni0: port 1(vethe2dc367a) entered disabled state
[ 1957.373344] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.373346] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.374365] device vethf4f0a873 entered promiscuous mode
[ 1957.375452] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.375454] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.376797] device veth01e926e2 entered promiscuous mode
[ 1957.381302] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.381634] IPv6: ADDRCONF(NETDEV_CHANGE): vethf4f0a873: link becomes ready
[ 1957.381705] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.381706] cni0: port 1(vethf4f0a873) entered forwarding state
[ 1957.383274] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.383580] IPv6: ADDRCONF(NETDEV_CHANGE): veth01e926e2: link becomes ready
[ 1957.383648] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.383650] cni0: port 2(veth01e926e2) entered forwarding state
[ 1957.570109] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.570963] device vethf4f0a873 left promiscuous mode
[ 1957.570966] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.602816] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.603670] device veth01e926e2 left promiscuous mode
[ 1957.603672] cni0: port 2(veth01e926e2) entered disabled state

 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version              Architecture Description
+++-======================================-====================-============-=================================================================
un  bumblebee-nvidia (no description available)
un  firmware-nvidia-gsp (no description available)
un  firmware-nvidia-gsp-470.141.03 (no description available)
ii  glx-alternative-nvidia 1.2.1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un  libegl-nvidia-legacy-390xx0 (no description available)
un  libegl-nvidia-tesla-418-0 (no description available)
un  libegl-nvidia-tesla-450-0 (no description available)
un  libegl-nvidia-tesla-470-0 (no description available)
ii  libegl-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL library
un  libegl1-glvnd-nvidia (no description available)
un  libegl1-nvidia (no description available)
un  libgl1-glvnd-nvidia-glx (no description available)
ii  libgl1-nvidia-glvnd-glx:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
un  libgl1-nvidia-glx (no description available)
un  libgl1-nvidia-glx-any (no description available)
un  libgl1-nvidia-glx-i386 (no description available)
un  libgl1-nvidia-legacy-390xx-glx (no description available)
un  libgl1-nvidia-tesla-418-glx (no description available)
un  libgldispatch0-nvidia (no description available)
ii  libgles-nvidia1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 2.x library
un  libgles1-glvnd-nvidia (no description available)
un  libgles2-glvnd-nvidia (no description available)
un  libglvnd0-nvidia (no description available)
ii  libglx-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary GLX library
un  libglx0-glvnd-nvidia (no description available)
ii  libnvidia-cbl:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (cbl) library
un  libnvidia-cbl-470.141.03 (no description available)
un  libnvidia-cfg.so.1 (no description available)
ii  libnvidia-cfg1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any (no description available)
ii  libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
ii  libnvidia-egl-wayland1:amd64 1:1.1.5-1 amd64 Wayland EGL External Platform library -- shared library
ii  libnvidia-eglcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL core libraries
un  libnvidia-eglcore-470.141.03 (no description available)
ii  libnvidia-encode1:amd64 470.141.03-1~deb11u1 amd64 NVENC Video Encoding runtime library
un  libnvidia-gl-390 (no description available)
un  libnvidia-gl-410 (no description available)
ii  libnvidia-glcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX core libraries
un  libnvidia-glcore-470.141.03 (no description available)
ii  libnvidia-glvkspirv:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan Spir-V compiler library
un  libnvidia-glvkspirv-470.141.03 (no description available)
un  libnvidia-legacy-340xx-cfg1 (no description available)
un  libnvidia-legacy-390xx-cfg1 (no description available)
un  libnvidia-legacy-390xx-egl-wayland1 (no description available)
un  libnvidia-ml.so.1 (no description available)
ii  libnvidia-ml1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Management Library (NVML) runtime library
ii  libnvidia-ptxjitcompiler1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA PTX JIT Compiler library
ii  libnvidia-rtcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
un  libnvidia-rtcore-470.141.03 (no description available)
un  libnvidia-tesla-418-cfg1 (no description available)
un  libnvidia-tesla-450-cfg1 (no description available)
un  libnvidia-tesla-470-cfg1 (no description available)
un  libnvidia-tesla-510-cfg1 (no description available)
un  libopengl0-glvnd-nvidia (no description available)
ii  nvidia-alternative 470.141.03-1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un  nvidia-alternative--kmod-alias (no description available)
un  nvidia-alternative-any (no description available)
un  nvidia-alternative-legacy-173xx (no description available)
un  nvidia-alternative-legacy-71xx (no description available)
un  nvidia-alternative-legacy-96xx (no description available)
un  nvidia-container-runtime (no description available)
un  nvidia-container-runtime-hook (no description available)
ii  nvidia-container-toolkit 1.11.0-1 amd64 NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.11.0-1 amd64 NVIDIA Container Toolkit Base
un  nvidia-cuda-mps (no description available)
un  nvidia-current (no description available)
un  nvidia-current-updates (no description available)
ii  nvidia-detect 470.141.03-1~deb11u1 amd64 NVIDIA GPU detection utility
un  nvidia-docker (no description available)
ii  nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii  nvidia-driver 470.141.03-1~deb11u1 amd64 NVIDIA metapackage
un  nvidia-driver-any (no description available)
ii  nvidia-driver-bin 470.141.03-1~deb11u1 amd64 NVIDIA driver support binaries
un  nvidia-driver-bin-470.141.03 (no description available)
un  nvidia-driver-binary (no description available)
ii  nvidia-driver-libs:amd64 470.141.03-1~deb11u1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un  nvidia-driver-libs-any (no description available)
un  nvidia-driver-libs-nonglvnd (no description available)
ii  nvidia-egl-common 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA EGL installable client driver (ICD)
un  nvidia-egl-wayland-common (no description available)
un  nvidia-glx-any (no description available)
ii  nvidia-installer-cleanup 20151021+13 amd64 cleanup after driver installation with the nvidia-installer
un  nvidia-kernel-470.141.03 (no description available)
ii  nvidia-kernel-common 20151021+13 amd64 NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module DKMS source
un  nvidia-kernel-source (no description available)
ii  nvidia-kernel-support 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module support files
un  nvidia-kernel-support--v1 (no description available)
un  nvidia-kernel-support-any (no description available)
un  nvidia-legacy-304xx-alternative (no description available)
un  nvidia-legacy-304xx-driver (no description available)
un  nvidia-legacy-340xx-alternative (no description available)
un  nvidia-legacy-390xx-vulkan-icd (no description available)
ii  nvidia-legacy-check 470.141.03-1~deb11u1 amd64 check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe 470.103.01-1~deb11u1 amd64 utility to load NVIDIA kernel modules and create device nodes
un  nvidia-nonglvnd-vulkan-common (no description available)
un  nvidia-nonglvnd-vulkan-icd (no description available)
ii  nvidia-persistenced 470.103.01-2~deb11u1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-settings 470.141.03-1~deb11u1 amd64 tool for configuring the NVIDIA graphics driver
un  nvidia-settings-gtk-470.141.03 (no description available)
ii  nvidia-smi 470.141.03-1~deb11u1 amd64 NVIDIA System Management Interface
ii  nvidia-support 20151021+13 amd64 NVIDIA binary graphics driver support files
un  nvidia-tesla-418-vulkan-icd (no description available)
un  nvidia-tesla-450-vulkan-icd (no description available)
un  nvidia-tesla-470-vulkan-icd (no description available)
un  nvidia-tesla-alternative (no description available)
ii  nvidia-vdpau-driver:amd64 470.141.03-1~deb11u1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan installable client driver (ICD)
un  nvidia-vulkan-icd-any (no description available)
ii  xserver-xorg-video-nvidia 470.141.03-1~deb11u1 amd64 NVIDIA binary Xorg driver
un  xserver-xorg-video-nvidia-any (no description available)
un  xserver-xorg-video-nvidia-legacy-304xx (no description available)


 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
klueska commented 1 year ago

I'm assuming you are using containerd, not docker, as the runtime configured for Kubernetes (that has been the default since v1.20).

Do you have nvidia set up as your default runtime for containerd, as described here? https://github.com/NVIDIA/k8s-device-plugin#configure-containerd

The path used by ctr and the way Kubernetes hooks into containerd are different, so the fact that it works under ctr doesn't mean it will work under k8s. You need to have containerd's CRI plugin configured to use the nvidia runtime by default, as described in the link above.
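
For reference, "use the nvidia runtime by default" means a containerd CRI config roughly along these lines (a sketch based on the linked README; the file location, e.g. /etc/containerd/config.toml, and the BinaryName path are assumptions that depend on your installation, and k3s generates its own config as noted in the next comment):

version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

After editing the file, restart containerd (e.g. sudo systemctl restart containerd) for the change to take effect.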

elezar commented 1 year ago

@devriesewouter89 note that k3s uses a specific containerd config template and configures the NVIDIA Container Runtime at startup if it is installed on the system. Note that this doesn't set it as the default runtime. One option is to use a RuntimeClass when launching pods that are supposed to have access to GPUs.
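
A quick way to check what k3s generated (the path below is the default k3s agent location and is an assumption; adjust it for your install):

# Look for the nvidia runtime entry in the containerd config that k3s wrote at startup.
# If nothing shows up, the NVIDIA Container Toolkit was not detected when k3s started.
sudo grep -A3 'nvidia' /var/lib/rancher/k3s/agent/etc/containerd/config.toml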

elezar commented 1 year ago

See also https://github.com/NVIDIA/k8s-device-plugin/issues/306

AgentScrubbles commented 8 months ago

Throwing in my comment: exact same use case, and I've been in the OP's exact shoes. If anyone stumbles on this, make sure you follow the docs and create a runtime class for NVIDIA. @elezar is right, K3S will do most of the hookups for you; you no longer need to modify TOMLs or templates for NVIDIA, but you do need to create a runtime class like:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Then for your containers you can use that class specifically as:

spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
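
Putting it together with the pod from the original report, the full spec would look roughly like this (assuming the nvidia RuntimeClass above has already been applied to the cluster):

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  runtimeClassName: nvidia   # must match the RuntimeClass created above
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1
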
Lodeon commented 4 months ago


Thank you so much! You just ended my hours-long search. I appreciate you taking the time to help us newbies out.