devriesewouter89 opened 1 year ago
I'm assuming you are using containerd, not docker, as the runtime configured for Kubernetes (containerd has been the default since v1.20). Do you have nvidia set up as your default runtime for containerd, as described here:
https://github.com/NVIDIA/k8s-device-plugin#configure-containerd

The path used by ctr and the way Kubernetes hooks into containerd are different, so the fact that something works under ctr doesn't mean it will work under k8s. You need to have containerd's CRI plugin configured to use the nvidia runtime by default, as described in the link above.
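For reference, making nvidia the default runtime in containerd's CRI plugin typically looks like the sketch below. The file path and section names are the common containerd 1.6.x layout and vary between containerd versions, so verify them against your install and the linked docs:

```toml
# /etc/containerd/config.toml (containerd 1.6.x layout -- section names
# differ across containerd versions; check against your installation)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Make the nvidia runtime the default for all CRI-launched containers
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      # Path assumes the NVIDIA Container Toolkit installed its default binary
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

After editing, restart containerd (e.g. `systemctl restart containerd`) for the change to take effect.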
@devriesewouter89 note that k3s uses a specific containerd config template and configures the NVIDIA Container Runtime at startup if it is installed on the system. Note that this doesn't set the default runtime. One option is to use a RuntimeClass when launching pods that are supposed to have access to GPUs.
Throwing in my comment, exact same use case and have been in OP's exact shoes. If anyone stumbles on this, make sure you follow the docs and create a runtime class for Nvidia. @elezar is right, K3S will do most of the hookups for you, you no longer need to modify tomls or templates for nvidia, but you do need to create a runtime class like:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```
Then for your containers you can use that class specifically as:
```yaml
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
```
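Putting the two pieces together, a complete Pod manifest might look like the sketch below. The pod name, image tag, and the `nvidia.com/gpu` resource request are illustrative; the resource request additionally assumes the NVIDIA device plugin is deployed in the cluster:

```yaml
# Illustrative Pod using the nvidia RuntimeClass defined above.
# The nvidia.com/gpu limit requires the NVIDIA device plugin to be running.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda
      image: nvidia/cuda:11.4.1-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```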
Thank you so much! You just ended my hours-long search. I appreciate you taking the time to help us newbies out.
1. Issue or feature description
When booting a container on k8s (via k3s), I notice the container doesn't contain nvidia-smi in /usr/bin or anywhere else. When I launch the same image/container outside of k8s, the command is present.
2. Steps to reproduce the issue
my deployment yaml:
Using different base images doesn't change the issue. Yet the weird thing is, if I run the same base image directly via ctr, the nvidia-smi command is recognized:

```
sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.4.1-base-ubuntu18.04 cuda-11.4.1-base-ubuntu18.04 nvidia-smi
```

returns the expected nvidia-smi output.

3. Information to attach (optional if deemed irrelevant)
Common error checking:
nvidia-smi -a on your host:

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
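To reproduce the ctr check under k8s itself rather than on the host, one option is a throwaway pod. This is a sketch: it assumes the `nvidia` RuntimeClass from the comments above has been created, and the image tag mirrors the ctr command:

```shell
# One-off in-cluster check: run nvidia-smi in a pod that uses the
# nvidia RuntimeClass (assumes that RuntimeClass already exists)
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvidia/cuda:11.4.1-base-ubuntu18.04 \
  --overrides='{"spec":{"runtimeClassName":"nvidia"}}' \
  -- nvidia-smi
```

If this prints the driver/GPU table while the plain pod does not, the problem is the runtime selection, not the image.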
```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
```
Name:             gpu
Namespace:        default
Priority:         0
Service Account:  default
Node:
Start Time:       Tue, 22 Nov 2022 15:28:31 +0100
Labels:
Annotations:
Status:           Running
IP:
IPs:
  IP:
Containers:
  gpu:
    Container ID:  containerd://68707cec263eb1bfaec27357d9f6c07b2545278183fe875dd5f43ea5de77c1b3
    Image:         nvidia/cuda:11.4.1-base-ubuntu20.04
    Image ID:      docker.io/nvidia/cuda@sha256:a838c93bcb191de297b04a04b6dc8a7c50983243562201a8d057f3ccdb1e7276
    Port:
    Host Port:
    Command:
      /bin/bash
      -c
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wtv6z:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age  From               Message
  ----    ------     ---  ----               -------
  Normal  Scheduled  47m  default-scheduler  Successfully assigned default/gpu to vex-slave5
  Normal  Pulling    47m  kubelet            Pulling image "nvidia/cuda:11.4.1-base-ubuntu20.04"
  Normal  Pulled     47m  kubelet            Successfully pulled image "nvidia/cuda:11.4.1-base-ubuntu20.04" in 3.297786146s
  Normal  Created    47m  kubelet            Created container gpu
  Normal  Started    47m  kubelet            Started container gpu
```
```
Client:
 Version:       20.10.5+dfsg1
 API version:   1.41
 Go version:    go1.15.15
 Git commit:    55c4c88
 Built:         Mon May 30 18:34:49 2022
 OS/Arch:       linux/amd64
 Context:       default
 Experimental:  true
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
```
```
[ 1954.607181] cni0: port 1(vethe2dc367a) entered disabled state
[ 1954.608114] device vethe2dc367a left promiscuous mode
[ 1954.608118] cni0: port 1(vethe2dc367a) entered disabled state
[ 1957.373344] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.373346] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.374365] device vethf4f0a873 entered promiscuous mode
[ 1957.375452] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.375454] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.376797] device veth01e926e2 entered promiscuous mode
[ 1957.381302] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.381634] IPv6: ADDRCONF(NETDEV_CHANGE): vethf4f0a873: link becomes ready
[ 1957.381705] cni0: port 1(vethf4f0a873) entered blocking state
[ 1957.381706] cni0: port 1(vethf4f0a873) entered forwarding state
[ 1957.383274] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1957.383580] IPv6: ADDRCONF(NETDEV_CHANGE): veth01e926e2: link becomes ready
[ 1957.383648] cni0: port 2(veth01e926e2) entered blocking state
[ 1957.383650] cni0: port 2(veth01e926e2) entered forwarding state
[ 1957.570109] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.570963] device vethf4f0a873 left promiscuous mode
[ 1957.570966] cni0: port 1(vethf4f0a873) entered disabled state
[ 1957.602816] cni0: port 2(veth01e926e2) entered disabled state
[ 1957.603670] device veth01e926e2 left promiscuous mode
[ 1957.603672] cni0: port 2(veth01e926e2) entered disabled state
```
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version              Architecture Description
+++-======================================-====================-============-=================================================================
un bumblebee-nvidia (no description available)
un firmware-nvidia-gsp (no description available)
un firmware-nvidia-gsp-470.141.03 (no description available)
ii glx-alternative-nvidia 1.2.1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un libegl-nvidia-legacy-390xx0 (no description available)
un libegl-nvidia-tesla-418-0 (no description available)
un libegl-nvidia-tesla-450-0 (no description available)
un libegl-nvidia-tesla-470-0 (no description available)
ii libegl-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL library
un libegl1-glvnd-nvidia (no description available)
un libegl1-nvidia (no description available)
un libgl1-glvnd-nvidia-glx (no description available)
ii libgl1-nvidia-glvnd-glx:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX library (GLVND variant)
un libgl1-nvidia-glx (no description available)
un libgl1-nvidia-glx-any (no description available)
un libgl1-nvidia-glx-i386 (no description available)
un libgl1-nvidia-legacy-390xx-glx (no description available)
un libgl1-nvidia-tesla-418-glx (no description available)
un libgldispatch0-nvidia (no description available)
ii libgles-nvidia1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 1.x library
ii libgles-nvidia2:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL|ES 2.x library
un libgles1-glvnd-nvidia (no description available)
un libgles2-glvnd-nvidia (no description available)
un libglvnd0-nvidia (no description available)
ii libglx-nvidia0:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary GLX library
un libglx0-glvnd-nvidia (no description available)
ii libnvidia-cbl:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (cbl) library
un libnvidia-cbl-470.141.03 (no description available)
un libnvidia-cfg.so.1 (no description available)
ii libnvidia-cfg1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any (no description available)
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
ii libnvidia-egl-wayland1:amd64 1:1.1.5-1 amd64 Wayland EGL External Platform library -- shared library
ii libnvidia-eglcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL core libraries
un libnvidia-eglcore-470.141.03 (no description available)
ii libnvidia-encode1:amd64 470.141.03-1~deb11u1 amd64 NVENC Video Encoding runtime library
un libnvidia-gl-390 (no description available)
un libnvidia-gl-410 (no description available)
ii libnvidia-glcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary OpenGL/GLX core libraries
un libnvidia-glcore-470.141.03 (no description available)
ii libnvidia-glvkspirv:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan Spir-V compiler library
un libnvidia-glvkspirv-470.141.03 (no description available)
un libnvidia-legacy-340xx-cfg1 (no description available)
un libnvidia-legacy-390xx-cfg1 (no description available)
un libnvidia-legacy-390xx-egl-wayland1 (no description available)
un libnvidia-ml.so.1 (no description available)
ii libnvidia-ml1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Management Library (NVML) runtime library
ii libnvidia-ptxjitcompiler1:amd64 470.141.03-1~deb11u1 amd64 NVIDIA PTX JIT Compiler library
ii libnvidia-rtcore:amd64 470.141.03-1~deb11u1 amd64 NVIDIA binary Vulkan ray tracing (rtcore) library
un libnvidia-rtcore-470.141.03 (no description available)
un libnvidia-tesla-418-cfg1 (no description available)
un libnvidia-tesla-450-cfg1 (no description available)
un libnvidia-tesla-470-cfg1 (no description available)
un libnvidia-tesla-510-cfg1 (no description available)
un libopengl0-glvnd-nvidia (no description available)
ii nvidia-alternative 470.141.03-1~deb11u1 amd64 allows the selection of NVIDIA as GLX provider
un nvidia-alternative--kmod-alias (no description available)
un nvidia-alternative-any (no description available)
un nvidia-alternative-legacy-173xx (no description available)
un nvidia-alternative-legacy-71xx (no description available)
un nvidia-alternative-legacy-96xx (no description available)
un nvidia-container-runtime (no description available)
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.11.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.11.0-1 amd64 NVIDIA Container Toolkit Base
un nvidia-cuda-mps (no description available)
un nvidia-current (no description available)
un nvidia-current-updates (no description available)
ii nvidia-detect 470.141.03-1~deb11u1 amd64 NVIDIA GPU detection utility
un nvidia-docker (no description available)
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
ii nvidia-driver 470.141.03-1~deb11u1 amd64 NVIDIA metapackage
un nvidia-driver-any (no description available)
ii nvidia-driver-bin 470.141.03-1~deb11u1 amd64 NVIDIA driver support binaries
un nvidia-driver-bin-470.141.03 (no description available)
un nvidia-driver-binary (no description available)
ii nvidia-driver-libs:amd64 470.141.03-1~deb11u1 amd64 NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
un nvidia-driver-libs-any (no description available)
un nvidia-driver-libs-nonglvnd (no description available)
ii nvidia-egl-common 470.141.03-1~deb11u1 amd64 NVIDIA binary EGL driver - common files
ii nvidia-egl-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA EGL installable client driver (ICD)
un nvidia-egl-wayland-common (no description available)
un nvidia-glx-any (no description available)
ii nvidia-installer-cleanup 20151021+13 amd64 cleanup after driver installation with the nvidia-installer
un nvidia-kernel-470.141.03 (no description available)
ii nvidia-kernel-common 20151021+13 amd64 NVIDIA binary kernel module support files
ii nvidia-kernel-dkms 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module DKMS source
un nvidia-kernel-source (no description available)
ii nvidia-kernel-support 470.141.03-1~deb11u1 amd64 NVIDIA binary kernel module support files
un nvidia-kernel-support--v1 (no description available)
un nvidia-kernel-support-any (no description available)
un nvidia-legacy-304xx-alternative (no description available)
un nvidia-legacy-304xx-driver (no description available)
un nvidia-legacy-340xx-alternative (no description available)
un nvidia-legacy-390xx-vulkan-icd (no description available)
ii nvidia-legacy-check 470.141.03-1~deb11u1 amd64 check for NVIDIA GPUs requiring a legacy driver
ii nvidia-modprobe 470.103.01-1~deb11u1 amd64 utility to load NVIDIA kernel modules and create device nodes
un nvidia-nonglvnd-vulkan-common (no description available)
un nvidia-nonglvnd-vulkan-icd (no description available)
ii nvidia-persistenced 470.103.01-2~deb11u1 amd64 daemon to maintain persistent software state in the NVIDIA driver
ii nvidia-settings 470.141.03-1~deb11u1 amd64 tool for configuring the NVIDIA graphics driver
un nvidia-settings-gtk-470.141.03 (no description available)
ii nvidia-smi 470.141.03-1~deb11u1 amd64 NVIDIA System Management Interface
ii nvidia-support 20151021+13 amd64 NVIDIA binary graphics driver support files
un nvidia-tesla-418-vulkan-icd (no description available)
un nvidia-tesla-450-vulkan-icd (no description available)
un nvidia-tesla-470-vulkan-icd (no description available)
un nvidia-tesla-alternative (no description available)
ii nvidia-vdpau-driver:amd64 470.141.03-1~deb11u1 amd64 Video Decode and Presentation API for Unix - NVIDIA driver
ii nvidia-vulkan-common 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan driver - common files
ii nvidia-vulkan-icd:amd64 470.141.03-1~deb11u1 amd64 NVIDIA Vulkan installable client driver (ICD)
un nvidia-vulkan-icd-any (no description available)
ii xserver-xorg-video-nvidia 470.141.03-1~deb11u1 amd64 NVIDIA binary Xorg driver
un xserver-xorg-video-nvidia-any (no description available)
un xserver-xorg-video-nvidia-legacy-304xx (no description available)
```
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
```