Open rrzatkie opened 2 years ago
Is your kubernetes cluster set up to use docker or containerd as its underlying container runtime? If it’s containerd, you need to follow the instructions under the containerd tab here to set it up:
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#install-nvidia-container-toolkit-nvidia-docker2
@klueska thank you for the response. My kubernetes is set to use docker. Fortunately I was able to locate the problem. I forgot to mention that I use Kind (https://kind.sigs.k8s.io/) for Kubernetes deployment and cluster creation. I found out that the docker container which hosts the k8s cluster was not initialized with the --gpus=all flag 😄 This is not covered in the current version of the Kind tool. With a bit of luck I noticed that jacobtomlinson created a PR with his implementation of this feature. For my own purposes it is enough to use his build.
To sum up: from the beginning it was not a problem of the nvidia device plugin itself, the lack of GPU support in kind did the harm 😃
PR: https://github.com/jacobtomlinson/kind/pull/1
This issue can be closed of course.
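For anyone hitting the same symptom, one way to check whether a node container was started with GPU access is to look at the `HostConfig.DeviceRequests` field in the output of `docker inspect`. The following is a minimal sketch, not part of the original report: the function name `has_gpu_request` and the two sample JSON strings are made up for illustration, and the samples are hand-written rather than captured from this cluster.

```python
import json


def has_gpu_request(inspect_output: str) -> bool:
    """Return True if any inspected container has a device request
    that lists the "gpu" capability (what `--gpus=all` is expected to add)."""
    containers = json.loads(inspect_output)
    for container in containers:
        host_config = container.get("HostConfig") or {}
        requests = host_config.get("DeviceRequests") or []
        for request in requests:
            for capability_group in request.get("Capabilities") or []:
                if "gpu" in capability_group:
                    return True
    return False


# Hand-written, illustrative samples (assumptions, not real output from the issue).
WITH_GPUS = '[{"HostConfig": {"DeviceRequests": [{"Driver": "", "Count": -1, "Capabilities": [["gpu"]]}]}}]'
WITHOUT_GPUS = '[{"HostConfig": {"DeviceRequests": null}}]'

print(has_gpu_request(WITH_GPUS))
print(has_gpu_request(WITHOUT_GPUS))
```

In practice the input would come from something like `docker inspect <node-container>`, where the container name depends on how the kind cluster was created.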
1. Issue or feature description
I am struggling to enable GPU on my local kubernetes cluster. I have two Tesla M10 available (visible in nvidia-smi). I am able to run a docker image based on nvidia/cuda:11.2 and easily get a proper response from nvidia-smi inside the container. When I add a kubernetes Deployment with a container based on the same image and add a request for nvidia.com/gpu: 1, I get logs:
I followed those tutorials:
Is there something that I am missing?
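The original Deployment manifest is not included in the report. As a rough sketch of what a Deployment requesting `nvidia.com/gpu: 1` might look like, the snippet below builds one as a Python dict and prints it as JSON, which `kubectl apply -f` also accepts. The name `gpu-test`, the helper `build_gpu_deployment`, and the `sleep infinity` command are assumptions made for illustration; the image tag is the one quoted in the description.

```python
import json


def build_gpu_deployment(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Deployment manifest whose single container asks for GPUs.

    Extended resources such as nvidia.com/gpu are specified under `limits`.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            # Hypothetical command, only to keep the pod running.
                            "command": ["sleep", "infinity"],
                            "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
                        }
                    ]
                },
            },
        },
    }


# "gpu-test" is a made-up name; the image tag is taken from the description.
manifest = build_gpu_deployment("gpu-test", "nvidia/cuda:11.2")
print(json.dumps(manifest, indent=2))
```

A pod from such a Deployment can only be scheduled once the device plugin has advertised `nvidia.com/gpu` on a node, which is the step that failed here.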
2. Steps to reproduce the issue
Common error checking:
nvidia-smi -a on your host
==============NVSMI LOG==============
Timestamp : Sat May 14 14:51:49 2022
Driver Version : 460.32.03
CUDA Version : 11.2
Attached GPUs : 2 GPU 00000000:0B:00.0 Product Name : Tesla M10 Product Brand : Tesla Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : 1425020050393 GPU UUID : GPU-e79e9189-d14c-0d3a-34eb-043008697f57 Minor Number : 0 VBIOS Version : 82.07.BC.00.04 MultiGPU Board : No Board ID : 0xb00 GPU Part Number : 900-22405-0100-030 Inforom Version Image Version : 2405.0070.00.02 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : Pass-Through Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x0B Device : 0x00 Domain : 0x0000 Device Id : 0x13BD10DE Bus Id : 00000000:0B:00.0 Sub System Id : 0x116010DE GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : N/A HW Power Brake Slowdown : N/A Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 8129 MiB Used : 0 MiB Free : 8129 MiB BAR1 Memory Usage Total : 256 MiB Used : 1 MiB Free : 255 MiB Compute Mode : Default Utilization Gpu : 1 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit 
Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 34 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : N/A GPU Target Temperature : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 17.01 W Power Limit : 53.00 W Default Power Limit : 53.00 W Enforced Power Limit : 53.00 W Min Power Limit : 26.50 W Max Power Limit : 53.00 W Clocks Graphics : 1032 MHz SM : 1032 MHz Memory : 2600 MHz Video : 929 MHz Applications Clocks Graphics : 1032 MHz Memory : 2600 MHz Default Applications Clocks Graphics : 1032 MHz Memory : 2600 MHz Max Clocks Graphics : 1202 MHz SM : 1202 MHz Memory : 2600 MHz Video : 1081 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes : None
GPU 00000000:13:00.0 Product Name : Tesla M10 Product Brand : Tesla Display Mode : Enabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : 1425020050393 GPU UUID : GPU-3e0656c4-543e-498c-7c01-233fa0b83445 Minor Number : 1 VBIOS Version : 82.07.BC.00.03 MultiGPU Board : No Board ID : 0x1300 GPU Part Number : 900-22405-0100-030 Inforom Version Image Version : 2405.0070.00.02 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GPU Virtualization Mode Virtualization Mode : Pass-Through Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x13 Device : 0x00 Domain : 0x0000 Device Id : 0x13BD10DE Bus Id : 00000000:13:00.0 Sub System Id : 0x116010DE GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : N/A Performance State : P0 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : N/A HW Power Brake Slowdown : N/A Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 8129 MiB Used : 0 MiB Free : 8129 MiB BAR1 Memory Usage Total : 256 MiB Used : 1 MiB Free : 255 MiB Compute Mode : Default Utilization Gpu : 1 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : 
N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Aggregate Single Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Double Bit Device Memory : N/A Register File : N/A L1 Cache : N/A L2 Cache : N/A Texture Memory : N/A Texture Shared : N/A CBU : N/A Total : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 34 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : N/A GPU Target Temperature : N/A Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 16.71 W Power Limit : 53.00 W Default Power Limit : 53.00 W Enforced Power Limit : 53.00 W Min Power Limit : 26.50 W Max Power Limit : 53.00 W Clocks Graphics : 1032 MHz SM : 1032 MHz Memory : 2600 MHz Video : 929 MHz Applications Clocks Graphics : 1032 MHz Memory : 2600 MHz Default Applications Clocks Graphics : 1032 MHz Memory : 2600 MHz Max Clocks Graphics : 1202 MHz SM : 1202 MHz Memory : 2600 MHz Video : 1081 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Processes : None
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
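The configuration above can be checked mechanically. This is a small sketch, assuming the file content is read into a string; the helper name `check_nvidia_default_runtime` is hypothetical. Note that it only validates the host's Docker configuration: as the thread concludes, a host that passes this check can still fail when the cluster runs inside a kind node container that was started without GPU access.

```python
import json


def check_nvidia_default_runtime(daemon_json: str) -> list:
    """Return a list of problems found in a Docker daemon.json string.

    An empty list means "nvidia" is both registered and set as the default runtime.
    """
    config = json.loads(daemon_json)
    problems = []
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')
    if "nvidia" not in (config.get("runtimes") or {}):
        problems.append('no "nvidia" entry under "runtimes"')
    return problems


# The configuration as reported in this issue.
REPORTED = """
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
    }
}
"""

print(check_nvidia_default_runtime(REPORTED))  # → []
```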
$ kubectl logs nvidia-device-plugin-daemonset-j76n5 -n kube-system
2022/05/13 23:09:23 Loading NVML
2022/05/13 23:09:23 Failed to initialize NVML: could not load NVML library.
2022/05/13 23:09:23 If this is a GPU node, did you set the docker default runtime to 'nvidia'?
2022/05/13 23:09:23 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2022/05/13 23:09:23 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2022/05/13 23:09:23 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
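When several nodes or plugin pods are involved, it can help to scan the captured logs for this specific failure. A minimal sketch, assuming the log text has already been fetched (for example with the `kubectl logs` command above); the function name `nvml_failed` is made up for illustration, and the marker string is copied from the log in this report.

```python
# Marker copied from the device plugin log shown in this issue.
NVML_FAILURE = "Failed to initialize NVML"


def nvml_failed(log_text: str) -> bool:
    """Return True if any log line reports that NVML could not be initialized."""
    return any(NVML_FAILURE in line for line in log_text.splitlines())


# First two lines of the log from this report.
SAMPLE_LOG = (
    "2022/05/13 23:09:23 Loading NVML\n"
    "2022/05/13 23:09:23 Failed to initialize NVML: could not load NVML library.\n"
)

print(nvml_failed(SAMPLE_LOG))
```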
$ docker version
Client: Docker Engine - Community
 Version:           20.10.16
 API version:       1.41
 Go version:        go1.17.10
 Git commit:        aa7e414
 Built:             Thu May 12 09:17:23 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.16
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.17.10
  Git commit:       f756502
  Built:            Thu May 12 09:15:28 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.4
  GitCommit:        212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
 nvidia:
  Version:          1.1.1
  GitCommit:        v1.1.1-0-g52de29d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ uname -a
Linux smarttherapy2 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.9.0-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.9.0-1      amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.9.0-1      amd64        NVIDIA container runtime hook
un  nvidia-docker                 <none>       <none>       (no description available)
ii  nvidia-docker2                2.10.0-1     all          nvidia-docker CLI wrapper
$ nvidia-container-cli -V
cli-version: 1.9.0
lib-version: 1.9.0
build date: 2022-03-18T13:46+00:00
build revision: 5e135c17d6dbae861ec343e9a8d3a0d2af758a4f
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections