Closed · akfmdl · closed 2 years ago
I deployed gpu-feature-discovery with kubectl, but the pod fails with the following error in its logs:
[Pod log]
kubectl logs gpu-feature-discovery-2mwxm gpu-feature-discovery
2022/08/30 02:07:24 Starting OS watcher.
2022/08/30 02:07:24 Loading configuration.
gpu-feature-discovery: 2022/08/30 02:07:24 Running gpu-feature-discovery in version d83f52bf
gpu-feature-discovery: 2022/08/30 02:07:24 Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
gpu-feature-discovery: 2022/08/30 02:07:24 Start running
gpu-feature-discovery: 2022/08/30 02:07:24 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/08/30 02:07:24 Exiting
gpu-feature-discovery: 2022/08/30 02:07:24 Error: error creating NVML labeler: failed to initialize NVML: unexpected failure calling nvml.Init: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
[Nvidia-smi in node]
nvidia-smi
Tue Aug 30 11:05:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:27:00.0 Off |                    0 |
|  0%   37C    P8    32W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:A3:00.0 Off |                    0 |
|  0%   32C    P8    29W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          Off  | 00000000:C3:00.0 Off |                    0 |
|  0%   35C    P8    30W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
[nvidia-container info in node]
sudo dpkg -l '*nvidia-container*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook
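The driver and the container toolkit are both installed, yet the pod cannot load libnvidia-ml.so.1. That library is injected into containers by the NVIDIA runtime, so a likely cause is that Docker is not using that runtime by default. A quick check from the node (a sketch assuming Docker is the container runtime; run it on the GPU node itself):

```shell
# If this prints "runc", containers started without an explicit
# --runtime flag do not get the NVIDIA libraries injected
docker info --format '{{.DefaultRuntime}}'

# Confirm the driver library itself is registered on the host
ldconfig -p | grep libnvidia-ml
```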
[Solution] Set "nvidia" as the default container runtime in the node's Docker configuration:
sudo vim /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
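If the node's /etc/docker/daemon.json already contains other settings (log drivers, registry mirrors, etc.), editing it by hand risks clobbering them. A minimal sketch that merges the same two keys into an existing config; the helper name is mine, not part of any NVIDIA tooling:

```python
import json

def set_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a daemon.json dict with the NVIDIA runtime
    registered and set as the default, leaving other keys untouched."""
    config = dict(config)  # shallow copy; do not mutate the caller's dict
    config["default-runtime"] = "nvidia"
    runtimes = dict(config.get("runtimes", {}))
    runtimes["nvidia"] = {
        "path": "nvidia-container-runtime",
        "runtimeArgs": [],
    }
    config["runtimes"] = runtimes
    return config

if __name__ == "__main__":
    path = "/etc/docker/daemon.json"
    try:
        with open(path) as f:
            existing = json.load(f)
    except FileNotFoundError:
        existing = {}  # no daemon.json yet; start from an empty config
    # Print the merged config for review before writing it back as root
    print(json.dumps(set_nvidia_default_runtime(existing), indent=4))
```

After writing the merged config back, restart Docker (`sudo systemctl restart docker`) and delete the failing gpu-feature-discovery pod so the DaemonSet recreates it under the NVIDIA runtime.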