NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

unexpected failure calling nvml.Init: error opening libnvidia-ml.so.1 #25

Status: Closed (akfmdl closed this issue 2 years ago)

akfmdl commented 2 years ago

I deployed gpu-feature-discovery with kubectl, but the pod exits with the error shown at the end of the log below.

[Pod log]

kubectl logs gpu-feature-discovery-2mwxm gpu-feature-discovery
2022/08/30 02:07:24 Starting OS watcher.
2022/08/30 02:07:24 Loading configuration.
gpu-feature-discovery: 2022/08/30 02:07:24 Running gpu-feature-discovery in version d83f52bf
gpu-feature-discovery: 2022/08/30 02:07:24
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
gpu-feature-discovery: 2022/08/30 02:07:24 Start running
gpu-feature-discovery: 2022/08/30 02:07:24 Warning: Error removing output file: failed to remove output file: remove /etc/kubernetes/node-feature-discovery/features.d/gfd: no such file or directory
gpu-feature-discovery: 2022/08/30 02:07:24 Exiting
gpu-feature-discovery: 2022/08/30 02:07:24 Error: error creating NVML labeler: failed to initialize NVML: unexpected failure calling nvml.Init: error opening libnvidia-ml.so.1: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
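The last log line means the dynamic loader inside the container could not find libnvidia-ml.so.1, even though the driver is installed on the node. The following is a minimal sketch for reproducing that check from any container that has Python available. The function name can_load_shared_library is hypothetical, and running it in a test pod on the same node is an assumption, not something the original report did.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        # ctypes.CDLL goes through the platform loader (dlopen on Linux),
        # which is the kind of lookup that fails in the log above.
        ctypes.CDLL(name)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    lib = "libnvidia-ml.so.1"
    status = "loadable" if can_load_shared_library(lib) else "NOT loadable"
    print(f"{lib}: {status}")
```

If the library is not loadable inside the container but nvidia-smi works on the node, the container is probably not being started through the NVIDIA container runtime.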

[Nvidia-smi in node]

nvidia-smi
Tue Aug 30 11:05:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:27:00.0 Off |                    0 |
|  0%   37C    P8    32W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          Off  | 00000000:A3:00.0 Off |                    0 |
|  0%   32C    P8    29W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          Off  | 00000000:C3:00.0 Off |                    0 |
|  0%   35C    P8    30W / 300W |      4MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1865      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

[nvidia-container info in node]

sudo dpkg -l '*nvidia-container*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook

Solution

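Set nvidia as the default Docker runtime on the GPU node. With that runtime, the NVIDIA container toolkit mounts the driver libraries, including libnvidia-ml.so.1, into the container.

sudo vim /etc/docker/daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

Then restart Docker so the new default runtime takes effect (on a systemd-based node):

sudo systemctl restart docker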
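If /etc/docker/daemon.json already contains other settings, overwriting it with the snippet above would discard them. Here is a minimal sketch of merging the runtime settings into an existing configuration instead. The helper name with_nvidia_default_runtime and the sample "log-driver" entry are hypothetical, chosen only for illustration.

```python
import json


def with_nvidia_default_runtime(config: dict) -> dict:
    """Return a copy of a Docker daemon config with nvidia as the default runtime."""
    merged = dict(config)
    runtimes = dict(merged.get("runtimes", {}))
    # Keep an existing "nvidia" runtime entry if present; otherwise add the
    # same entry used in the solution above.
    runtimes.setdefault(
        "nvidia",
        {"path": "nvidia-container-runtime", "runtimeArgs": []},
    )
    merged["runtimes"] = runtimes
    merged["default-runtime"] = "nvidia"
    return merged


if __name__ == "__main__":
    # Hypothetical pre-existing settings; a real script would read them from
    # /etc/docker/daemon.json with json.load instead.
    existing = {"log-driver": "json-file"}
    print(json.dumps(with_nvidia_default_runtime(existing), indent=2))
```

The function returns a new dictionary rather than modifying its input, so the result can be reviewed before anything is written back to disk.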