4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated in the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the device's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Failed to initialize NVML: could not load NVML library. #36

Open zbjjyy opened 5 months ago

zbjjyy commented 5 months ago

ENV:

K8s: v1.23.10
Runtime: docker 20.10.8
NVIDIA System Management Interface: v535.161.07
Image: 4pdosc/k8s-device-plugin:v0.10.0.4-ubuntu20.04

Issue:

After deploying the plugin DaemonSet, the logs show:

2024/03/27 15:41:13 Loading PciInfo

 0 = 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

 1 = 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]

 2 = 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]

 3 = 00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)

 4 = 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)

 5 = 00:02.0 VGA compatible controller: Cirrus Logic GD 5446

 6 = 00:03.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 7 = 00:04.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge

 8 = 00:05.0 Ethernet controller: Red Hat, Inc. Virtio network device

 9 = 00:06.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)

 10 = 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio block device

 11 = 00:08.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

 found 00:08.0

 12 = 00:09.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

 13 = 

 pcibusstr= 00:08.0

 2024/03/27 15:41:13 Loading NVML

 2024/03/27 15:41:13 Failed to initialize NVML: could not load NVML library.

 2024/03/27 15:41:13 If this is a GPU node, did you set the docker default runtime to `nvidia`?

 2024/03/27 15:41:13 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites

 2024/03/27 15:41:13 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

 2024/03/27 15:41:13 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
  1. I have checked the environment, and nvidia-smi works on the VM:
root@master:/usr/local/vgpu# nvidia-smi 
Wed Mar 27 15:46:02 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           Off | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0              23W / 300W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
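
A working nvidia-smi on the host does not by itself show that the NVML shared library is visible inside the plugin container, which is where the "could not load NVML library" message comes from. The following is a minimal sketch of a check that could be run on the host and again inside the container. It assumes the library is named `libnvidia-ml.so.1` (the usual soname on Linux driver installs); the helper name `can_load_nvml` is hypothetical and is not part of the plugin.

```python
import ctypes


def can_load_nvml(lib_name: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the dynamic loader can open the given shared library.

    The default name is an assumption: it is the soname NVIDIA drivers
    normally install on Linux.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        # The loader could not find or open the library.
        return False
    return True


if __name__ == "__main__":
    print("NVML loadable:", can_load_nvml())
```

If this prints `True` on the host but `False` inside the plugin container, the driver libraries are probably not being injected into the container, which would point at the container runtime configuration rather than the driver install.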
zbjjyy commented 5 months ago

{ "default-runtime": "nvidia" }
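
The snippet above sets only `default-runtime`. The NVIDIA k8s-device-plugin prerequisites show a daemon.json that also registers the runtime under a `runtimes` key, so one thing worth ruling out is a missing `runtimes.nvidia` entry. Below is a minimal sketch of such a check, assuming the Docker config is the JSON text of `/etc/docker/daemon.json`; the function name `check_docker_runtime` is hypothetical, and a runtime registered some other way (for example through dockerd command-line flags) would not be seen by it.

```python
import json


def check_docker_runtime(config_text: str) -> list[str]:
    """Return a list of likely problems found in a Docker daemon.json text."""
    cfg = json.loads(config_text)
    problems = []

    # The device plugin expects the default runtime to be "nvidia".
    if cfg.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')

    # The name "nvidia" must also be defined as a runtime.
    runtimes = cfg.get("runtimes") or {}
    if "nvidia" not in runtimes:
        problems.append('no "nvidia" entry under "runtimes"')

    return problems


if __name__ == "__main__":
    # Path is an assumption: the default Docker daemon config location on Linux.
    with open("/etc/docker/daemon.json") as f:
        for problem in check_docker_runtime(f.read()):
            print("possible issue:", problem)
```

Applied to the one-line config posted here, the check would report the missing `runtimes` entry. Whether that is the actual cause in this issue is not confirmed by the thread. Docker needs a restart after daemon.json changes before the plugin pod is recreated.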