can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1

haijohn commented 3 years ago

nvidia-smi runs successfully on the host and inside the container when I use the original k8s-device-plugin, but I get the following error when using this vgpu device plugin.

output of nvidia-smi

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
Tue Aug  3 08:36:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   35C    P0    63W / 250W |    174MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|

haijohn commented 3 years ago

detailed logs:

into dlsym nvmlInitWithFlags

nvmlInitWithFlags
can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
loaded nvml libraries
NVML DeviceGetHandleByUUIDNot supportedGPU-caba9b00-6386-2c33-7834-646ef2692cb7

v=0 p=GPU-caba9b00-6386-2c33-7834-646ef2692cb7 idx=0

virtual devices=1

sm_limit 0:100

sm_limit 1:100

sm_limit 2:100

sm_limit 3:100

sm_limit 4:100

sm_limit 5:100

sm_limit 6:100

sm_limit 7:100

sm_limit 8:100

sm_limit 9:100

sm_limit 10:100

sm_limit 11:100

sm_limit 12:100

sm_limit 13:100

sm_limit 14:100

sm_limit 15:100

into dlsym nvmlInternalGetExportTable

into dlsym nvmlDeviceGetCount_v2

NVML DeviceGetCount virtual=1

into dlsym nvmlDeviceGetHandleByIndex_v2

nvmlDeviceGetHandleByIndex_v2 index=0

into dlsym nvmlEventSetCreate

into dlsym nvmlSystemGetDriverVersion

into dlsym nvmlSystemGetCudaDriverVersion_v2

into dlsym cuDriverGetVersion

Wed Aug  4 01:57:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
into dlsym nvmlDeviceGetIndex

into dlsym nvmlDeviceGetName

into dlsym nvmlDeviceGetPciInfo_v3

into dlsym nvmlDeviceGetPersistenceMode

into dlsym nvmlDeviceGetDisplayActive

into dlsym nvmlDeviceGetEccMode

into dlsym nvmlDeviceGetFanSpeed

into dlsym nvmlDeviceGetTemperature

into dlsym nvmlDeviceGetPerformanceState

into dlsym nvmlDeviceGetPowerUsage

into dlsym nvmlDeviceGetEnforcedPowerLimit

into dlsym nvmlDeviceGetMemoryInfo

origin_free=12808486912 total=12808486912

dev=0 i=0

get_current_device_memory_usage:tick=5 result=117440512

usage=117440512 limit=12808355840

into dlsym nvmlDeviceGetUtilizationRates

into dlsym nvmlDeviceGetComputeMode

|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   25C    P8    16W / 250W |    112MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
into dlsym nvmlDeviceGetComputeRunningProcesses

Get RunningProcesses_v2
into dlsym nvmlDeviceGetGraphicsRunningProcesses

into dlsym nvmlDeviceGetMPSComputeRunningProcesses

|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
into dlsym nvmlDeviceValidateInforom

into dlsym nvmlEventSetFree

into dlsym nvmlShutdown

Calling exit handler

archlitchi commented 3 years ago

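This is a warning. It means your driver version is not the latest, so some CUDA 11 interfaces cannot be found. It does not affect the result. Also, please upgrade the image to 4pdosc/k8s-device-plugin:latest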
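To check which variant of the symbol a given driver's libnvidia-ml.so.1 exports, something like the following may help. It is only a minimal sketch using Python's ctypes, not the plugin's actual lookup code. The helper names `first_available` and `probe_nvml` are made up for illustration, and the fallback order (try the `_v2` name first, then the unversioned one) is an assumption based on the log lines above.

```python
import ctypes

# Candidate symbol names, newest first. The _v2 name is the one the
# warning reports as missing; the unversioned name appears in the log.
CANDIDATES = [
    "nvmlDeviceGetComputeRunningProcesses_v2",
    "nvmlDeviceGetComputeRunningProcesses",
]


def first_available(lib, candidates):
    """Return (name, function) for the first symbol that lib exports.

    Hypothetical helper: `lib` can be a ctypes.CDLL or any object whose
    attribute lookup raises AttributeError for a missing symbol.
    Returns (None, None) if none of the candidates is found.
    """
    for name in candidates:
        try:
            return name, getattr(lib, name)
        except AttributeError:
            # ctypes raises AttributeError when the shared library
            # does not export the requested symbol.
            continue
    return None, None


def probe_nvml(path="libnvidia-ml.so.1"):
    """Return the name of the exported variant, or None.

    None means either the library could not be loaded or it exports
    neither candidate.
    """
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    name, _ = first_available(lib, CANDIDATES)
    return name


if __name__ == "__main__":
    print("available symbol:", probe_nvml())
```

If this reports only the unversioned name, the driver predates the `_v2` interface, which would match the warning described above.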

haijohn commented 3 years ago

Thanks, the latest image solves the problem.

4paradigm / k8s-vgpu-scheduler

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1 #4