4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originates from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the GPU's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1 #4

Closed haijohn closed 2 years ago

haijohn commented 3 years ago

nvidia-smi runs successfully on the host and inside the container if I use the original k8s-device-plugin, but I get the following error when using this vGPU device plugin:

output of nvidia-smi

can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
Tue Aug  3 08:36:12 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   35C    P0    63W / 250W |    174MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
haijohn commented 3 years ago

detailed logs:

into dlsym nvmlInitWithFlags

nvmlInitWithFlags
can't find function nvmlDeviceGetComputeRunningProcesses_v2 in libnvidia-ml.so.1
loaded nvml libraries
NVML DeviceGetHandleByUUIDNot supportedGPU-caba9b00-6386-2c33-7834-646ef2692cb7

v=0 p=GPU-caba9b00-6386-2c33-7834-646ef2692cb7 idx=0

virtual devices=1

sm_limit 0:100

sm_limit 1:100

sm_limit 2:100

sm_limit 3:100

sm_limit 4:100

sm_limit 5:100

sm_limit 6:100

sm_limit 7:100

sm_limit 8:100

sm_limit 9:100

sm_limit 10:100

sm_limit 11:100

sm_limit 12:100

sm_limit 13:100

sm_limit 14:100

sm_limit 15:100

into dlsym nvmlInternalGetExportTable

into dlsym nvmlDeviceGetCount_v2

NVML DeviceGetCount virtual=1

into dlsym nvmlDeviceGetHandleByIndex_v2

nvmlDeviceGetHandleByIndex_v2 index=0

into dlsym nvmlEventSetCreate

into dlsym nvmlSystemGetDriverVersion

into dlsym nvmlSystemGetCudaDriverVersion_v2

into dlsym cuDriverGetVersion

Wed Aug  4 01:57:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
into dlsym nvmlDeviceGetIndex

into dlsym nvmlDeviceGetName

into dlsym nvmlDeviceGetPciInfo_v3

into dlsym nvmlDeviceGetPersistenceMode

into dlsym nvmlDeviceGetDisplayActive

into dlsym nvmlDeviceGetEccMode

into dlsym nvmlDeviceGetFanSpeed

into dlsym nvmlDeviceGetTemperature

into dlsym nvmlDeviceGetPerformanceState

into dlsym nvmlDeviceGetPowerUsage

into dlsym nvmlDeviceGetEnforcedPowerLimit

into dlsym nvmlDeviceGetMemoryInfo

origin_free=12808486912 total=12808486912

dev=0 i=0

get_current_device_memory_usage:tick=5 result=117440512

usage=117440512 limit=12808355840

into dlsym nvmlDeviceGetUtilizationRates

into dlsym nvmlDeviceGetComputeMode

|   0  Tesla M40           Off  | 00000000:00:08.0 Off |                  Off |
| N/A   25C    P8    16W / 250W |    112MiB / 12215MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
into dlsym nvmlDeviceGetComputeRunningProcesses

Get RunningProcesses_v2
into dlsym nvmlDeviceGetGraphicsRunningProcesses

into dlsym nvmlDeviceGetMPSComputeRunningProcesses

|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
into dlsym nvmlDeviceValidateInforom

into dlsym nvmlEventSetFree

into dlsym nvmlShutdown

Calling exit handler
archlitchi commented 3 years ago

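This is a warning. It means your driver version is not the latest, so some CUDA 11 interfaces cannot be found. It does not affect the result. Also, please upgrade the image to 4pdosc/k8s-device-plugin:latest.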
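
For readers wondering why a missing symbol can be harmless: a wrapper library can probe for a newer versioned entry point and, if the driver's library does not export it, log a message and carry on with an older one. The logs above do show the unversioned nvmlDeviceGetComputeRunningProcesses being looked up later. The following is only a minimal sketch of that general pattern, not the plugin's actual code. The helper `resolve_first` and the symbol name `strlen_v2_hypothetical` are made up for illustration, and the C library stands in for libnvidia-ml.so.1 so the example runs on a machine without a GPU driver.

```python
import ctypes


def resolve_first(lib, candidates):
    """Return (name, function) for the first symbol in `candidates` that `lib` exports.

    Returns (None, None) if none of the candidates can be found.
    """
    for name in candidates:
        try:
            # ctypes raises AttributeError when the symbol lookup fails.
            return name, getattr(lib, name)
        except AttributeError:
            # Mirrors the style of the warning in the logs: report and continue.
            print(f"can't find function {name}; trying the next candidate")
    return None, None


# Assumption: CDLL(None) opens the symbols already loaded into the process
# (works on Linux and macOS), which includes the C library.
libc = ctypes.CDLL(None)

# Try a hypothetical "newer" symbol first, then fall back to one that exists.
name, fn = resolve_first(libc, ["strlen_v2_hypothetical", "strlen"])

if fn is not None:
    fn.argtypes = [ctypes.c_char_p]
    fn.restype = ctypes.c_size_t
    print(f"resolved {name}; strlen(b'nvml') = {fn(b'nvml')}")
```

With an NVML library present, the same helper could be pointed at `ctypes.CDLL("libnvidia-ml.so.1")` and given the versioned and unversioned function names from the logs, assuming the older symbol is an acceptable substitute for the caller.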

haijohn commented 3 years ago

Thanks, the latest image solves the problem.