4paradigm / k8s-vgpu-scheduler

The OpenAIOS vGPU device plugin for Kubernetes originated from the OpenAIOS project. It virtualizes GPU device memory so that applications can access a larger memory space than the device's physical capacity, and it is designed to make extended device memory easy to use for AI workloads.
Apache License 2.0

Handle_remap not found handle #22

Open RexQian opened 2 years ago

RexQian commented 2 years ago

1. Issue or feature description

While using vgpu, the error "Handle_remap not found handle" occasionally occurs.

2. Steps to reproduce the issue

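The problem occurs only occasionally. When it does, running nvidia-smi inside the pod's container fails with an error, and recreating the pod restores normal behavior.

Running nvidia-smi on the host works normally, and running nvidia-smi in other pods on the same host also works normally.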
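Because the failure is intermittent, it can help to check containers for it automatically. The following is a minimal sketch, not part of the project: it runs a command, captures its output, and looks for the marker string from the error log in this report. The helper names `is_affected` and `run_and_check` are hypothetical, and the pod name in the usage note is a placeholder.

```python
import subprocess
from typing import Sequence

# Marker string copied from the error line in this report's log.
MARKER = "Handle_remap not found handle"


def is_affected(output: str) -> bool:
    """Return True if captured nvidia-smi output contains the failure marker."""
    return MARKER in output


def run_and_check(cmd: Sequence[str]) -> bool:
    """Run cmd, merge stderr into stdout, and test the output for the marker.

    The return code is deliberately ignored: the log shows the process
    aborting, but the marker string is the more specific signal.
    """
    proc = subprocess.run(
        list(cmd),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return is_affected(proc.stdout)
```

For example, `run_and_check(["kubectl", "exec", "<pod-name>", "--", "nvidia-smi"])` would check one pod, assuming `kubectl` is configured for the cluster. Running the same check on the host with `["nvidia-smi"]` should return False if the host behaves as described above.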

3. Information to attach (optional if deemed irrelevant)

Error log

root@service416776181220773888-55d7479f64-tvg9r:/# nvidia-smi
[4pdvGPU Debug(99:140414784235264:libvgpu.c:39)]: init_dlsym

[4pdvGPU Debug(99:140414784235264:libvgpu.c:61)]: into dlsym nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
...
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuEventDestroy_v2 89
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadDataEx 90
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadFatBinary 91
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleGetFunction 92
[4pdvGPU Info(99:140414784235264:hook.c:136)]: loaded_cuda_libraries
[4pdvGPU Debug(99:140414784235264:multiprocess_memory_limit.c:476)]: Try create shrreg
[4pdvGPU Debug(99:140414784235264:hook.c:558)]: nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:560)]: Hijacking nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:544)]: Hijacking nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=1
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=2
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU ERROR (pid:99 thread=140414784235264 hook.c:285)]: Handle_remap not found handle=7fb4daa19938
nvidia-smi: /home/limengxuan/work/libcuda_override/src/nvml/hook.c:285: handle_remap: Assertion `0' failed.
Aborted (core dumped)

error-in-container.log

Host nvidia-smi -a output: nvidia-smi-host.txt
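To compare failing runs, the error log above can be reduced to the device indices queried before the abort and the handle that could not be remapped. This is a rough sketch that assumes the line formats shown in this report's log. The function name `summarize` and both regular expressions are illustrative and are not taken from the project.

```python
import re

# Patterns derived from the log lines in this report (an assumption about
# the general format, based on this single sample).
ERROR_RE = re.compile(
    r"\[4pdvGPU ERROR \(pid:(?P<pid>\d+) thread=(?P<thread>\d+) "
    r"(?P<file>[\w.]+):(?P<line>\d+)\)\]: "
    r"Handle_remap not found handle=(?P<handle>[0-9a-fA-F]+)"
)
INDEX_RE = re.compile(r"nvmlDeviceGetHandleByIndex_v2 index=(\d+)")


def summarize(log: str) -> dict:
    """Collect queried device indices and the first Handle_remap failure."""
    indices = []
    failure = None
    for line in log.splitlines():
        match = INDEX_RE.search(line)
        if match:
            indices.append(int(match.group(1)))
            continue
        match = ERROR_RE.search(line)
        if match:
            # Keep pid, thread, file, line and handle as strings.
            failure = match.groupdict()
            break
    return {"indices": indices, "failure": failure}
```

Applied to the log in this report, the summary would list the index queries 0, 0, 1 and 2, followed by a failure at hook.c line 285 for handle 7fb4daa19938.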