Handle_remap not found handle

1. Issue or feature description

While using vGPU, the error "Handle_remap not found handle" occasionally occurs.

2. Steps to reproduce the issue

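The problem occurs only occasionally. When it does, running nvidia-smi inside the pod's container fails with the error below, and recreating the pod restores normal operation.

Running nvidia-smi on the host works normally, and running nvidia-smi in other pods on the same host also works normally.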

3. Information to attach (optional if deemed irrelevant)

Error log

root@service416776181220773888-55d7479f64-tvg9r:/# nvidia-smi
[4pdvGPU Debug(99:140414784235264:libvgpu.c:39)]: init_dlsym

[4pdvGPU Debug(99:140414784235264:libvgpu.c:61)]: into dlsym nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
...
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuEventDestroy_v2 89
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadDataEx 90
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleLoadFatBinary 91
[4pdvGPU Debug(99:140414784235264:hook.c:129)]: LOADING cuModuleGetFunction 92
[4pdvGPU Info(99:140414784235264:hook.c:136)]: loaded_cuda_libraries
[4pdvGPU Debug(99:140414784235264:multiprocess_memory_limit.c:476)]: Try create shrreg
[4pdvGPU Debug(99:140414784235264:hook.c:558)]: nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:560)]: Hijacking nvmlInit_v2
[4pdvGPU Debug(99:140414784235264:hook.c:542)]: nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:544)]: Hijacking nvmlInitWithFlags
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=0
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=1
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU Debug(99:140414784235264:hook.c:472)]: nvmlDeviceGetHandleByIndex_v2 index=2
[4pdvGPU Debug(99:140414784235264:hook.c:476)]: Hijacking nvmlDeviceGetHandleByIndex_v2
[4pdvGPU Debug(99:140414784235264:nvml_entry.c:775)]: Hijacking nvmlDeviceGetUUID
[4pdvGPU ERROR (pid:99 thread=140414784235264 hook.c:285)]: Handle_remap not found handle=7fb4daa19938
nvidia-smi: /home/limengxuan/work/libcuda_override/src/nvml/hook.c:285: handle_remap: Assertion `0' failed.
Aborted (core dumped)

error-in-container.log
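
When comparing a failing pod against a healthy one, it can help to reduce a captured log such as error-in-container.log to the device indexes that were queried and the handle that could not be remapped. The following is a minimal sketch that assumes the log lines have the same shape as the ones pasted above; the function name summarize_vgpu_log and the returned dictionary keys are hypothetical and not part of k8s-vgpu-scheduler.

```python
import re
import sys

# Patterns are based only on the log lines shown in this issue (assumption:
# other versions of the library may format these messages differently).
HANDLE_ERROR = re.compile(r"Handle_remap not found handle=([0-9a-fA-F]+)")
INDEX_QUERY = re.compile(r"nvmlDeviceGetHandleByIndex_v2 index=(\d+)")


def summarize_vgpu_log(text):
    """Return the queried device indexes and the first unmapped handle, if any."""
    indexes = []
    missing = None
    for line in text.splitlines():
        match = INDEX_QUERY.search(line)
        if match:
            indexes.append(int(match.group(1)))
        match = HANDLE_ERROR.search(line)
        if match and missing is None:
            missing = match.group(1)
    return {"queried_indexes": indexes, "missing_handle": missing}


if __name__ == "__main__":
    # Usage: python summarize.py < error-in-container.log
    print(summarize_vgpu_log(sys.stdin.read()))
```

For a healthy pod the "missing_handle" value would be None, which makes the two cases easy to tell apart when collecting logs from several pods.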

Output of nvidia-smi -a on the host: nvidia-smi-host.txt

4paradigm / k8s-vgpu-scheduler

Handle_remap not found handle #22
