Open jeonghyunkeem opened 3 weeks ago
Could you please provide the exact hami image version to help trace the specific code line? It currently appears that certain map-type fields in the scheduler might be accessed concurrently without locks, causing a fatal error: concurrent map iteration and map write
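For readers unfamiliar with this class of failure, here is a minimal Go sketch of the general pattern: a map shared between goroutines, guarded by a sync.RWMutex so that iteration and writes never overlap. It is not taken from the HAMi source. The nodeRegistry type, its fields, and its methods are hypothetical names chosen only for illustration. Removing the lock calls from a program like this one is the kind of change that can trigger the runtime's "concurrent map iteration and map write" fatal error.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeRegistry is a hypothetical stand-in for a scheduler structure that
// keeps per-node state in a map. It is NOT the actual HAMi data structure.
type nodeRegistry struct {
	mu    sync.RWMutex
	nodes map[string]int // node name -> number of registered devices
}

func newNodeRegistry() *nodeRegistry {
	return &nodeRegistry{nodes: make(map[string]int)}
}

// set takes the write lock, so no reader can be iterating while the map changes.
func (r *nodeRegistry) set(name string, devices int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nodes[name] = devices
}

// total iterates under the read lock; many readers may hold it at once,
// but writers are excluded until every reader has released it.
func (r *nodeRegistry) total() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	sum := 0
	for _, d := range r.nodes {
		sum += d
	}
	return sum
}

func main() {
	r := newNodeRegistry()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(2)
		// Writer goroutine: updates one of four node entries.
		go func(i int) {
			defer wg.Done()
			r.set(fmt.Sprintf("node-%d", i%4), i)
		}(i)
		// Reader goroutine: iterates over the whole map.
		go func() {
			defer wg.Done()
			_ = r.total()
		}()
	}
	wg.Wait()
	// All goroutines have finished, so reading the map here is safe.
	fmt.Println("nodes tracked:", len(r.nodes))
}
```

Running a program like this with `go run -race` is one way to check whether any access bypasses the lock.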
@Nimbus318 vgpu-scheduler-extender uses the following image: projecthami/hami:v2.3.13
@jeonghyunkeem Got it, I checked, and I know where the problem is. This issue has already been fixed in #418, so it should no longer occur if you use the latest version, 2.4.0.
@Nimbus318 Thanks. I'll test v2.4.0 and close this issue if it works.
What happened:
The vgpu-scheduler-extender container (part of the hami-scheduler pod) keeps terminating with exit code 2.
What you expected to happen:
vgpu-scheduler-extender stays alive without termination.
How to reproduce it (as minimally and precisely as possible):
I'm not sure, as it happens randomly.
Anything else we need to know?:
I'm using multiple GPU nodes in my cluster, and each node has a hami.io/node-nvidia-register annotation as follows:
nvidia-smi -a on your host
/etc/docker/daemon.json
Here are the final logs of the terminated vgpu-scheduler-extender container:
sudo journalctl -r -u kubelet
dmesg
Environment:
docker version
uname -a