dcgm-exporter crashes after MIG reconfiguration

NVIDIA / gpu-monitoring-tools

dcgm-exporter crashes after MIG reconfiguration #142

Open kpouget opened 3 years ago

kpouget commented 3 years ago

While working on MIG support for the GPU-operator on OpenShift, I observed a crash of the dcgm-exporter process. This crash occurred shortly after I changed the MIG slicing in another Pod.

See this file for the logs: nvidia-dcgm-exporter.log

I also noticed that if I run nvidia-smi -L periodically in the driver-daemonset Pod, the crash does not occur.
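As a stopgap while the root cause is investigated, that observation could be turned into a small keep-alive loop. This is only a sketch: the function name keep_nodes_fresh and the 30-second interval are assumptions chosen for illustration, and the script has to run somewhere nvidia-smi is on the PATH (for example the driver-daemonset Pod).

```python
import subprocess
import time


def keep_nodes_fresh(interval_s=30.0, iterations=None,
                     run=subprocess.run, sleep=time.sleep):
    """Run `nvidia-smi -L` periodically and return how many calls succeeded.

    iterations=None loops forever. `run` and `sleep` are injectable so the
    loop can be exercised without a GPU (hypothetical helper, not part of
    dcgm-exporter).
    """
    succeeded = 0
    done = 0
    while iterations is None or done < iterations:
        # Listing the devices is the call reported to avoid the crash.
        result = run(["nvidia-smi", "-L"], capture_output=True, text=True)
        if result.returncode == 0:
            succeeded += 1
        done += 1
        sleep(interval_s)
    return succeeded
```

Calling keep_nodes_fresh() with no arguments loops until interrupted. It hides the symptom only and does not explain why the exporter crashes.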

klueska commented 3 years ago

I'm guessing this has to do with the device nodes under /dev/nvidia-caps somehow not being synced to all containers as MIG devices are created and destroyed. If DCGM uses NVML to look up the set of MIG devices and then expects the device nodes representing them to be present, I could see how it might crash. Running nvidia-smi -L causes any "missing" device nodes to be created dynamically. This is just a hypothesis, but it should be a good starting point for anyone looking into this.
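One way to test this hypothesis from inside the dcgm-exporter container is to compare what the driver advertises with what is actually present under /dev/nvidia-caps. The sketch below assumes the layout described in NVIDIA's MIG documentation, where each GPU instance and compute instance has an access file under /proc/driver/nvidia/capabilities containing a "DeviceFileMinor: N" line, and the matching node is /dev/nvidia-caps/nvidia-capN. The function name missing_cap_nodes is hypothetical, and neither path has been confirmed against the setup in this issue.

```python
import re
from pathlib import Path

# Assumed format of the access files: a line such as "DeviceFileMinor: 30".
MINOR_RE = re.compile(r"^DeviceFileMinor:\s*(\d+)", re.MULTILINE)


def missing_cap_nodes(proc_root="/proc/driver/nvidia/capabilities",
                      dev_root="/dev/nvidia-caps"):
    """Return sorted device-node paths that are advertised but absent.

    Both roots are parameters so the check can be pointed at another
    mount (or a test directory) instead of the assumed default paths.
    """
    missing = []
    for access in Path(proc_root).rglob("access"):
        match = MINOR_RE.search(access.read_text())
        if match is None:
            continue  # not a capability file in the expected format
        node = Path(dev_root) / f"nvidia-cap{match.group(1)}"
        if not node.exists():
            missing.append(str(node))
    return sorted(missing)
```

Running missing_cap_nodes() right after a MIG reconfiguration, before and after nvidia-smi -L, would show whether nodes are absent in the exporter container and whether they appear afterwards. An empty list in both cases would point away from this explanation.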