kpouget opened this issue 3 years ago
I'm guessing this has to do with `/dev` nodes under `/dev/nvidia-caps` somehow not being synced to all containers as MIG devices are created and destroyed. If DCGM uses NVML to look up the set of MIG devices and then expects the device nodes representing them to be present, I could see how it might crash. Running `nvidia-smi -L` will cause any "missing" device nodes to be created dynamically. This is just a hypothesis, but it should be a good starting point for anyone looking into this.
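
One quick way to test this hypothesis could be something like the sketch below (assuming the `pynvml` / nvidia-ml-py bindings are available inside the affected container). It just prints the MIG devices NVML reports next to whatever nodes currently exist under `/dev/nvidia-caps`, so you can see whether the two views drift apart after the MIG slicing changes; it is not a strict one-to-one comparison, since each MIG device maps to more than one cap node.

```python
# Hypothetical check: list the MIG devices NVML sees and the nvidia-caps
# device nodes visible inside this container. Assumes the pynvml
# (nvidia-ml-py) bindings are installed.
import os
import pynvml

pynvml.nvmlInit()
try:
    mig_uuids = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            max_mig = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
        except pynvml.NVMLError:
            continue  # GPU does not support MIG
        for j in range(max_mig):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, j)
            except pynvml.NVMLError:
                continue  # MIG slot not populated
            mig_uuids.append(pynvml.nvmlDeviceGetUUID(mig))

    caps_dir = "/dev/nvidia-caps"
    cap_nodes = sorted(os.listdir(caps_dir)) if os.path.isdir(caps_dir) else []
    print(f"NVML sees {len(mig_uuids)} MIG device(s): {mig_uuids}")
    print(f"{caps_dir} contains {len(cap_nodes)} node(s): {cap_nodes}")
finally:
    pynvml.nvmlShutdown()
```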
While working on MIG support for the GPU-operator on OpenShift, I observed a crash of the `dcgm-exporter` process. The crash occurred shortly after I changed the MIG slicing in another Pod. See this file for the logs: nvidia-dcgm-exporter.log
I also noticed that if I call `nvidia-smi -L` periodically in the `driver-daemonset` Pod, the crash doesn't occur.
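
For reference, the periodic call is roughly equivalent to the following sketch (the 30-second interval is arbitrary, and it assumes `nvidia-smi` is on the PATH of the driver container):

```python
# Minimal sketch of the workaround: re-run `nvidia-smi -L` on an interval so
# that any missing MIG device nodes under /dev get recreated. The interval
# is an arbitrary choice, not a value from the original report.
import subprocess
import time

while True:
    subprocess.run(["nvidia-smi", "-L"], check=False,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    time.sleep(30)
```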