NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter running on "g4dn.metal" in AWS EKS fails with "fatal: morestack on gsignal" #208

Open SQUIDwarrior opened 3 years ago

SQUIDwarrior commented 3 years ago

We are running the "dcgm-exporter" Kubernetes DaemonSet on AWS EKS. Whenever we use a "g4dn.metal" EC2 instance, "dcgm-exporter" gets stuck in a crash loop with the following log output:

time="2021-08-13T20:07:08Z" level=info msg="Starting dcgm-exporter"
time="2021-08-13T20:07:09Z" level=info msg="DCGM successfully initialized!"
time="2021-08-13T20:07:27Z" level=info msg="Collecting DCP Metrics"
fatal: morestack on gsignal

This does not happen on any other G4DN instance type, only on the "metal" variant. The NVIDIA drivers are installed, and user code that uses the GPUs runs fine. Running "nvidia-smi" shows all 8 GPUs as expected. I have searched but cannot find any information on this error.

josericardomcastro commented 3 years ago

Same issue here on my on-premises cluster...