NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Error starting dcgm-exporter pod - module of DCGM that is not currently loaded #141

Open josericardomcastro opened 3 years ago

josericardomcastro commented 3 years ago

My dcgm-exporter pod is crashing. Any ideas?

time="2020-12-15T11:28:40Z" level=info msg="Starting dcgm-exporter"
time="2020-12-15T11:28:40Z" level=info msg="DCGM successfully initialized!"
time="2020-12-15T11:28:40Z" level=fatal msg="Error watching fields: This request is serviced by a module of DCGM that is not currently loaded"

dualvtable commented 3 years ago

Can you please provide details of your platform? For example:

  1. DCGM exporter version
  2. GPU used
  3. Driver version
  4. Kubernetes version
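
As a rough aid for collecting some of these details, the sketch below shells out to `nvidia-smi` and `kubectl` on a node where both are installed. The helper names `run` and `parse_gpu_csv` are hypothetical, and the script assumes `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` and `kubectl version` behave as on a typical install. The dcgm-exporter version is left out because how to obtain it depends on the deployment.

```python
import subprocess


def run(cmd):
    """Return the stripped stdout of cmd, or None if the tool is missing or fails."""
    try:
        out = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        )
    except (FileNotFoundError, subprocess.CalledProcessError,
            subprocess.TimeoutExpired):
        return None
    return out.stdout.strip()


def parse_gpu_csv(text):
    """Parse 'name, driver_version' rows from nvidia-smi CSV output."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            # Split on the first comma only: GPU name, then driver version.
            name, driver = [part.strip() for part in line.split(",", 1)]
            rows.append((name, driver))
    return rows


if __name__ == "__main__":
    gpu = run(["nvidia-smi", "--query-gpu=name,driver_version",
               "--format=csv,noheader"])
    print("GPUs and drivers:",
          parse_gpu_csv(gpu) if gpu else "nvidia-smi unavailable")
    print("Kubernetes:", run(["kubectl", "version"]) or "kubectl unavailable")
```

If either tool is absent or the cluster is unreachable, the script reports that instead of raising.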

Prior to dcgm-exporter 2.1.2, profiling metrics were loaded by default, but these metrics are only available on datacenter GPUs (i.e. 'Tesla'), so the exporter fails on other GPUs with the error above. Starting with 2.1.2, we added better error handling that prevents these errors by skipping metrics that are not supported.
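
To illustrate the idea of skipping unsupported metrics, here is a minimal sketch that drops profiling fields from a metrics list. It is not the exporter's actual implementation. It assumes the metric list is a CSV with lines of the form `field, type, help` and that profiling fields carry the `DCGM_FI_PROF_` prefix; `drop_profiling_fields` and the sample text are hypothetical.

```python
PROFILING_PREFIX = "DCGM_FI_PROF_"  # assumed prefix of DCGM profiling fields


def drop_profiling_fields(csv_text):
    """Return csv_text without lines whose first column is a profiling field."""
    kept = []
    for line in csv_text.splitlines():
        field = line.split(",", 1)[0].strip()
        if field.startswith(PROFILING_PREFIX):
            continue  # skip metrics that non-datacenter GPUs cannot serve
        kept.append(line)  # comments and other fields pass through unchanged
    return "\n".join(kept)


# Hypothetical sample in the assumed "field, type, help" layout.
sample = """\
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active.
"""

print(drop_profiling_fields(sample))
```

On exporter versions older than 2.1.2, a filtered list like this could serve as a workaround on GPUs without profiling support, provided the deployment lets you supply a custom metric list.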