NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter reports stale metrics if nv-hostengine is restarted #188

Open bchess opened 3 years ago

bchess commented 3 years ago

I'm running dcgm-exporter 2.1.8, connected to nv-hostengine via DCGM_REMOTE_HOSTENGINE_INFO=localhost:5555

If nv-hostengine is restarted, dcgm-exporter starts logging the message below every 30 seconds. Meanwhile, it continues to serve the last metrics it collected before the restart, and the /health endpoint still reports that everything is fine.

time="2021-05-12T17:31:12Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"
time="2021-05-12T17:31:42Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"

dcgm-exporter should either crash hard in response to this error, or re-connect to nv-hostengine. It should not continue to report stale metrics.
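
To make the expected behaviour concrete, here is a minimal Go sketch of a guard that stops serving cached metrics and flips the health check once collection has failed a few times in a row. This is not dcgm-exporter's code and it does not use the DCGM bindings: `guard`, `collectFunc`, `errDisconnected`, `maxFailures`, and the sample metric line are all hypothetical names made up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// errDisconnected is a hypothetical stand-in for the
// "Host engine connection invalid/disconnected" error in the logs above.
var errDisconnected = errors.New("host engine connection invalid/disconnected")

// collectFunc is a hypothetical stand-in for whatever reads metrics
// from nv-hostengine and renders them as text.
type collectFunc func() (string, error)

// guard caches the last successful scrape and discards it once
// maxFailures consecutive collection attempts have failed.
type guard struct {
	mu          sync.Mutex
	collect     collectFunc
	maxFailures int
	failures    int
	last        string
}

// poll runs one collection attempt and updates the cached state.
func (g *guard) poll() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	out, err := g.collect()
	if err != nil {
		g.failures++
		if g.failures >= g.maxFailures {
			g.last = "" // drop the cache so stale values are never served
		}
		return err
	}
	g.failures = 0
	g.last = out
	return nil
}

// snapshot returns the cached metrics and whether they are still trustworthy.
func (g *guard) snapshot() (string, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.last, g.failures < g.maxFailures
}

// healthy reports whether the failure threshold has not been reached.
func (g *guard) healthy() bool {
	_, ok := g.snapshot()
	return ok
}

// healthHandler returns 503 instead of 200 once the guard is unhealthy.
func (g *guard) healthHandler(w http.ResponseWriter, _ *http.Request) {
	if !g.healthy() {
		http.Error(w, "collector unhealthy", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

// metricsHandler refuses to serve metrics once the guard is unhealthy.
func (g *guard) metricsHandler(w http.ResponseWriter, _ *http.Request) {
	body, ok := g.snapshot()
	if !ok {
		http.Error(w, "metrics unavailable", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprint(w, body)
}

func main() {
	// Simulate a host engine that answers once and then disappears.
	calls := 0
	g := &guard{maxFailures: 2, collect: func() (string, error) {
		calls++
		if calls == 1 {
			return "example_gpu_metric 40\n", nil // made-up sample line
		}
		return "", errDisconnected
	}}
	for i := 0; i < 3; i++ {
		err := g.poll()
		fmt.Printf("poll %d: err=%v healthy=%t\n", i+1, err, g.healthy())
	}
}
```

In a real exporter the two handlers would be registered with `http.HandleFunc` and `poll` would run on the collection interval. Exiting the process once the threshold is reached, so that a supervisor restarts it, would be an equally reasonable choice; the point is only that /health and /metrics should stop claiming success.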