I'm running dcgm-exporter 2.1.8, connected to nv-hostengine via DCGM_REMOTE_HOSTENGINE_INFO=localhost:5555.
If nv-hostengine is restarted, dcgm-exporter keeps serving stale metrics from the last collection before the restart, and the /health endpoint still reports that everything is fine. Meanwhile, it logs the following error every 30 seconds:
time="2021-05-12T17:31:12Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"
time="2021-05-12T17:31:42Z" level=error msg="Failed to collect metrics with error: Failed to collect metrics with error: Error getting the latest value for fields: Host engine connection invalid/disconnected"
dcgm-exporter should either crash hard in response to this error or reconnect to nv-hostengine. It should not continue to report stale metrics.
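To make the expected behavior concrete, here is a rough sketch of the logic I have in mind. It is written in Python purely for illustration and is not dcgm-exporter's actual code (the exporter is a Go program). Every name in it (`StaleAwareCollector`, `collect`, `connect`, `max_age_s`) is hypothetical. The assumption is that a collection failure surfaces as an exception, that one reconnect attempt is made per failed collection, and that both the metrics and health checks stop reporting success once the last good collection is older than a threshold.

```python
import time


class StaleAwareCollector:
    """Caches the last successful collection and tracks its age (hypothetical helper)."""

    def __init__(self, collect, connect, max_age_s=60.0, clock=time.monotonic):
        self._collect = collect      # callable returning metrics; raises ConnectionError on failure
        self._connect = connect      # callable that re-establishes the host engine connection
        self._max_age_s = max_age_s  # how old cached metrics may get before they count as stale
        self._clock = clock          # injectable clock so the logic can be tested
        self._metrics = None
        self._last_ok = None

    def refresh(self):
        """Collect once; on a connection error, reconnect and retry one time."""
        try:
            metrics = self._collect()
        except ConnectionError:
            try:
                self._connect()
                metrics = self._collect()
            except ConnectionError:
                # Leave the old timestamp untouched so the cache ages out.
                return False
        self._metrics = metrics
        self._last_ok = self._clock()
        return True

    def is_healthy(self):
        """True only if a collection succeeded within the last max_age_s seconds."""
        if self._last_ok is None:
            return False
        return self._clock() - self._last_ok <= self._max_age_s

    def metrics(self):
        """Return cached metrics, or None once they are stale."""
        return self._metrics if self.is_healthy() else None
```

With something along these lines, the /health handler could return a non-200 status whenever `is_healthy()` is false, so an orchestrator can restart the exporter, and the metrics handler could refuse to serve values once `metrics()` returns None. Exiting the process after repeated failures would be an equally acceptable alternative.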