NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Exposed metrics don't follow Prometheus spec #126

Open etherandrius opened 3 years ago

etherandrius commented 3 years ago

The exact part of the spec that is broken: https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting

HELP and TYPE lines should be grouped together with the metric they refer to. For example:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-a6e9292c-35bc-0f18-41b1-b46804c7562e", device="nvidia0",container="",namespace="",pod=""} 300

# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0", UUID="GPU-a6e9292c-35bc-0f18-41b1-b46804c7562e", device="nvidia0",container="",namespace="",pod=""} 0
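
The grouping rule above can be checked mechanically. Below is a rough sketch in Python, not code from dcgm-exporter or from any Prometheus library. The function names `metric_name` and `ungrouped_metrics` are made up for illustration, and the sketch assumes plain gauges or counters (it does not handle the `_bucket`, `_sum`, `_count` suffixes of histograms and summaries).

```python
import re

# Metric name at the start of a sample line (hypothetical helper pattern).
_SAMPLE_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_name(line):
    """Return the metric a line belongs to, or None for blank lines and plain comments."""
    line = line.strip()
    if not line:
        return None
    if line.startswith("#"):
        parts = line.split(None, 3)
        # Only "# HELP <name> ..." and "# TYPE <name> ..." carry a metric name.
        if len(parts) >= 3 and parts[1] in ("HELP", "TYPE"):
            return parts[2]
        return None
    match = _SAMPLE_NAME.match(line)
    return match.group(0) if match else None


def ungrouped_metrics(text):
    """Return names of metrics whose lines do not form one contiguous group."""
    seen = set()
    broken = []
    current = None
    for line in text.splitlines():
        name = metric_name(line)
        if name is None:
            continue
        if name != current:
            # Returning to a metric we already left means its group was split.
            if name in seen and name not in broken:
                broken.append(name)
            seen.add(name)
            current = name
    return broken
```

An empty result means every metric's HELP, TYPE and sample lines are contiguous. A non-empty result lists the metrics whose samples are separated from their metadata by another metric's lines.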

Today dcgm-exporter does not follow the spec. Instead, it groups all HELP and TYPE lines at the start and only later prints the metrics:

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge

DCGM_FI_DEV_FB_USED{gpu="0", UUID="GPU-a6e9292c-35bc-0f18-41b1-b46804c7562e", device="nvidia0",container="",namespace="",pod=""} 0
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-a6e9292c-35bc-0f18-41b1-b46804c7562e", device="nvidia0",container="",namespace="",pod=""} 300

This causes some tools to behave pathologically and lose the TYPE and HELP information. For example, a tool may lose the TYPE of a metric and then drop the metric altogether.
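
Until the exporter is fixed, one possible workaround is to regroup the output before handing it to a strict consumer. The sketch below is only an illustration of that idea, not how dcgm-exporter works or should be patched. `regroup` is a hypothetical name, plain comments are dropped, and histogram/summary suffixes are not handled.

```python
import re


def regroup(text):
    """Reorder exposition text so each metric's HELP/TYPE lines precede its samples."""
    groups = {}  # metric name -> (metadata lines, sample lines); dicts keep insertion order
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            parts = stripped.split(None, 3)
            if len(parts) < 3 or parts[1] not in ("HELP", "TYPE"):
                continue  # plain comment: dropped in this sketch
            name, is_meta = parts[2], True
        else:
            # The metric name ends at the first "{" or whitespace.
            name = re.split(r"[{\s]", stripped, maxsplit=1)[0]
            is_meta = False
        meta, samples = groups.setdefault(name, ([], []))
        (meta if is_meta else samples).append(stripped)

    out = []
    for meta, samples in groups.values():
        out.extend(meta)     # HELP and TYPE first
        out.extend(samples)  # then every sample of that metric
    # The text format expects each line, including the last, to end with a newline.
    return "".join(line + "\n" for line in out)
```

Metrics keep the order in which they first appear, so applying `regroup` to already grouped output only strips blank lines and plain comments.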