NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

Fixed grouping of prometheus metrics #197

Closed: MarcusWichelmann closed this 3 years ago

MarcusWichelmann commented 3 years ago

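This PR fixes the grouping of metrics so that all samples of a metric are emitted as one group, directly after that metric's HELP and TYPE lines, as required by the Prometheus exposition format: https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting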

Sample output:

# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-...",device="nvidia0",Hostname="..."} 49
DCGM_FI_DEV_GPU_TEMP{gpu="1",UUID="GPU-...",device="nvidia0",Hostname="..."} 55
DCGM_FI_DEV_GPU_TEMP{gpu="2",UUID="GPU-...",device="nvidia0",Hostname="..."} 54
DCGM_FI_DEV_GPU_TEMP{gpu="3",UUID="GPU-...",device="nvidia0",Hostname="..."} 54
# HELP DCGM_FI_DEV_POWER_USAGE Power draw (in W).
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-...",device="nvidia0",Hostname="..."} 32.227000
DCGM_FI_DEV_POWER_USAGE{gpu="1",UUID="GPU-...",device="nvidia0",Hostname="..."} 30.400000
DCGM_FI_DEV_POWER_USAGE{gpu="2",UUID="GPU-...",device="nvidia0",Hostname="..."} 31.178000
DCGM_FI_DEV_POWER_USAGE{gpu="3",UUID="GPU-...",device="nvidia0",Hostname="..."} 33.124000
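
For illustration, the grouping rule behind the output above can be sketched in Python. This is not the exporter's code: `render_grouped`, `escape_label_value`, and the input shape (a list of `(name, labels, value)` tuples plus a `metadata` mapping) are hypothetical names chosen for this example, and the interleaved per-GPU input order is an assumption about what an ungrouped stream might look like.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_grouped(samples, metadata):
    """Render samples in the text format, one contiguous group per metric.

    samples:  iterable of (metric_name, labels_dict, value_string)
    metadata: dict mapping metric_name -> (help_text, metric_type)
    """
    # Collect every sample under its metric name. Dicts keep insertion
    # order, so metrics appear in the order they were first seen.
    groups = {}
    for name, labels, value in samples:
        groups.setdefault(name, []).append((labels, value))

    lines = []
    for name, series in groups.items():
        help_text, metric_type = metadata[name]
        # HELP and TYPE come first, then all samples of this metric.
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in series:
            if labels:
                label_str = ",".join(
                    f'{key}="{escape_label_value(val)}"'
                    for key, val in labels.items()
                )
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical input: samples arrive interleaved, one GPU at a time.
    samples = [
        ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0"}, "49"),
        ("DCGM_FI_DEV_POWER_USAGE", {"gpu": "0"}, "32.227000"),
        ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "1"}, "55"),
        ("DCGM_FI_DEV_POWER_USAGE", {"gpu": "1"}, "30.400000"),
    ]
    metadata = {
        "DCGM_FI_DEV_GPU_TEMP": ("GPU temperature (in C).", "gauge"),
        "DCGM_FI_DEV_POWER_USAGE": ("Power draw (in W).", "gauge"),
    }
    print(render_grouped(samples, metadata), end="")
```

Grouping by metric name before rendering means the order in which per-GPU samples are collected no longer affects the layout of the output. Values are passed as strings here so that formatting such as `32.227000` is preserved as-is.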

cloud-native-bot[bot] commented 3 years ago

Hello There!

Thanks for your contribution! This repository is a read-only mirror of the following repository: https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools

Do you mind:

Thanks again!

MarcusWichelmann commented 3 years ago

Okay, see https://gitlab.com/nvidia/container-toolkit/gpu-monitoring-tools/-/merge_requests/77