NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

exported_pod causes issues with queries -> every sample is a different metric #340

Open amir-bialek opened 3 weeks ago

amir-bialek commented 3 weeks ago

Ask your question

Running dcgm-exporter on Kubernetes, installed via the Helm chart with default values. The cluster has 1 master and 1 worker; only the worker has a GPU exposed as a resource.

Running a simple query: DCGM_FI_DEV_GPU_TEMP

Return:

DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="ai-artifactory-control", exported_namespace="default", exported_pod="pod1", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="somecontainer", exported_namespace="default", exported_pod="pod2", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod3", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
DCGM_FI_DEV_GPU_TEMP{DCGM_FI_DRIVER_VERSION="545.29.06", Hostname="worker1", UUID="GPU-longgui", container="exporter", device="nvidia0", endpoint="metrics", exported_container="runner", exported_namespace="somenamespace", exported_pod="pod4", gpu="0", instance="someip:9400", job="dcgm-exporter-dev", modelName="NVIDIA GeForce RTX 4090", namespace="monitoring", pod="dcgm-exporter-dev-78z52", service="dcgm-exporter-dev"}
...

However, since there is only one GPU, I would like to receive only one result. To explain better, setting up the dashboard in Grafana gives me:

[screenshot: Grafana panel showing multiple series for the single GPU]

And what we would like to get is:

[screenshot: desired Grafana panel showing a single series for the GPU]
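For the Grafana panel, one way to collapse these into a single series per physical GPU is to aggregate away the pod-level labels in PromQL. A minimal sketch, using the label names from the query output above:

# One series per physical GPU; max (or avg) works here because every
# pod-level series reports the same temperature for the same GPU.
max by (Hostname, gpu, UUID, modelName) (DCGM_FI_DEV_GPU_TEMP)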

nvvfedorov commented 3 weeks ago

@amir-bialek, labels with the "exported_" prefix carry values that come from the DCGM-exporter itself; Prometheus adds the prefix when they clash with its own target labels. From the metric values that you shared, I see:

  1. Kubernetes mode is enabled, so the DCGM exporter maps GPU metrics to pods. That is why you see exported_container, exported_namespace, and exported_pod. Regarding the "exported_" prefix, see https://prometheus.io/docs/prometheus/latest/configuration/configuration/ and the example selector after this comment.

  2. GPU=0 was used by the following containers: ai-artifactory-control, somecontainer, and two instances of the runner container.

This behavior is expected for the DCGM-exporter with Kubernetes mode enabled.
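To make the label semantics concrete: pod, namespace, and container identify the dcgm-exporter pod that Prometheus scraped, while the exported_* labels identify the workload that the exporter mapped to the GPU. The prefix itself is governed by the honor_labels scrape option (honorLabels on a ServiceMonitor); changing it only renames labels and does not reduce the number of series. A sketch of a selector that uses this distinction, with values taken from the output above:

# Temperature series reported by one exporter pod, restricted to samples that
# were attributed to a workload pod:
DCGM_FI_DEV_GPU_TEMP{pod="dcgm-exporter-dev-78z52", exported_pod!=""}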

amir-bialek commented 3 weeks ago

Hey @nvvfedorov , thank you for the answer.

So if I have several pods sharing the same GPU via time slicing, how can I solve this issue?

nvvfedorov commented 3 weeks ago

Today, time-slicing is not supported by DCGM and DCGM-exporter. However, if you run several containers that each use the same GPU, you will see multiple metric series associated with that GPU.
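When several pods do share one GPU, the per-pod series can still be useful for attribution rather than being collapsed. A sketch that lists which workload pods are currently mapped to each GPU, again assuming the label names shown earlier:

# One series per (GPU, workload pod) pair currently reported by the exporter:
count by (Hostname, gpu, UUID, exported_namespace, exported_pod) (DCGM_FI_DEV_GPU_TEMP)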