amir-bialek opened this issue 3 weeks ago
@amir-bialek, labels with the "exported_" prefix come from the DCGM-exporter. From the metric values you shared with us, I see:
Kubernetes mode is enabled, so the DCGM-exporter maps GPU metrics to pods. That is why we see exported_container, exported_namespace, and exported_pod. Regarding the "exported_" prefix, please refer to the Prometheus scrape configuration documentation: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
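For context on where the prefix comes from: Prometheus adds "exported_" when a scraped metric already carries a label (such as pod or namespace) that conflicts with the labels Prometheus itself attaches to the target, and honor_labels is false (the default). A minimal scrape-config sketch; the job name and target address are assumptions for illustration, not taken from this thread:

```yaml
scrape_configs:
  - job_name: dcgm-exporter            # hypothetical job name
    # With honor_labels: false (the default), conflicting scraped labels are
    # renamed to exported_<label>, e.g. exported_pod, exported_namespace.
    # Setting honor_labels: true keeps the exporter's own pod/namespace labels.
    honor_labels: true
    static_configs:
      - targets: ["dcgm-exporter:9400"]   # assumed service address and port
```

Whether you want honor_labels: true depends on whether the exporter's pod labels or Prometheus's target labels are the ones you trust.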
GPU=0 was used by the following containers: ai-artifactory-control, somecontainer, and two instances of the runner container.
This behavior is expected for the DCGM-exporter with Kubernetes mode enabled.
Hey @nvvfedorov, thank you for the answer.
So if I have several pods sharing the same GPU via time slicing, how can I solve this issue?
Today, time-slicing is not supported by DCGM or the DCGM-exporter. However, if you run several containers that each use the same GPU, you will see multiple metric series associated with that GPU.
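As a workaround, the duplicate series for a shared GPU can be collapsed at query time by aggregating over the per-GPU labels. A sketch in PromQL; the label names gpu and UUID are the ones the DCGM-exporter typically emits, but verify them against your own series before relying on this:

```promql
# One series per physical GPU: take the max across all pod/container series.
max by (gpu, UUID) (DCGM_FI_DEV_GPU_TEMP)
```

For a temperature gauge, max (or avg) over the pod-level series gives the same value per GPU, since every pod's series reports the same physical sensor.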
Ask your question
Running dcgm-exporter on Kubernetes, installed via the Helm chart with default values. The cluster has one master and one worker; only the worker has a GPU exposed as a resource.
Running a simple query:
DCGM_FI_DEV_GPU_TEMP
Returns:
However, since there is only one GPU, I would like to receive only one result. To explain better, setting up the dashboard in Grafana gives me:
And what we would like to get is: