NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

DCGM exporter is not able to monitor per-Pod GPU utilization #201

Closed: kishor-3 closed this issue 8 months ago

larry-lu-lu commented 1 year ago

I can reproduce the issue when I use the time-slicing feature of k8s-device-plugin. Is that the same situation for you?

kishor-3 commented 1 year ago

@larry-lu-lu I don't know what you are referring to, but in my case I deployed dcgm-exporter to scrape GPU metrics, and it is only able to scrape node-level metrics, not metrics for the pods that are using the GPU on that node.
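For per-pod labels to appear at all, dcgm-exporter needs access to the kubelet pod-resources API. Below is a minimal sketch of the relevant DaemonSet pieces, assuming the default kubelet path (the official Helm chart wires this up automatically):

# Sketch only: the fragments of a dcgm-exporter DaemonSet spec that enable
# per-pod attribution. With DCGM_EXPORTER_KUBERNETES=true the exporter asks
# the kubelet pod-resources API which pod owns each GPU and attaches
# pod, namespace, and container labels to the metrics it exports.
containers:
- name: dcgm-exporter
  env:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  volumeMounts:
  - name: pod-resources
    mountPath: /var/lib/kubelet/pod-resources
volumes:
- name: pod-resources
  hostPath:
    path: /var/lib/kubelet/pod-resources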

larry-lu-lu commented 1 year ago

@kishor-3

In my case, I virtualized 3 vGPU instances by configuring nvidia/k8s-device-plugin, like below:
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 3

I added logging to the dcgm-exporter source code, rebuilt, and redeployed it, and was surprised to find that DCGM only collects information for the physical GPU: the collected metrics record the physical GPU's UUID, while the GPU ID assigned to each pod is that UUID with a replica number appended. For example, if the UUID of a physical GPU is ABCD, the GPU ID assigned to a pod is ABCD-0 or ABCD-1. As a result, the metrics are never matched with the pod.
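To make the mismatch concrete, here is an illustrative sketch reusing the placeholder UUID from the example above (real UUIDs look like GPU-<hex>):

# Illustrative only, reusing the placeholder IDs from the comment above.
dcgm_reported_uuid: "ABCD"   # DCGM only sees the physical GPU
kubelet_pod_resources:
  pod-a: ["ABCD-0"]          # device plugin appends a replica index per time-slice
  pod-b: ["ABCD-1"]
# "ABCD-0" != "ABCD", so the exporter's pod lookup never matches.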

GalaSlE commented 1 year ago

Try configuring this in the Helm values.yaml:

dcgmExporter:
  relabelings:
  - sourceLabels: [__meta_kubernetes_pod_node_name]
    separator: ;
    regex: ^(.*)$
    targetLabel: nodename
    replacement: $1
    action: replace

This copies the node name parsed from the kube API into a nodename label on the scraped series, so the metrics can be matched to the node.
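For reference, a series scraped with that relabeling in place would carry the extra label, roughly like this (DCGM_FI_DEV_GPU_UTIL is a standard dcgm-exporter gauge; the label values here are placeholders):

DCGM_FI_DEV_GPU_UTIL{gpu="0", UUID="GPU-...", nodename="worker-1"} 42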