@larry-lu-lu I don't know what you are talking about, but in my case I deployed dcgm-exporter to scrape GPU metrics, and it is only able to scrape the metrics of the node, not the metrics of the pods that are using the GPU on that node.
@kishor-3
In my case, I virtualized 3 vGPU instances by configuring the nvidia/k8s-device-plugin, like below:
```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 3
```
I added logging to the dcgm-exporter source code, rebuilt it, and redeployed it, and I was surprised to find that DCGM only collects information for the physical GPU: the collected metrics record the UUID of the physical GPU, while the GPU ID assigned to each pod is the corresponding physical GPU's UUID with a replica index appended. For example, if the UUID of a physical GPU is ABCD, the GPU ID assigned to a pod is ABCD-0 or ABCD-1. As a result, the metrics can never be matched to the pod.
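To illustrate the mismatch, here is a minimal Go sketch (not dcgm-exporter's actual code; `normalizeDeviceID` is a hypothetical helper) that strips the replica index from a pod's device ID so it can be compared with the physical GPU UUID that DCGM reports:

```go
package main

import (
	"fmt"
	"regexp"
)

// replicaSuffix matches a trailing "-<index>" of the kind the time-slicing
// feature appends to the physical GPU UUID for each vGPU replica.
var replicaSuffix = regexp.MustCompile(`-\d+$`)

// normalizeDeviceID is a hypothetical helper that drops the replica index so
// the pod's device ID can be matched against the UUID in the DCGM metrics.
// Note: real GPU UUIDs contain dashes themselves, so a production version
// would need a stricter check than this sketch.
func normalizeDeviceID(podDeviceID string) string {
	return replicaSuffix.ReplaceAllString(podDeviceID, "")
}

func main() {
	physicalUUID := "ABCD" // UUID recorded in the DCGM metrics (example above)

	// Device IDs assigned to pods under time slicing with replicas: 3.
	for _, id := range []string{"ABCD-0", "ABCD-1", "ABCD-2"} {
		fmt.Printf("%s -> %s (matches DCGM UUID: %v)\n",
			id, normalizeDeviceID(id), normalizeDeviceID(id) == physicalUUID)
	}
}
```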
Try configuring the Helm values.yaml:
```yaml
dcgmExporter:
  relabelings:
    - sourceLabels: [__meta_kubernetes_pod_node_name]
      separator: ;
      regex: ^(.*)$
      targetLabel: nodename
      replacement: $1
      action: replace
```
This modifies the field name parsed from the Kubernetes API data so that it can be recognized.
I can reproduce the issue when I use the time-slicing feature of the k8s-device-plugin. Is that the same situation for you?