Closed doronkg closed 4 months ago
👀 following
Would like to see this implemented 👀
@doronkg, please submit this issue to the DCGM repository.
Thanks, I submitted it here: https://github.com/NVIDIA/DCGM/issues/175 (closing this one).
Ask your question
Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to create a PromQL aggregation that correlates GPU process IDs (PIDs) with K8s Pods.
From the exported DCGM metrics, I saw no metric with a label representing GPU PID. In the DCGM release notes, the following is mentioned:
My question: is there a way to retrieve this info in the current version? Let me know if I should submit this issue to the DCGM GitHub repo instead.
The workaround I've implemented for the time being is running a custom DaemonSet on all GPU nodes, running the following command to correlate GPU PID and Pod UID, and using this as a custom metric:
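The exact command is not preserved in this thread, so the following is only a rough sketch of one way such a correlation could be done, not necessarily what the DaemonSet above runs. It assumes the container shares the host PID namespace (so the PIDs reported by `nvidia-smi` resolve under `/proc`) and that the Pod UID appears in the process's cgroup path, which is the usual kubelet layout but can vary with the cgroup driver. The function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Pod UIDs show up in cgroup paths as "pod<uuid>"; the systemd cgroup driver
# writes the UUID with underscores, the cgroupfs driver with dashes.
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})",
    re.IGNORECASE,
)


def pod_uid_from_cgroup(cgroup_text):
    """Return the Pod UID found in the contents of /proc/<pid>/cgroup, or None."""
    match = _POD_UID_RE.search(cgroup_text)
    if match is None:
        return None
    return match.group(1).replace("_", "-").lower()


def parse_compute_apps(csv_text):
    """Parse 'pid, gpu_uuid' CSV rows into (pid, gpu_uuid) tuples, skipping bad rows."""
    pairs = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and fields[0].isdigit():
            pairs.append((int(fields[0]), fields[1]))
    return pairs


def gpu_pid_to_pod_uid(proc_root="/proc"):
    """Map each GPU compute PID to (gpu_uuid, pod_uid); pod_uid is None if not in a Pod."""
    output = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    result = {}
    for pid, gpu_uuid in parse_compute_apps(output):
        try:
            cgroup_text = Path(proc_root, str(pid), "cgroup").read_text()
        except OSError:
            continue  # process exited, or its /proc entry is not visible here
        result[pid] = (gpu_uuid, pod_uid_from_cgroup(cgroup_text))
    return result
```

The resulting mapping could then be exported as a custom metric with `pid`, `gpu_uuid`, and `pod_uid` labels and joined against the DCGM metrics in PromQL; how it is exported is left open here.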
Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7