Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to write a PromQL aggregation that correlates GPU PIDs (process IDs) with Kubernetes Pods.
None of the exported DCGM metrics has a label carrying the GPU PID.
The DCGM release notes mention the following:
The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...
My question: is there a way to retrieve this information in the current version?
I originally submitted this issue to the DCGM Exporter GitHub repo.
As a workaround for now, I run a custom DaemonSet on all GPU nodes that executes the following command to correlate GPU PIDs with Pod UIDs, and I expose the result as a custom metric:
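For illustration only (this is not the command referred to above), here is a minimal Python sketch of one way such a correlation could be done. It assumes the DaemonSet pod runs with hostPID so that host PIDs are visible under /proc, that nvidia-smi is available in the container, and that the Pod UID appears in the process's cgroup path in the usual kubepods form. The function names are hypothetical.

```python
import re
import subprocess

# Assumption: the cgroup path contains "pod<uid>", with the UID written using
# dashes (cgroupfs driver) or underscores (systemd driver, e.g. with CRI-O).
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text):
    """Return the Pod UID found in the contents of /proc/<pid>/cgroup, or None."""
    match = _POD_UID_RE.search(cgroup_text)
    if match is None:
        return None
    return match.group(1).replace("_", "-")


def parse_compute_apps(csv_text):
    """Parse 'pid, gpu_uuid' lines from nvidia-smi CSV output into tuples."""
    pairs = []
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and fields[0].isdigit():
            pairs.append((int(fields[0]), fields[1]))
    return pairs


def gpu_pid_to_pod_uid():
    """Map each GPU compute PID to (gpu_uuid, pod_uid) on the current node."""
    output = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,gpu_uuid",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = {}
    for pid, gpu_uuid in parse_compute_apps(output):
        try:
            with open(f"/proc/{pid}/cgroup") as handle:
                result[pid] = (gpu_uuid, pod_uid_from_cgroup(handle.read()))
        except FileNotFoundError:
            # The process exited between the nvidia-smi call and the read.
            continue
    return result
```

The resulting mapping could then be exported as a custom metric with pid, gpu_uuid, and pod_uid labels and joined against other Pod-level metrics in PromQL on the Pod UID.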
Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7