NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Cannot Retrieve GPU PIDs from DCGM Metrics #175

Open doronkg opened 3 months ago

doronkg commented 3 months ago

Ask your question

Hi, I'm using the NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and I'm trying to create a PromQL aggregation that correlates GPU PIDs (process IDs) with K8s Pods.

Among the exported DCGM metrics, I found no metric with a label that carries the GPU PID. The DCGM release notes mention the following:

The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
    DCGM_FI_DEV_GRAPHICS_PIDS
    DCGM_FI_DEV_COMPUTE_PIDS
    ...

My question: is there a way to retrieve this information in the current version? I originally submitted this issue to the DCGM Exporter GitHub repo.

The workaround I've implemented for the time being is a custom DaemonSet on all GPU nodes. It runs the following command to correlate each GPU PID with its Pod UID, and I expose the result as a custom metric:

$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'" 
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f

NOTE: This requires setting hostPID: true in the Pod spec, so that /proc inside the container exposes the host's processes.
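For anyone who wants to reuse the workaround, here is a minimal sketch of the same idea split into two shell functions, printing the result in Prometheus text exposition format. The function names and the metric name gpu_process_pod_info are hypothetical, not something DCGM or the exporter provides. The sketch assumes the Pod UID appears in /proc/<pid>/cgroup after a "pod" prefix, with either underscores (systemd cgroup driver) or dashes (cgroupfs driver) as separators, and that the container shares the host PID namespace.

```shell
#!/bin/sh
# Print the Pod UID found in a cgroup file, or nothing if none is present.
# Accepts pod<uid> with underscores (systemd driver) or dashes (cgroupfs driver).
pod_uid_from_cgroup() {
    grep -oE 'pod[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' "$1" 2>/dev/null \
        | head -n1 \
        | sed -e 's/^pod//' -e 's/_/-/g'
}

# Emit one sample per GPU compute process.
# gpu_process_pod_info is a hypothetical metric name chosen for this sketch.
collect_gpu_pids() {
    nvidia-smi --query-compute-apps=pid --format=csv,noheader | while read -r pid; do
        uid=$(pod_uid_from_cgroup "/proc/${pid}/cgroup")
        if [ -n "$uid" ]; then
            printf 'gpu_process_pod_info{pid="%s",pod_uid="%s"} 1\n' "$pid" "$uid"
        fi
    done
}
```

Calling collect_gpu_pids on a GPU node prints one line per compute process that belongs to a Pod; processes running outside a Pod are skipped. How the output is scraped (for example, written to a file for a textfile-style collector) is left open here.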

Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7