NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Cannot Retrieve GPU PIDs from DCGM Metrics #347

Closed doronkg closed 4 months ago

doronkg commented 4 months ago

Ask your question

Hi, I'm using NVIDIA GPU Operator to expose GPUs on my OpenShift cluster, and trying to create a PromQL aggregation to correlate GPU PIDs (Process ID) to K8s Pods.

From the exported DCGM metrics, I saw no metric with a label representing GPU PID. In the DCGM release notes, the following is mentioned:

The following features have been dropped or deprecated starting with DCGM 3.0:
The following field identifiers have been removed:
DCGM_FI_DEV_GRAPHICS_PIDS
DCGM_FI_DEV_COMPUTE_PIDS
...

My question - is there a way to retrieve this info in the current version? Let me know if I should submit this issue to the DCGM GitHub repo instead.

The workaround I've implemented for the time being is a custom DaemonSet on all GPU nodes. It runs the following command to correlate each GPU PID with its Pod UID, and the result is exposed as a custom metric:

$ nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -I{} sh -c "echo -n '{}'; echo -n ','; grep -oPm1 '[0-9a-f]{8}(_[0-9a-f]{4}){3}_[0-9a-f]{12}' /proc/{}/cgroup | sed 's/_/-/g'" 
114855,c8b8d8a2-5e73-4c1a-b8e3-735e8a4e56d3
115044,1f7d9c8e-4a4b-455b-9b0d-9a2d1f4e6c2f

NOTE: This requires setting hostPID: true in the Pod spec, so that the DaemonSet container can read the host's /proc entries for the GPU processes.
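
For readers who want the same workaround in script form, here is a minimal sketch of the one-liner above split into two shell functions. The function names pod_uid_from_cgroup and gpu_pid_to_pod_uid are hypothetical, chosen only for illustration. It assumes GNU grep and sed, nvidia-smi on the PATH, and a container running with hostPID: true. As an assumption beyond the original command, the pattern also accepts Pod UIDs written with dashes, which is how they appear when the cgroupfs driver is used instead of systemd.

```shell
#!/bin/sh
# Print the first Kubernetes Pod UID found in cgroup text read from stdin.
# Accepts the systemd layout (pod<uid with underscores>) and, as an
# assumption, the cgroupfs layout (pod<uid with dashes>). Output uses dashes.
pod_uid_from_cgroup() {
    grep -oE 'pod[0-9a-f]{8}([_-][0-9a-f]{4}){3}[_-][0-9a-f]{12}' \
        | head -n 1 \
        | sed -e 's/^pod//' -e 's/_/-/g'
}

# Emit one "pid,pod_uid" line per GPU compute process reported by nvidia-smi.
# Requires the host PID namespace so that /proc/<pid>/cgroup is readable.
gpu_pid_to_pod_uid() {
    nvidia-smi --query-compute-apps=pid --format=csv,noheader |
    while read -r pid; do
        # Skip processes that have exited or are not visible in /proc.
        [ -r "/proc/$pid/cgroup" ] || continue
        uid=$(pod_uid_from_cgroup < "/proc/$pid/cgroup")
        # Skip GPU processes that do not belong to a Pod.
        [ -n "$uid" ] && printf '%s,%s\n' "$pid" "$uid"
    done
}
```

Calling gpu_pid_to_pod_uid on a GPU node should print lines in the same pid,uid form as the output shown above. Anchoring the pattern on the pod prefix is a design choice that keeps it from matching other hex strings in the cgroup path.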

Versions:
OpenShift: v4.12.35
Kubernetes: v1.25.12+ba5cc25
NVIDIA GPU Operator: v23.3.2
DCGM Exporter: v3.1.7

dpointk commented 4 months ago

👀 following

Lynnery commented 4 months ago

Would like to see this implemented 👀

nvvfedorov commented 4 months ago

@doronkg, please submit this issue to the DCGM repository.

doronkg commented 4 months ago

> @doronkg, please submit this issue to the DCGM repository.

Thanks, submitted here: https://github.com/NVIDIA/DCGM/issues/175
I am closing this one.