NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Not able to view GPU utilization metrics in OpenShift dashboard #1002

Open umeshvw opened 4 hours ago

umeshvw commented 4 hours ago

Environment:

OpenShift version: 4.16.10

NVIDIA GPU Operator version: 24.6.1

Hello Team,

We are facing the following issues:

Issue 1:

In the Administrator perspective, we are not able to view a few important metrics in the NVIDIA DCGM Exporter Dashboard, such as:

1. GPU utilization
2. GPU Framebuffer Mem Used
3. Tensor Core Utilization

We can see a few metrics, such as GPU temperature, but the metrics listed above are much more important to us.

Issue 2:

In the Developer perspective, we are not able to see any metrics in the NVIDIA DCGM Exporter Dashboard. We can see a few metrics in the Administrator perspective, but none in the Developer perspective. Is there a way to monitor GPU utilization per namespace as well, so that application teams can monitor GPU utilization in their own namespaces?

Issue 3:

In the Compute > GPU section, we are not able to see any real-time utilization data. The GPU utilization metric always shows 0%.

I am attaching screenshots for all the issues.

umeshvw commented 3 hours ago

(Three screenshots attached.)