Environment:
OpenShift version: 4.16.10
nvidia-operator version: 24.6.1
Hello Team,
We are facing the issues below.
Issue 1: In administrator space
We are not able to view a few important metrics in the NVIDIA DCGM Exporter Dashboard, such as:
1: GPU utilization
2: GPU Framebuffer Mem Used
3: Tensor Core Utilization
We are able to view a few metrics, such as GPU temperature, but the metrics above are much more important for us.
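To help narrow down Issue 1, here is a minimal sketch of how the exporter's output could be checked for the missing metrics. It assumes the three panels are backed by the DCGM fields `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_FB_USED` and `DCGM_FI_PROF_PIPE_TENSOR_ACTIVE` (we have not confirmed this against the dashboard definition), and that the text comes from the exporter's metrics endpoint in Prometheus text format. The helper names and the sample values are made up for illustration.

```python
# Assumption: these are the metric names the three empty panels query.
REQUIRED_METRICS = (
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_FB_USED",
    "DCGM_FI_PROF_PIPE_TENSOR_ACTIVE",
)


def exposed_metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like NAME{labels} VALUE or NAME VALUE.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text, required=REQUIRED_METRICS):
    """List the required metrics that the exporter output does not contain."""
    present = exposed_metric_names(exposition_text)
    return [metric for metric in required if metric not in present]


if __name__ == "__main__":
    # Hypothetical sample, not taken from our cluster.
    sample = (
        "# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).\n"
        "# TYPE DCGM_FI_DEV_GPU_TEMP gauge\n"
        'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41\n'
        'DCGM_FI_DEV_FB_USED{gpu="0"} 0\n'
    )
    print(missing_metrics(sample))
```

If a metric is reported as missing here, the exporter is not collecting it. If it is present but the panel is still empty, the problem is more likely in how the dashboard or the monitoring stack queries it.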
Issue 2: In developer space
We are not able to see any metrics in the NVIDIA DCGM Exporter Dashboard. We can see a few metrics in administrator space, but none in developer space. Is there any way to monitor GPU utilization per namespace as well, so that application teams can monitor GPU utilization in their own namespaces themselves?
Issue 3: In the Compute > GPU section, we are not able to see any real-time utilization data. The GPU utilization metrics always show 0%.
I am attaching screenshots for all the issues.