NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter missing metrics for A100 GPU #166

Open anaconda2196 opened 3 years ago

anaconda2196 commented 3 years ago

GPU Machine: A100-PCIE-40GB. [gpu-monitoring-tools-2.3.1]

I am using the latest release of dcgm-exporter (2.1.4-2.3.1-ubuntu18.04).

kubectl get pods -A
NAMESPACE              NAME                                                              READY   STATUS    RESTARTS   AGE
default                dcgm-exporter-1615787551-qc8dm                                    1/1     Running   0          86s

While running queries in Prometheus, I found that a few metrics are missing: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL.

I do see them enabled in default-counters.csv inside my running pod, though. Is this a bug, or are these metrics not supported on the A100 GPU?
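A quick way to confirm which of these metrics are actually absent is to scan the exporter's Prometheus text output directly, before Prometheus is involved. The sketch below is only illustrative: it assumes the exporter's metrics output has been saved as plain text, the helper names `metric_names` and `missing_metrics` are hypothetical, and the sample text (including its label and values) is made up for the example rather than taken from a real pod.

```python
# Metrics reported missing in this issue.
EXPECTED = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_ENC_UTIL",
    "DCGM_FI_DEV_DEC_UTIL",
]


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the text."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Made-up sample output, for illustration only.
sample = """# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 35
DCGM_FI_DEV_ENC_UTIL{gpu="0"} 0
"""

print(missing_metrics(sample))
```

If a metric is missing here as well, the problem is on the exporter side rather than in the Prometheus scrape configuration or query.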

I have checked other GPU machines (4 Tesla, V100); everything looks good there and I am able to get all metrics.

Thank you in advance.

crinavar commented 3 years ago

Hi Anaconda, the metrics are working here on a DGX A100 we have. By chance, did you subdivide the GPUs into MIG devices? Some metrics are currently not detected for MIG GPUs.

dualvtable commented 3 years ago

Hi guys - yes, we are working on adding MIG support into dcgm-exporter so we can do metric attribution to MIG devices. We hope to make a release in the next couple of weeks.

supertetelman commented 3 years ago

Any update on this? I am also not seeing DCGM_FI_DEV_GPU_UTIL show up on the latest dcgm-exporter release. I am seeing this on DGX Stations with V100 and A100.

I am however seeing the other three metrics mentioned here.

This is running with version nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04.

jaimehrubiks commented 3 years ago

Same issue here with the latest versions and all types of GPUs. I just tried a previous version, "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04", and got the GPU_UTIL metric back on all servers.

jfolz commented 3 years ago

Is this maybe related to #143? I.e., these metrics were turned off by default a while ago.
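To check whether a metric is simply disabled in the counters file, the file can be split into active and commented-out fields. This is a rough sketch under assumptions: it assumes each entry in the CSV starts with the DCGM field name followed by a comma, and that disabled entries are prefixed with `#`. The helper name `split_counters` is hypothetical and the sample below is invented for illustration, not the contents of any released default-counters.csv.

```python
def split_counters(csv_text):
    """Split counters-file text into (enabled, disabled) DCGM field names."""
    enabled, disabled = [], []
    for raw in csv_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        commented = line.startswith("#")
        # First comma-separated column, with any leading '#' removed.
        field = line.lstrip("#").strip().split(",", 1)[0].strip()
        # Ignore ordinary comments that do not name a DCGM field.
        if not field.startswith("DCGM_FI_"):
            continue
        (disabled if commented else enabled).append(field)
    return enabled, disabled


# Invented sample, for illustration only.
sample_csv = """# Format
# DCGM FIELD, Prometheus metric type, help message

# Utilization
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
"""

enabled, disabled = split_counters(sample_csv)
print("enabled:", enabled)
print("disabled:", disabled)
```

Running this against the file copied out of the pod would show whether DCGM_FI_DEV_GPU_UTIL is listed as disabled in the image being used.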