Open anaconda2196 opened 3 years ago
Hi Anaconda, the metrics are working here on a DGX A100 we have. By chance, did you subdivide the GPUs into MIG devices? MIG GPUs are currently not detected for some metrics.
Hi guys - yes, we are working on adding MIG support into dcgm-exporter so we can do metric attribution to MIG devices. We hope to make a release in the next couple of weeks.
Any update on this? I am also not seeing DCGM_FI_DEV_GPU_UTIL show up on the latest dcgm-exporter release. I am seeing this on DGX Stations with V100 and A100.
I am, however, seeing the other three metrics mentioned here.
This is running with version nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04.
Same issue here with the latest versions and all types of GPUs. I just tried a previous version, "nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu20.04", and got the GPU_UTIL metric back on all servers.
Is this maybe related to #143? I.e., these metrics were turned off by default a while ago.
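One quick way to check whether a metric is actually switched on is to list the fields that are not commented out in the counters file. The snippet below is only a rough sketch: it assumes the file is plain CSV with the field name in the first column and that lines starting with `#` are disabled, and the `enabled_fields` helper and the sample text are made up for illustration, not taken from the exporter.

```python
def enabled_fields(csv_text):
    """Return field names from counters-style CSV text, skipping '#' lines.

    Assumption: first column is the DCGM field name, '#' marks a comment.
    """
    fields = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out (disabled) entry
        fields.append(line.split(",", 1)[0].strip())
    return fields


# Hypothetical sample content, not copied from a real default-counters.csv.
sample_csv = """\
# Format: field name, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
"""

print(enabled_fields(sample_csv))
```

With the sample above, DCGM_FI_DEV_GPU_UTIL would not be listed because its line is commented out. In practice you would pass in the contents of the CSV copied from the running pod.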
GPU Machine: A100-PCIE-40GB. [gpu-monitoring-tools-2.3.1]
I am using the latest release of dcgm-exporter (2.1.4-2.3.1-ubuntu18.04).
When running queries in Prometheus, I found that a few metrics are missing: DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL, DCGM_FI_DEV_ENC_UTIL, DCGM_FI_DEV_DEC_UTIL.
I do see them enabled in default-counters.csv inside my running pod, though. Is this a bug, or are these metrics not supported on the A100 GPU?
I have checked other GPU machines (4 Tesla, V100), and everything looks good there: I am able to get all metrics.
Thank you in advance.
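For anyone who wants to confirm which of these metrics are missing without going through Prometheus, here is a minimal sketch that scans the exporter's text output for the four metric names from this report. It assumes the output is in the standard Prometheus text exposition format; the `metric_names` and `missing_metrics` helpers and the sample text are made up for illustration.

```python
import re

# The four metrics reported missing in this issue.
EXPECTED = [
    "DCGM_FI_DEV_GPU_UTIL",
    "DCGM_FI_DEV_MEM_COPY_UTIL",
    "DCGM_FI_DEV_ENC_UTIL",
    "DCGM_FI_DEV_DEC_UTIL",
]


def metric_names(exposition_text):
    """Collect metric names from sample lines, ignoring # HELP / # TYPE lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample in the text."""
    present = metric_names(exposition_text)
    return [name for name in expected if name not in present]


# Hypothetical scrape output, for illustration only.
sample = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 35
DCGM_FI_DEV_ENC_UTIL{gpu="0"} 0
"""

print(missing_metrics(sample))
```

To use it against a real exporter, save the response from its metrics endpoint to a file and pass the file contents to `missing_metrics`.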