NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0
1.02k stars 301 forks

dcgm exporter doesn't monitor mig disabled gpus with mixed strategy #211

Open chloejiwon opened 3 years ago

chloejiwon commented 3 years ago

Hi, I'm trying to use dcgm-exporter to monitor GPU utilization on MIG devices. The MIG mixed strategy is set on k8s, and the GPU configuration is shown below.

Can I get metric values for both MIG-disabled and MIG-enabled GPUs?

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |   4846MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      2MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   4  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   6  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    7    0     585962      C   python                           4839MiB |
+-----------------------------------------------------------------------------+


- In Prometheus, when I search for the `DCGM_FI_PROF_GR_ENGINE_ACTIVE` metric, these outputs are shown:
![image](https://user-images.githubusercontent.com/17642294/132439307-3c5dff28-6af1-4d82-a437-ffab6ffdad5a.png)
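
To see at a glance which series belong to MIG instances and which to whole GPUs, the exporter's text output can be split by label. The snippet below is only a rough sketch: it assumes MIG series carry a `GPU_I_ID` label (check the actual label names in your exporter's `/metrics` output), and the sample lines, values, and the `split_by_mig` helper are all made up for illustration. The parser is deliberately simplified and skips samples that have no label set.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text exposition format.
# Label names and values are assumptions, not real exporter output.
SAMPLE = """\
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="7",GPU_I_PROFILE="1g.5gb"} 0.93
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_ID="8",GPU_I_PROFILE="1g.5gb"} 0.0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="1"} 0.41
"""

# metric_name{label="value",...} sample_value
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def split_by_mig(text, metric, mig_label="GPU_I_ID"):
    """Group samples of `metric` into 'mig' and 'full_gpu' by label presence."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        # Comment lines, blank lines, and other metrics are skipped.
        if not match or match.group(1) != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        kind = "mig" if labels.get(mig_label) else "full_gpu"
        groups[kind].append((labels, float(match.group(3))))
    return dict(groups)


result = split_by_mig(SAMPLE, "DCGM_FI_PROF_GR_ENGINE_ACTIVE")
for kind, samples in sorted(result.items()):
    for labels, value in samples:
        print(kind, labels.get("gpu"), labels.get("GPU_I_ID"), value)
```

If the `full_gpu` group comes back empty for a node that has a MIG-disabled GPU, that matches the behavior described in this issue.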