NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux

dcgm-exporter doesn't see GPU processes and GPU memory usage #209

Open lev-stas opened 3 years ago

lev-stas commented 3 years ago

Hi, I'm trying to set up GPU monitoring via Grafana/Prometheus. I have a standalone server with two GPUs and use dcgm-exporter in a Docker container as the metrics exporter. I start the container in privileged mode with the command docker run -d -e --priveleged -v /home/dockeradm/nvidia-smi-exporter/default-counters.csv:/etc/dcgm-exporter/default-counters.csv -p9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04, and it sees both GPUs. However, it can't detect GPU processes or GPU memory usage (a cleaned-up version of the command is sketched at the end of this post). Here is the output of the nvidia-smi utility on the host:

$ nvidia-smi
Mon Aug 23 23:03:29 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   60C    P0    42W / 250W |   1393MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   64C    P0    47W / 250W |  10095MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17748      C   ...189c/arasov/bin/python3.7        0MiB |
|    0   N/A  N/A     53799      C   ...189c/arasov/bin/python3.7     1389MiB |
|    1   N/A  N/A     17748      C   ...189c/arasov/bin/python3.7    10091MiB |
|    1   N/A  N/A     53799      C   ...189c/arasov/bin/python3.7        0MiB |
+-----------------------------------------------------------------------------+

And here is the output of nvidia-smi inside the container:

root@ccdc999ac0bd:/# nvidia-smi
Mon Aug 23 19:25:22 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00    Driver Version: 455.32.00    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:37:00.0 Off |                    0 |
| N/A   59C    P0    41W / 250W |   1393MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:86:00.0 Off |                    0 |
| N/A   62C    P0    46W / 250W |  10095MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Am I missing something or doing something wrong? How should I configure the container so that it detects GPU processes and GPU memory usage?
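
For completeness, this is roughly what I believe the intended invocation should look like, with the --priveleged typo corrected. The --gpus all and --cap-add SYS_ADMIN flags are taken from the dcgm-exporter documentation as I understand it, not from my original command, so treat this as a sketch rather than a verified fix:

# Sketch: run dcgm-exporter standalone with both GPUs visible and my counters file
# mounted over the default one. As far as I know, SYS_ADMIN is only needed for the
# profiling (DCP) metrics.
docker run -d --gpus all --cap-add SYS_ADMIN \
  -v /home/dockeradm/nvidia-smi-exporter/default-counters.csv:/etc/dcgm-exporter/default-counters.csv \
  -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04

After that, curl localhost:9400/metrics should list whichever fields the counters file enables.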
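
And in case the counters file matters: dcgm-exporter's counters CSV uses one line per metric in the form DCGM field name, Prometheus metric type, help text. Below is a minimal sketch of the memory-related entries I assume are needed for the memory-usage panels; it is not a copy of my actual file:

# DCGM field,          Prometheus type, help text
DCGM_FI_DEV_GPU_UTIL,  gauge,           GPU utilization (in %).
DCGM_FI_DEV_FB_USED,   gauge,           Framebuffer memory used (in MiB).
DCGM_FI_DEV_FB_FREE,   gauge,           Framebuffer memory free (in MiB).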