NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded #398

Open fortminors opened 1 week ago

fortminors commented 1 week ago

Hello! I built dcgm-exporter from source with:

git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary

Then I created a custom metrics file:

cat << EOT > dcp-metrics-custom.csv
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active.
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
EOT

Finally, I started dcgm-exporter with the custom metrics file:

sudo cmd/dcgm-exporter/dcgm-exporter -c 500 -f dcp-metrics-custom.csv

This produces the following output:

2024/10/09 11:10:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
INFO[0000] Starting dcgm-exporter                       
INFO[0000] DCGM successfully initialized!               
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded 
INFO[0000] Falling back to metric file 'dcp-metrics-custom.csv' 
WARN[0000] Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled 
WARN[0000] Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled 
WARN[0000] Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled 
WARN[0000] Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled 
INFO[0000] Not collecting GPU metrics; no fields to watch for device type: 1 
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3 
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6 
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7 
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8 
INFO[0000] Pipeline starting                            
INFO[0000] Starting webserver                           
INFO[0000] Listening on                                  address="[::]:9400"
INFO[0000] TLS is disabled.                              address="[::]:9400" http2=false

Opening http://localhost:9400/metrics shows no metrics, so I assume they are not being collected (or not enabled), which matches what the dcgm-exporter logs say.
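
The skipped fields can be extracted mechanically from either log format shown in this thread. Below is a minimal Python sketch, assuming the log text is available as a string; `skipped_metrics` is a hypothetical helper name, and the pattern is based only on the warning lines quoted here.

```python
import re

# Matches the warning in both the short and the key=value log formats, e.g.
#   WARN[0000] Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
SKIP_RE = re.compile(r"Skipping line (\d+) \('([^']+)'\): metric not enabled")


def skipped_metrics(log_text):
    """Return (line_number, field_name) pairs for every skipped metric."""
    return [(int(num), name) for num, name in SKIP_RE.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "INFO[0000] Falling back to metric file 'dcp-metrics-custom.csv'\n"
        "WARN[0000] Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled\n"
        'time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 1 '
        "('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled\"\n"
    )
    for line_no, field in skipped_metrics(sample):
        print(line_no, field)
```

If every field in the metrics file appears in this list, the exporter has nothing left to watch, which would explain the empty /metrics page.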

I have also tried the dcgm-exporter Docker images nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04 (the latest) and nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04 (which matches my driver, the one that ships with CUDA 12.2), using:

docker run --gpus all -v ./custom_metrics/dcp-metrics-custom.csv:/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv --net host --cap-add SYS_ADMIN --privileged nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04 -f /etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv

But this gives the same output:

2024/10/09 11:51:23 maxprocs: Leaving GOMAXPROCS=16: CPU quota undefined
time="2024-10-09T11:51:23Z" level=info msg="Starting dcgm-exporter"
time="2024-10-09T11:51:23Z" level=info msg="DCGM successfully initialized!"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-09T11:51:24Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/custom_metrics/dcp-metrics-custom.csv'"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 0 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 1 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 2 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=warning msg="Skipping line 3 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting GPU metrics; no fields to watch for device type: 1"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
time="2024-10-09T11:51:24Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
time="2024-10-09T11:51:24Z" level=info msg="Pipeline starting"
time="2024-10-09T11:51:24Z" level=info msg="Starting webserver"
time="2024-10-09T11:51:24Z" level=info msg="Listening on" address="[::]:9400"
time="2024-10-09T11:51:24Z" level=info msg="TLS is disabled." address="[::]:9400" http2=false

How should I deal with this issue? And how do I enable these metrics?

$ nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:01:00.0  On |                  N/A |
|100%   91C    P2             141W / 170W |   4675MiB / 12288MiB |     89%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2436      G   /usr/lib/xorg/Xorg                         1216MiB |
|    0   N/A  N/A      2729      G   /usr/bin/gnome-shell                        169MiB |
|    0   N/A  N/A      4459      G   ...Telegram/Telegram                          2MiB |
|    0   N/A  N/A      5432      G   ...ures=SpareRendererForSitePerProcess      131MiB |
|    0   N/A  N/A      8298      G   ...seed-version=20241008-180117.502000      523MiB |
|    0   N/A  N/A     55717      G   ...ures=SpareRendererForSitePerProcess       68MiB |
|    0   N/A  N/A     71215      G   ...erProcess --variations-seed-version       60MiB |
|    0   N/A  N/A    591055      C   /prog                                      2478MiB |
+---------------------------------------------------------------------------------------+
$ dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
$ sudo nv-hostengine -f host.log --log-level debug
Err: Failed to start DCGM Server: -7
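
The module table above lists Profiling as "Not loaded", which is consistent with the "Not collecting DCP metrics" message in the exporter log. A small Python sketch for checking this automatically follows. It assumes the table layout printed above, and `module_states` is a hypothetical helper, not part of DCGM.

```python
def module_states(dcgmi_output):
    """Map module name -> state from the text of `dcgmi modules -l`."""
    states = {}
    for line in dcgmi_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border rows such as +-----+-----+
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Data rows have exactly three cells, the first being a numeric ID.
        if len(cells) == 3 and cells[0].isdigit():
            states[cells[1]] = cells[2]
    return states


if __name__ == "__main__":
    sample = (
        "| Module ID | Name               | State       |\n"
        "| 0         | Core               | Loaded      |\n"
        "| 8         | Profiling          | Not loaded  |\n"
    )
    states = module_states(sample)
    print("Profiling loaded:", states.get("Profiling") == "Loaded")
```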

fortminors commented 1 week ago

I just found out that, unfortunately, DCGM does not support GTX/RTX GPUs, as pointed out by this comment. It would be really useful to add this to the documentation, since it is easy to build a cloud out of GTX/RTX GPUs.

Is there a similar tool that does the same thing for GTX/RTX GPUs, other than profiling with nsys/ncu?

I just want to monitor SM occupancy continuously, without interfering with the running programs.
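
As a coarse stand-in, `nvidia-smi` can be polled for GPU utilization. This is not SM occupancy: `utilization.gpu` only reports the percentage of time during which at least one kernel was running. The Python sketch below assumes `nvidia-smi` is on the PATH; `parse_utilization` and `poll` are hypothetical helper names.

```python
import subprocess
import time

# Query flags of nvidia-smi; the output is one CSV row per GPU, without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,utilization.memory",
    "--format=csv,noheader,nounits",
]


def _to_int(value):
    """Convert a numeric cell to int; non-numeric cells (e.g. [N/A]) become None."""
    return int(value) if value.isdigit() else None


def parse_utilization(csv_text):
    """Parse 'index, gpu%, mem%' rows into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, gpu, mem = [field.strip() for field in line.split(",")]
        rows.append(
            {"index": _to_int(index), "gpu_util": _to_int(gpu), "mem_util": _to_int(mem)}
        )
    return rows


def poll(interval_s=1.0, samples=5):
    """Print utilization a few times; needs an NVIDIA driver and nvidia-smi."""
    for _ in range(samples):
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        print(parse_utilization(result.stdout))
        time.sleep(interval_s)
```

Calling `poll()` on a machine with the driver installed prints a handful of samples. The values are far less detailed than the DCGM profiling fields.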

yyang4069 commented 1 week ago

I encountered the same problem. How can I solve it?

Driver Version: 525.85.12
exporter-image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04

nvidia-smi:

Sat Oct 12 16:06:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

logs:

time="2024-10-12T07:02:49Z" level=info msg="Starting dcgm-exporter"
time="2024-10-12T07:02:49Z" level=info msg="DCGM successfully initialized!"
time="2024-10-12T07:02:49Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-12T07:02:49Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 31 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 32 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 33 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 34 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 35 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 36 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-12T07:02:55Z" level=info msg="Pipeline starting"
time="2024-10-12T07:02:55Z" level=info msg="Starting webserver"
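
In these logs the exporter appears to skip only the `DCGM_FI_PROF_*` fields and then continue with the rest of the default counters file. One possible way to avoid the warnings is to pass a counters file that omits those fields, using the `-f` flag shown earlier in this thread. The Python sketch below filters the CSV text. `drop_profiling_fields` is a hypothetical helper, and the prefix check is an assumption based on the field names in the logs.

```python
def drop_profiling_fields(csv_text, prefix="DCGM_FI_PROF_"):
    """Return csv_text without lines whose first column starts with prefix."""
    kept = []
    for line in csv_text.splitlines():
        field = line.split(",", 1)[0].strip()
        if field.startswith(prefix):
            continue  # profiling field: reported as "metric not enabled" above
        kept.append(line)  # comments and blank lines are kept unchanged
    return "".join(line + "\n" for line in kept)


if __name__ == "__main__":
    source = (
        "DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).\n"
        "DCGM_FI_PROF_SM_ACTIVE, gauge, Ratio of cycles an SM is active.\n"
    )
    print(drop_profiling_fields(source), end="")
```

This only silences the warnings. It does not make the profiling metrics available.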