NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

DCGM exporter crashes when installed by helm3 #180

Open jiangxiaosheng opened 3 years ago

jiangxiaosheng commented 3 years ago

Hi all, I followed the instructions in this guide to install the dcgm-exporter in the Prometheus framework. However, the dcgm-exporter crashes. When I run kubectl logs dcgm-exporter-1619697251-pzb8q, it shows the following.

time="2021-04-29T12:35:59Z" level=info msg="Starting dcgm-exporter" time="2021-04-29T12:35:59Z" level=info msg="DCGM successfully initialized!" time="2021-04-29T12:35:59Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded" time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled" time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled" time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled" time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled" time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled" time="2021-04-29T12:35:59Z" level=info msg="Kubernetes metrics collection enabled!" time="2021-04-29T12:35:59Z" level=info msg="Starting webserver" time="2021-04-29T12:35:59Z" level=info msg="Pipeline starting"

When I run kubectl describe pod dcgm-exporter-1619697251-pzb8q, it shows the following.

Warning  Unhealthy  58m (x5 over 59m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  29m (x43 over 59m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503

My Kubernetes version is 1.21.0, the Prometheus chart is kube-prometheus-stack-15.2.3, and the exporter chart is dcgm-exporter-2.3.1. I have 2 GeForce 1080 Ti cards in my machine.
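
For reference, this is roughly the sequence of helm commands I ran from the guide (just a sketch; the exact repo URLs, chart names, and values in your environment may differ):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
helm repo update

# kube-prometheus-stack provides the Prometheus operator and Grafana
helm install prometheus-community/kube-prometheus-stack --create-namespace --namespace prometheus --generate-name

# dcgm-exporter chart from this repository
helm install gpu-helm-charts/dcgm-exporter --generate-name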

I don't know what exactly causes this failure, and I've tried the suggestions from a lot of posts, but unfortunately none of them solved my problem. This problem is quite urgent for me since it's part of my undergraduate thesis, so any help will be greatly appreciated. Thanks in advance.

jiangxiaosheng commented 3 years ago

Is it because the DCGM exporter does not support GeForce cards, as https://github.com/NVIDIA/gpu-monitoring-tools/issues/141 suggests? But I'm still confused, since my dcgm-exporter version is 2.3.1 and @dualvtable said these errors are prevented after v2.1.2.

fabito commented 3 years ago

I am facing the same problem. Here is the output of dcgmi discovery -l:

4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information                                                   |
+--------+----------------------------------------------------------------------+
| 0      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:19:00.0                                         |
|        | Device UUID: GPU-fc67b07b-d44e-d387-2623-cdecf349ef9b                |
+--------+----------------------------------------------------------------------+
| 1      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:1A:00.0                                         |
|        | Device UUID: GPU-5ad20c11-6164-b709-27a4-e75eed635b49                |
+--------+----------------------------------------------------------------------+
| 2      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:67:00.0                                         |
|        | Device UUID: GPU-63caf029-1325-754e-0361-e30160b0432f                |
+--------+----------------------------------------------------------------------+
| 3      | Name: NVIDIA Quadro RTX 5000                                         |
|        | PCI Bus ID: 00000000:68:00.0                                         |
|        | Device UUID: GPU-71e4aeec-8cc9-db32-782a-87ef0d274db1                |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+

and the output of nvidia-smi:

Tue May 11 17:36:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA Quadro R...  On   | 00000000:19:00.0 Off |                  Off |
| 34%   28C    P8     8W / 230W |   4077MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA Quadro R...  On   | 00000000:1A:00.0 Off |                  Off |
| 34%   30C    P8    16W / 230W |   1173MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA Quadro R...  On   | 00000000:67:00.0 Off |                  Off |
| 34%   30C    P8     7W / 230W |   1822MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA Quadro R...  On   | 00000000:68:00.0 Off |                  Off |
| 33%   32C    P8    14W / 230W |   1830MiB / 16122MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A   1680644      C   /opt/conda/bin/python            1231MiB |
|    0   N/A  N/A   2798996      C   tritonserver                     2836MiB |
|    1   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A   2797762      C   /usr/bin/python3                 1165MiB |
|    2   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A   2798996      C   tritonserver                     1812MiB |
|    3   N/A  N/A      1365      G   /usr/lib/xorg/Xorg                  9MiB |
|    3   N/A  N/A      1543      G   /usr/bin/gnome-shell                3MiB |
|    3   N/A  N/A   2798996      C   tritonserver                     1810MiB |
+-----------------------------------------------------------------------------+
fabito commented 3 years ago

As a workaround, I've disabled the probes (liveness and readiness). The pod is no longer terminated/restarted and Prometheus can now scrape the metrics.

Perhaps, in the /health endpoint, updateMetrics() should be invoked (at least once) before getMetrics()?

https://github.com/NVIDIA/gpu-monitoring-tools/blob/75e0a1138db5c7be2f7049a4cde3761295d0761c/pkg/server.go#L100
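
One way to see the behavior from outside the pod is to port-forward to the exporter and compare what the two endpoints return (a quick check, assuming the default port 9400 and the /health path from the probe configuration above; replace the pod name with one of yours):

kubectl port-forward pod/<dcgm-exporter-pod> 9400:9400 &
# status code of the endpoint the probes hit (the 503 here is what makes kubelet restart the pod)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9400/health
# the metrics endpoint that Prometheus scrapes
curl -s http://localhost:9400/metrics | head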

Sanhajio commented 3 years ago

You can also override the livenessProbe:

$ kubectl edit daemonset.apps/dcgm-exporter

Set it to:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /health
            port: 9400
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 1

The issue is that the livenessProbe's initial delay is set to 5 seconds, which is not enough time for the process to start.
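
If you'd rather not open an editor, the same change can be applied with kubectl patch (a sketch; it assumes the DaemonSet is named dcgm-exporter and that the exporter is the first container in the pod spec):

kubectl patch daemonset dcgm-exporter --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60}]'

The readinessProbe can be patched the same way. Keep in mind that edits made directly to the DaemonSet may be reverted the next time the chart is upgraded or reinstalled with helm.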

nuckydong commented 3 years ago

Thanks, overriding the livenessProbe fixed it for me.