Open jiangxiaosheng opened 3 years ago
Is it because the DCGM exporter does not support GeForce cards, as https://github.com/NVIDIA/gpu-monitoring-tools/issues/141 says? But I'm still confused, since my dcgm-exporter version is 2.3.1 and @dualvtable said these errors are prevented after v2.1.2.
I am facing the same problem
Here is the output of dcgmi discovery -l:
4 GPUs found.
+--------+----------------------------------------------------------------------+
| GPU ID | Device Information |
+--------+----------------------------------------------------------------------+
| 0 | Name: NVIDIA Quadro RTX 5000 |
| | PCI Bus ID: 00000000:19:00.0 |
| | Device UUID: GPU-fc67b07b-d44e-d387-2623-cdecf349ef9b |
+--------+----------------------------------------------------------------------+
| 1 | Name: NVIDIA Quadro RTX 5000 |
| | PCI Bus ID: 00000000:1A:00.0 |
| | Device UUID: GPU-5ad20c11-6164-b709-27a4-e75eed635b49 |
+--------+----------------------------------------------------------------------+
| 2 | Name: NVIDIA Quadro RTX 5000 |
| | PCI Bus ID: 00000000:67:00.0 |
| | Device UUID: GPU-63caf029-1325-754e-0361-e30160b0432f |
+--------+----------------------------------------------------------------------+
| 3 | Name: NVIDIA Quadro RTX 5000 |
| | PCI Bus ID: 00000000:68:00.0 |
| | Device UUID: GPU-71e4aeec-8cc9-db32-782a-87ef0d274db1 |
+--------+----------------------------------------------------------------------+
0 NvSwitches found.
+-----------+
| Switch ID |
+-----------+
+-----------+
and nvidia-smi:
Tue May 11 17:36:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA Quadro R... On | 00000000:19:00.0 Off | Off |
| 34% 28C P8 8W / 230W | 4077MiB / 16125MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA Quadro R... On | 00000000:1A:00.0 Off | Off |
| 34% 30C P8 16W / 230W | 1173MiB / 16125MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA Quadro R... On | 00000000:67:00.0 Off | Off |
| 34% 30C P8 7W / 230W | 1822MiB / 16125MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA Quadro R... On | 00000000:68:00.0 Off | Off |
| 33% 32C P8 14W / 230W | 1830MiB / 16122MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1365 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 1680644 C /opt/conda/bin/python 1231MiB |
| 0 N/A N/A 2798996 C tritonserver 2836MiB |
| 1 N/A N/A 1365 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 2797762 C /usr/bin/python3 1165MiB |
| 2 N/A N/A 1365 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2798996 C tritonserver 1812MiB |
| 3 N/A N/A 1365 G /usr/lib/xorg/Xorg 9MiB |
| 3 N/A N/A 1543 G /usr/bin/gnome-shell 3MiB |
| 3 N/A N/A 2798996 C tritonserver 1810MiB |
+-----------------------------------------------------------------------------+
As a workaround I've disabled the probes (liveness and readiness). The pod is not terminated/restarted anymore and Prometheus can now scrape the metrics.
Perhaps, in the /health endpoint, updateMetrics() should be invoked (at least once) before getMetrics()?
You can also override the livenessProbe:
$ kubectl edit daemonset.apps/dcgm-exporter
Set it to:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 9400
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
The issue is that the livenessProbe's initial delay defaults to 5 seconds, which is not enough time for the process to start.
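The same override can be applied non-interactively with kubectl patch. This is a sketch under assumptions: the DaemonSet is named dcgm-exporter and its container is named exporter; verify both first with kubectl get daemonset dcgm-exporter -o yaml before running it.

```yaml
# Applied via:
#   kubectl patch daemonset dcgm-exporter --patch-file probe-patch.yaml
# Container name "exporter" is assumed; adjust to match your DaemonSet.
spec:
  template:
    spec:
      containers:
      - name: exporter
        livenessProbe:
          initialDelaySeconds: 60
```

Only the fields present in the patch are changed; the rest of the probe keeps its existing values under a strategic merge.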
Thanks, it works fine now.
Hi all, I followed the instructions in this guide to install the dcgm-exporter with the Prometheus framework. However, the dcgm-exporter pod keeps crashing. When I run
kubectl logs dcgm-exporter-1619697251-pzb8q
it shows the following:
time="2021-04-29T12:35:59Z" level=info msg="Starting dcgm-exporter"
time="2021-04-29T12:35:59Z" level=info msg="DCGM successfully initialized!"
time="2021-04-29T12:35:59Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-29T12:35:59Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-29T12:35:59Z" level=info msg="Starting webserver"
time="2021-04-29T12:35:59Z" level=info msg="Pipeline starting"
When I run
kubectl describe pod dcgm-exporter-1619697251-pzb8q
it shows the following:
Warning  Unhealthy  58m (x5 over 59m)   kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
Warning  Unhealthy  29m (x43 over 59m)  kubelet  Liveness probe failed: HTTP probe failed with statuscode: 503
My Kubernetes version is 1.21.0, the Prometheus chart is kube-prometheus-stack-15.2.3, and the exporter is dcgm-exporter-2.3.1. I have two GeForce GTX 1080 Ti cards in my machine.
I don't know what exactly causes this failure; I've tried suggestions from many posts, but none of them solved my problem. This issue is quite urgent for me since it's part of my undergraduate thesis, so any help would be greatly appreciated. Thanks in advance.