Open fortminors opened 1 week ago
I just found out that DCGM unfortunately does not support GTX/RTX GPUs, as pointed out by this comment. It would be really useful to add this to the documentation, since I can easily build a cloud with GTX/RTX GPUs.
Is there a similar tool that does the same thing for GTX/RTX? Except of course profiling with nsys/ncu.
I just want to monitor the SM occupancy rates at every point in time without interfering with the running programs.
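As a coarse stopgap, the kind of polling asked about here can be sketched with nvidia-smi's query interface, which also works on GeForce cards. Note the assumption: `utilization.gpu` reports the percentage of time during which at least one kernel was running, which is not SM occupancy, so this only approximates what the DCGM profiling metrics would give. The helper names (`parse_smi_csv`, `sample_gpus`) are made up for illustration.

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`; utilization.gpu is a
# coarse "some kernel was running" percentage, NOT SM occupancy.
QUERY = "index,utilization.gpu,utilization.memory"


def _to_int(field):
    """Convert a CSV field to int, or None for values like '[N/A]'."""
    field = field.strip()
    return int(field) if field.isdigit() else None


def parse_smi_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        idx, gpu_util, mem_util = line.split(",")
        rows.append({
            "index": _to_int(idx),
            "gpu_util": _to_int(gpu_util),
            "mem_util": _to_int(mem_util),
        })
    return rows


def sample_gpus():
    """Run nvidia-smi once and return one reading per GPU."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_smi_csv(out)
```

Calling `sample_gpus()` in a loop with `time.sleep(1)` gives a simple time series. It reads driver counters only and does not attach to the running processes.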
I also encountered the same problem. How can I solve it?
Driver Version: 525.85.12
exporter-image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
nvidia-smi
Sat Oct 12 16:06:30 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
logs:
time="2024-10-12T07:02:49Z" level=info msg="Starting dcgm-exporter"
time="2024-10-12T07:02:49Z" level=info msg="DCGM successfully initialized!"
time="2024-10-12T07:02:49Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-10-12T07:02:49Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 30 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 31 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 32 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 33 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 34 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 35 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 36 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2024-10-12T07:02:49Z" level=info msg="Initializing system entities of type: GPU"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Initializing system entities of type: NvLink"
time="2024-10-12T07:02:55Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2024-10-12T07:02:55Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-10-12T07:02:55Z" level=info msg="Pipeline starting"
time="2024-10-12T07:02:55Z" level=info msg="Starting webserver"
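To see at a glance which counters the exporter dropped, the warning lines in a log like the one above can be filtered with a small script. This is only a sketch: it assumes the message format stays exactly as shown in this log ("Skipping line N ('NAME'): metric not enabled"), and `skipped_metrics` is a made-up helper name.

```python
import re

# Matches the warning that dcgm-exporter prints in the log above, e.g.
#   msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
SKIP_RE = re.compile(
    r"Skipping line (\d+) \('([A-Za-z0-9_]+)'\): metric not enabled"
)


def skipped_metrics(log_text):
    """Return (csv_line_number, metric_name) pairs for every skipped metric."""
    return [(int(num), name) for num, name in SKIP_RE.findall(log_text)]


# Two lines copied from the log in this thread.
SAMPLE = """
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 28 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-10-12T07:02:49Z" level=warning msg="Skipping line 29 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
"""

for line_no, name in skipped_metrics(SAMPLE):
    print(f"counters file line {line_no}: {name}")
```

In the log above, every skipped entry is a DCGM_FI_PROF_* field, which is consistent with the earlier "Not collecting DCP metrics" message.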
Hello! I have built dcgm-exporter from source with
Then, I have created a custom metrics file with
And finally started dcgm-exporter with the custom metrics
This gives me
Looking at http://localhost:9400/metrics does not show any metrics, so I assume they are not collected (and/or not enabled), which is in fact what the dcgm-exporter logs state.
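A quick way to confirm this from the endpoint itself is to list the metric names it exposes and compare them with the ones requested. The sketch below assumes the endpoint serves the standard Prometheus text format; `metric_names`, `fetch_metric_names`, and the sample text are illustrative and not taken from real exporter output.

```python
import urllib.request


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is either NAME{labels} VALUE or NAME VALUE.
        head = line.split("{", 1)[0].split(None, 1)
        if head:
            names.add(head[0])
    return names


def fetch_metric_names(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the endpoint mentioned above and return its metric names."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return metric_names(resp.read().decode("utf-8"))


def missing(wanted, exposed):
    """Return the requested metrics that the endpoint does not expose."""
    return [name for name in wanted if name not in exposed]


# Illustrative input only; real output carries more labels and metrics.
SAMPLE = """# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 12
"""

print(missing(["DCGM_FI_PROF_SM_ACTIVE"], metric_names(SAMPLE)))
```

Against a live exporter, `missing([...], fetch_metric_names())` would list the requested counters that never made it into the output.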
I have also tried using the latest dcgm-exporter Docker images (nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04, the latest, and nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04, which matches my driver that ships with CUDA 12.2) with
But it gives me the same output
How should I deal with this issue? And how do I enable these metrics?