NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

How does the DCGM exporter work with DCGM? #383

Closed by changhyuni 2 months ago

changhyuni commented 2 months ago

Ask your question

Is DCGM just a binary file (not a system)?

When I type the command dcgmi on the host where the DCGM exporter is installed, I get "command not found".

No matter how much I look on the host, I can't find any DCGM binaries. However, in the DCGM exporter logs, I see messages indicating it's working, and I can see the metrics being exported.

Is the DCGM exporter pod running dcgmi through a host mount?

My logs:

nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Starting dcgm-exporter"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="DCGM successfully initialized!"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Collecting DCP Metrics"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Initializing system entities of type: GPU"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Not collecting NvSwitch metrics; no fields to watch for device type: 3"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Not collecting NvLink metrics; no fields to watch for device type: 6"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Not collecting CPU metrics; no fields to watch for device type: 7"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Not collecting CPU Core metrics; no fields to watch for device type: 8"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Kubernetes metrics collection enabled!"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Pipeline starting"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Starting webserver"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="Listening on" address=":9400"
nvidia-dcgm-exporter time="2024-08-30T08:22:52Z" level=info msg="TLS is disabled." address=":9400" http2=false

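Since the log reports the webserver listening on :9400, one way to confirm that metrics are really being exported is to fetch the endpoint and list the DCGM metric names. The following is only a sketch under assumptions not stated in the thread: the `/metrics` path, the `localhost` host (for example via a port-forward), and the `DCGM_` name prefix are all assumed, and both function names are hypothetical.

```python
import urllib.request


def parse_metric_names(text, prefix="DCGM_"):
    """Return the sorted, de-duplicated metric names that start with `prefix`.

    Expects Prometheus text exposition format; comment lines (# HELP, # TYPE)
    and blank lines are skipped.
    """
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://localhost:9400/metrics", timeout=5):
    """Fetch the raw metrics text. The URL is an assumption, not from the thread."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage against a reachable exporter (requires network access):
#   print(parse_metric_names(fetch_metrics()))
```

If the returned list is non-empty, the exporter is serving DCGM metrics regardless of whether dcgmi exists on the host.
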
glowkey commented 2 months ago

DCGM-Exporter uses go-dcgm, which in most cases utilizes DCGM's embedded hostengine functionality contained in libdcgm.so. /usr/bin/dcgmi is normally installed with the datacenter-gpu-manager package.
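The distinction above (the exporter needs the shared library, while dcgmi is a separate CLI) can be checked in whatever environment you are inspecting, whether the host or the exporter container. This is a rough sketch using only the Python standard library; `dcgm_presence` is a hypothetical helper name, and `ctypes.util.find_library` depends on the platform's linker tooling, so a `None` result is a hint rather than proof that the library is absent.

```python
import ctypes.util
import shutil


def dcgm_presence(lib_name="dcgm", cli_name="dcgmi"):
    """Report whether the DCGM shared library and the dcgmi CLI can be located.

    - "library": result of the linker lookup for lib<lib_name>, or None
    - "cli": full path of <cli_name> on PATH, or None
    """
    return {
        "library": ctypes.util.find_library(lib_name),
        "cli": shutil.which(cli_name),
    }


# Usage:
#   print(dcgm_presence())
```

A result with a library entry but no CLI entry would be consistent with an environment where metrics are exported even though dcgmi reports "command not found".
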

changhyuni commented 2 months ago

@glowkey Thank you for the comment. So the DCGM exporter should work fine even if DCGM isn't installed on the worker node(k8s), right?

glowkey commented 2 months ago

DCGM-Exporter requires libdcgm.so (found in the 'datacenter-gpu-manager' package). I'm not sure what you mean by 'worker node(k8s)', though, so I'm not sure how to answer your question.