NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0
1.02k stars 301 forks

node-exporter #54

Closed damon008 closed 4 years ago

damon008 commented 4 years ago

I applied this manifest: https://github.com/NVIDIA/gpu-monitoring-tools/blob/master/exporters/prometheus-dcgm/k8s/node-exporter/gpu-node-exporter-daemonset.yaml and the pod logs show a failure:

root@XP005:/home/gpu-monitoring-tools# kubectl logs -f node-exporter-c99c5 -c nvidia-dcgm-exporter
Starting NVIDIA host engine...
Failed to start host engine server
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Host engine successfully terminated.
Done

damon008 commented 4 years ago

why?

guptaNswati commented 4 years ago

Did you follow all the prerequisites listed here: https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter#prerequisites ? Also check that your GPU is on the supported list: https://github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm#dcgm-supported-gpus

guptaNswati commented 4 years ago

Running the exporter on a supported GPU with all pre-reqs in place should solve it. Closing for now. Re-open if the problem still persists.
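
For anyone hitting the same symptom, here is a minimal Python sketch for spotting the failure line from the log above without reading the output by hand. The helper names `host_engine_failed` and `fetch_logs` are hypothetical and not part of gpu-monitoring-tools. The container name is taken from the `kubectl logs` command in the original report, and the check only detects the message; it does not diagnose which prerequisite is missing.

```python
import subprocess

# Failure message copied from the log output in the original report.
FAILURE_MARKER = "Failed to start host engine server"


def host_engine_failed(log_text: str) -> bool:
    """Return True if any log line contains the host engine startup failure."""
    return any(FAILURE_MARKER in line for line in log_text.splitlines())


def fetch_logs(pod: str, container: str = "nvidia-dcgm-exporter") -> str:
    """Fetch a container's logs via kubectl (hypothetical helper).

    Runs the same command as in the report, minus -f so that it returns
    instead of streaming. Assumes kubectl is on PATH and configured.
    """
    result = subprocess.run(
        ["kubectl", "logs", pod, "-c", container],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Assuming the pod name from the report, `host_engine_failed(fetch_logs("node-exporter-c99c5"))` would return `True` for the log shown above. In that case, the prerequisites and supported GPU list linked earlier are the first things to check.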