NVIDIA / gpu-monitoring-tools

Tools for monitoring NVIDIA GPUs on Linux
Apache License 2.0

dcgm-exporter can't run #212

Open JohanOu opened 3 years ago

JohanOu commented 3 years ago

It logs this:

root@octopus-worker1:/home/practice# docker logs d674af870ff5
Starting NVIDIA host engine...
Got error 11 while waiting for SIGUSR1 from child process.
Collecting metrics at /run/prometheus/dcgm.prom every 1000ms...
Stopping NVIDIA host engine...
Unable to terminate host engine, it may not be running.
/usr/local/bin/dcgm-exporter: line 141: kill: (12) - No such process
Done

How can I solve this? Thanks.

yongqiangz commented 2 years ago

@JohanOu I have the same issue. Have you solved it?

yongqiangz commented 2 years ago

In my case, the cause was that I had upgraded the GPU driver. After I also upgraded dcgm-exporter to a newer version, it worked fine.