NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0
847 stars, 151 forks

Could not enable kubernetes metric collection: nvml: Unknown Error #329

Open 287400117 opened 3 months ago

287400117 commented 3 months ago

What is the version?

3.1.8-3.1.5

What happened?

When DCGM_REMOTE_HOSTENGINE_INFO is configured for dcgm-exporter, an error occasionally occurs after the dcgm-exporter Pod is rebuilt. Restarting the container with docker restart resolves the issue. The error message is as follows:

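[Screenshot of the error message, not available here. Per the issue title, the error is: "Could not enable kubernetes metric collection: nvml: Unknown Error"]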
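For anyone trying to reproduce this, it may help to first rule out basic connectivity to the remote hostengine. The sketch below is not part of dcgm-exporter and does not address the NVML error itself. It only parses the value of DCGM_REMOTE_HOSTENGINE_INFO, assuming the usual host:port form, and checks that the TCP endpoint accepts connections. The helper names `parse_hostengine_info` and `hostengine_reachable` are hypothetical, and the fallback port 5555 is an assumption based on the usual nv-hostengine default.

```python
import os
import socket

# Assumption: nv-hostengine usually listens on 5555 when no port is given.
DEFAULT_PORT = 5555


def parse_hostengine_info(value, default_port=DEFAULT_PORT):
    """Split a 'host[:port]' string into (host, port).

    Hypothetical helper for illustration; IPv6 literals are not handled.
    """
    value = value.strip()
    if not value:
        raise ValueError("empty hostengine address")
    host, sep, port = value.rpartition(":")
    if not sep:
        # No colon at all: the whole value is the host.
        return value, default_port
    if not host:
        raise ValueError(f"missing host in {value!r}")
    return host, int(port)


def hostengine_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    raw = os.environ.get("DCGM_REMOTE_HOSTENGINE_INFO", "")
    if raw:
        host, port = parse_hostengine_info(raw)
        print(f"{host}:{port} reachable: {hostengine_reachable(host, port)}")
    else:
        print("DCGM_REMOTE_HOSTENGINE_INFO is not set")
```

Running this from inside the rebuilt Pod, with the same environment as the exporter, would show whether the hostengine endpoint is reachable at the moment the error appears. A successful connection only rules out the network path; it says nothing about NVML.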

What did you expect to happen?

As in the title.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

nvvfedorov commented 3 months ago

@287400117, try using the latest version of dcgm-exporter.

287400117 commented 3 months ago

> @287400117, try using the latest version of dcgm-exporter.

See #330: the latest version produces a different error.