NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Running dcgmi diag -r 3 in dcgm-exporter fails with "nvvs binary was not found" #324

Closed: 287400117 closed this issue 6 months ago

287400117 commented 6 months ago

What is the version?

3.3.5-3.4.1

What happened?

After troubleshooting, I found that the Dockerfile contains a step that deletes the /usr/share/nvidia-validation-suite directory after installing datacenter-gpu-manager. (Screenshot attached.)
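
To confirm this from inside a running container, a small check like the one below may help. It is only a sketch: the directory comes from the report above, but the file name "nvvs" inside that directory is an assumption based on the error message, and find_nvvs is a hypothetical helper, not part of DCGM.

```python
import os
from pathlib import Path
from typing import Optional

# Directory named in the report above. The "nvvs" file name inside it is an
# assumption taken from the error message, not a documented path.
NVVS_DIR = Path("/usr/share/nvidia-validation-suite")


def find_nvvs(nvvs_dir: Path = NVVS_DIR) -> Optional[Path]:
    """Return the path to an executable nvvs file in nvvs_dir, or None."""
    candidate = nvvs_dir / "nvvs"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    found = find_nvvs()
    if found is None:
        print(f"nvvs not found under {NVVS_DIR}; dcgmi diag may fail")
    else:
        print(f"nvvs found at {found}")
```

If the function returns None inside the dcgm-exporter container, that would be consistent with the directory having been removed at image build time.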

What did you expect to happen?

The command dcgmi diag -r 3 should run successfully.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

glowkey commented 6 months ago

This is expected behavior. DCGM diagnostics were removed as they increase the size of the container and are not used by DCGM-Exporter. If DCGM diagnostics are needed, the standalone DCGM container has that functionality.
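
For readers who want to try the standalone container route, the sketch below only builds the two docker command lines one might run; it does not execute anything. Several things here are assumptions rather than facts from this thread: that the standalone DCGM image starts nv-hostengine on launch, that the host has the NVIDIA Container Toolkit so that --gpus works, and that the SYS_ADMIN capability is needed. The image reference is deliberately left for the caller to supply, and build_diag_commands and the container name "dcgm-diag" are hypothetical.

```python
import shlex
from typing import List


def build_diag_commands(
    image: str, container: str = "dcgm-diag", run_level: int = 3
) -> List[List[str]]:
    """Build argv lists to start a DCGM container and run diagnostics in it.

    The image reference must be supplied by the caller; none is assumed here.
    """
    # Start the container in the background with GPU access (assumed to
    # launch the DCGM host engine as its default process).
    start = [
        "docker", "run", "-d", "--rm",
        "--gpus", "all",
        "--cap-add", "SYS_ADMIN",
        "--name", container,
        image,
    ]
    # Run the diagnostic from the issue title inside that container.
    diag = ["docker", "exec", container, "dcgmi", "diag", "-r", str(run_level)]
    return [start, diag]


if __name__ == "__main__":
    # Placeholder image reference; replace with the standalone DCGM image you use.
    for cmd in build_diag_commands("<standalone-dcgm-image>"):
        print(shlex.join(cmd))
```

Keeping the commands as argv lists makes it easy to pass them to subprocess.run later without shell quoting issues.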