NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

how can I clear stale XID error #112

Open zdyang opened 11 months ago

zdyang commented 11 months ago

We use dcgm-exporter to monitor our GPU cluster. Occasionally we see an XID error, specifically XID 31, which is typically caused by a user program. However, the XID error is still reported even after the program has exited. Is there a way to clear this stale XID error? Thank you for your help.
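
For reference, here is a minimal sketch of how the stale value can be observed in the exporter output. It assumes the error shows up through the `DCGM_FI_DEV_XID_ERRORS` metric in Prometheus text format; the metric name, the sample text, the label values, and the helper `gpus_with_xid` are illustrative assumptions and are not taken from our actual setup.

```python
import re

# Assumed sample of dcgm-exporter output (Prometheus text format).
# The label names and values below are illustrative, not real cluster data.
SAMPLE = """\
# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-aaaa"} 31
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-bbbb"} 0
"""

# Matches one sample line of the assumed metric: name, {labels}, value.
LINE_RE = re.compile(
    r'^DCGM_FI_DEV_XID_ERRORS\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
# Matches individual key="value" label pairs.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def gpus_with_xid(metrics_text):
    """Return {gpu_label: xid_value} for GPUs reporting a non-zero XID."""
    result = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments and unrelated metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        value = int(float(match.group("value")))
        if value != 0:
            result[labels.get("gpu", "?")] = value
    return result


if __name__ == "__main__":
    print(gpus_with_xid(SAMPLE))
```

In our case, a GPU keeps appearing in a result like this with the value 31 long after the offending program has exited, which is the behavior I would like to reset.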