NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Injected device memory ECC errors do not take effect #126

Closed. xiaohai1234 closed this issue 8 months ago

xiaohai1234 commented 8 months ago

Hello. With the command "dcgmi test --inject --gpuid 0 -f 319 -v 4", I can inject device memory double-bit volatile ECC errors, and the command reports success. But when I run "nvidia-smi -q -d ECC" to check for errors, the ECC count does not change. Is there a way to inject real memory ECC faults? Thanks.

nikkon-dev commented 8 months ago

@xiaohai1234,

Errors injected with dcgmi are only values added to DCGM's internal cache, so they affect only DCGM tools and policies. There is no way to inject a real ECC error into the hardware.
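
To illustrate why the nvidia-smi count stays unchanged, here is a minimal toy model in Python. It is only a sketch of the idea described above, not DCGM's actual implementation: the `FakeGpu` and `ToyMonitorCache` classes and their methods are hypothetical names invented for this example, and the only value taken from the thread is field ID 319 with the injected value 4.

```python
# Toy model only: every class and method name here is hypothetical,
# not part of DCGM or its Python bindings.
FIELD_DBE_VOLATILE_DEVICE = 319  # field ID from the dcgmi command in this thread


class FakeGpu:
    """Stands in for the hardware counter that a direct query would read."""

    def __init__(self):
        self.ecc_dbe_volatile_device = 0


class ToyMonitorCache:
    """Stands in for a monitoring layer that serves field values from a cache."""

    def __init__(self, gpu):
        self._gpu = gpu
        self._injected = {}

    def inject(self, field_id, value):
        # Injection writes to the cache only; the GPU object is never touched.
        self._injected[field_id] = value

    def read(self, field_id):
        # Tools built on the cache see the injected value when one exists.
        if field_id in self._injected:
            return self._injected[field_id]
        if field_id == FIELD_DBE_VOLATILE_DEVICE:
            return self._gpu.ecc_dbe_volatile_device
        raise KeyError(field_id)


gpu = FakeGpu()
cache = ToyMonitorCache(gpu)
cache.inject(FIELD_DBE_VOLATILE_DEVICE, 4)

cache_view = cache.read(FIELD_DBE_VOLATILE_DEVICE)  # what cache-based tools see
hardware_view = gpu.ecc_dbe_volatile_device         # what a direct hardware query sees
print(cache_view, hardware_view)
```

In this model the cache-based reader reports 4 while the hardware counter stays at 0, which mirrors the mismatch between the dcgmi injection result and the nvidia-smi output described in the question.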

xiaohai1234 commented 8 months ago

@nikkon-dev, thanks.