Closed: xiaohai1234 closed this 8 months ago
@xiaohai1234,
Errors injected with dcgmi are just values added to DCGM's internal cache, so they only affect DCGM tools and policies. nvidia-smi does not read that cache, which is why its ECC counts do not change. There is no way to inject a real ECC error into the hardware.
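To illustrate why the two tools disagree, here is a minimal toy model in Python. It is not DCGM or NVML code: `ToyGpu`, `ToyMonitorCache`, and `INJECTED_FIELD_ID` are hypothetical names made up for this sketch, and the only values taken from this thread are the field ID 319 and the value 4.

```python
from dataclasses import dataclass, field

# Field ID and value copied from the command in this thread (-f 319 -v 4).
# The constant name is hypothetical, chosen only for this sketch.
INJECTED_FIELD_ID = 319


@dataclass
class ToyGpu:
    """Stand-in for the hardware ECC counter (the view a driver-level tool reads)."""
    ecc_dbe_volatile: int = 0


@dataclass
class ToyMonitorCache:
    """Stand-in for a monitoring layer's cache (the view its own tools and policies read)."""
    gpu: ToyGpu
    injected: dict = field(default_factory=dict)

    def inject(self, field_id: int, value: int) -> None:
        # Only the cache entry changes; the ToyGpu object is never touched.
        self.injected[field_id] = value

    def read(self, field_id: int) -> int:
        # An injected value wins if present. Otherwise fall back to the
        # single hardware counter this toy model tracks.
        if field_id in self.injected:
            return self.injected[field_id]
        return self.gpu.ecc_dbe_volatile


gpu = ToyGpu()
cache = ToyMonitorCache(gpu)
cache.inject(INJECTED_FIELD_ID, 4)

print("monitoring view:", cache.read(INJECTED_FIELD_ID))
print("hardware view:  ", gpu.ecc_dbe_volatile)
```

In this sketch the monitoring view reports the injected value while the hardware counter stays at zero, which mirrors the behavior described above: the injection is useful for exercising monitoring tools and policies, not for testing hardware fault handling.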
@nikkon-dev, thanks.
Hello. With the command "dcgmi test --inject --gpuid 0 -f 319 -v 4", I can inject device memory double-bit volatile ECC errors, and the command reports success. But when I check with "nvidia-smi -q -d ECC", the ECC count does not change. Is there a way to inject real memory ECC faults? Thanks.