NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How do I inject errors into the GPU hardware? #124

Closed: eafayao closed this issue 10 months ago

eafayao commented 10 months ago

Hi, I'm using the error injection function of DCGM:

https://github.com/NVIDIA/DCGM/blob/cc3fe64d966d956cebba3e3ff1334786dd767d35/dcgmlib/src/DcgmCacheManager.cpp#L4878C21-L4878C21

From the code, I can't see that the error is actually injected into the GPU hardware. Is there any way to inject real errors into the GPU hardware, such as ECC errors?

glowkey commented 10 months ago

There is no way to inject the error into the GPU hardware. The injection API only injects the error into DCGM's internal cache.
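To illustrate the distinction, here is a minimal toy model in Python. It is not DCGM code and does not use any DCGM API. The names `FakeGpu`, `TelemetryCache`, `poll_hardware`, `inject`, and `latest` are hypothetical and exist only for this sketch. It shows why a value injected into a cache is visible to anything that reads the cache while the device's own counter is unaffected.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGpu:
    """Hypothetical stand-in for real hardware; injection never touches it."""
    ecc_dbe_count: int = 0


@dataclass
class TelemetryCache:
    """Toy model of a monitoring cache sitting between clients and the GPU."""
    gpu: FakeGpu
    samples: dict = field(default_factory=dict)

    def poll_hardware(self) -> None:
        # Normal path: read the device and store the sample in the cache.
        self.samples["ecc_dbe_count"] = self.gpu.ecc_dbe_count

    def inject(self, name: str, value: int) -> None:
        # Injection path: write a sample into the cache only.
        self.samples[name] = value

    def latest(self, name: str) -> int:
        # Clients (health checks, policies) read the cache, not the GPU.
        return self.samples[name]


gpu = FakeGpu()
cache = TelemetryCache(gpu)
cache.poll_hardware()

cache.inject("ecc_dbe_count", 4)
cached_value = cache.latest("ecc_dbe_count")  # the injected sample
hardware_value = gpu.ecc_dbe_count            # the device counter itself

print(cached_value, hardware_value)
```

In this sketch the injected sample is useful for exercising whatever consumes the cache, such as alerting or policy logic, but it says nothing about how the hardware would behave under a real fault.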