NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Fault injection in my PyTorch training job #192

Open hjx620 opened 6 days ago

hjx620 commented 6 days ago

I want to implement low-cost error recovery for deep learning training failures, so I need to simulate some errors in my PyTorch training script to test my system.

I see that DCGM supports fault injection, such as:

How can I use this in my PyTorch script? Thanks.

glowkey commented 2 days ago

Injecting errors into DCGM does not inject errors into the driver, NVML, or any other layer below DCGM itself. If your PyTorch code integrates with DCGM to determine GPU health (for example, through the DCGM health API), then an injected error will trigger a callback. If your PyTorch code relies on anything lower than DCGM in the software stack, error injection will have no effect.
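
Since DCGM injection never reaches the layers a training loop normally touches, one alternative for testing recovery logic is to inject the fault in the training loop itself. The following is a minimal sketch of that idea, not a DCGM feature and not a DCGM API: `InjectedFault`, `FaultInjector`, and `train` are hypothetical names invented for illustration, and the "training step" is a plain counter standing in for a real PyTorch forward/backward pass and checkpoint save.

```python
class InjectedFault(RuntimeError):
    """Hypothetical exception standing in for a GPU failure."""


class FaultInjector:
    """Raises InjectedFault once at each configured step (illustrative only)."""

    def __init__(self, fail_at_steps):
        self.fail_at_steps = set(fail_at_steps)

    def check(self, step):
        if step in self.fail_at_steps:
            # Fire only once per step so the retry after recovery can succeed.
            self.fail_at_steps.discard(step)
            raise InjectedFault(f"simulated GPU fault at step {step}")


def train(num_steps, injector, checkpoint_every=2):
    """Toy training loop that rolls back to the last checkpoint on a fault."""
    state = {"step": 0, "value": 0}
    checkpoint = dict(state)  # stand-in for a checkpoint written to disk
    recoveries = 0
    while state["step"] < num_steps:
        try:
            injector.check(state["step"])
            state["value"] += 1  # stand-in for one real training step
            state["step"] += 1
            if state["step"] % checkpoint_every == 0:
                checkpoint = dict(state)
        except InjectedFault:
            # Recovery path under test: restore the last saved state.
            state = dict(checkpoint)
            recoveries += 1
    return state, recoveries


if __name__ == "__main__":
    final_state, recoveries = train(6, FaultInjector([3]))
    print(final_state, recoveries)
```

In a real job the injector would wrap the actual step function, and the exception type would be chosen to match whatever failure the recovery system is expected to catch. This only exercises the recovery code path; it does not reproduce real driver or hardware behavior.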