Use during training time

fffasttime / MRFI

MIT License


Hi.

Thank you for this great contribution. I would like to know whether MRFI can be used during training, e.g. to perform fault-tolerant training. As I understand it, faults are injected through PyTorch hooks, but will gradients propagate correctly through the injected faults during training (i.e. will the outputs affected by an injection be decoupled from their inputs)?

Looking at the code, I see the forward hooks added in mrfi.MRFI.__add_hooks(), but I don't see any backward hooks implemented. So I assume that MRFI currently can't be used at training time. Please correct me if I'm wrong.
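To make the question concrete, here is a minimal sketch in plain PyTorch of the behavior I have in mind. It is not MRFI code: the hook `stuck_at_zero_hook` and its stuck-at-zero fault model are hypothetical and only serve as an illustration. It shows that when a forward hook returns a tensor built with ordinary differentiable operations, autograd already decouples the faulted output from its input, without any backward hook. Whether MRFI's own injectors modify tensors in a way that behaves like this is exactly what I could not tell from the code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


def stuck_at_zero_hook(module, inputs, output):
    # Hypothetical fault model (an assumption, not MRFI's API):
    # force output element [0, 0] to zero.
    mask = torch.ones_like(output)
    mask[0, 0] = 0.0
    # Out-of-place multiply keeps the operation in the autograd graph,
    # so the faulted element contributes zero gradient upstream.
    return output * mask


layer = nn.Linear(3, 2)
# A forward hook that returns a value replaces the module's output.
handle = layer.register_forward_hook(stuck_at_zero_hook)

x = torch.randn(1, 3)
y = layer(x)          # y[0, 0] is the faulted output
y.sum().backward()

# Weight row 0 feeds only the faulted output, so its gradient is zero;
# weight row 1 feeds the clean output, so its gradient equals the input.
grad_faulted_row = layer.weight.grad[0]
grad_clean_row = layer.weight.grad[1]

handle.remove()
```

In this sketch the gradient is blocked only because the fault is expressed as a differentiable masking operation. A hook that overwrites values in place or on detached data could behave differently.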

If that's the case, are there any plans to implement this feature? It would be very useful for building fault-tolerant training pipelines.

Thanks

Use during training time #2