Why is Gradient Checkpointing Not Implemented in Training?

Checks

[X] This template is only for questions, not feature requests or bug reports.
[X] I have thoroughly reviewed the project documentation and read the related paper(s).
[X] I have searched for existing issues, including closed ones, and found no similar questions.
[X] I confirm that I am using English to submit this report in order to facilitate communication.

Question details

It appears that gradient checkpointing is not implemented in the current training pipeline. Gradient checkpointing can significantly reduce peak memory usage at the cost of extra computation, since intermediate activations are recomputed during the backward pass instead of being stored. That makes it valuable for large models and resource-limited environments. This raises the question:

Is there a specific reason for not implementing gradient checkpointing? If possible, could it be integrated in future updates, or are there known limitations that prevent its integration? If there is no compatibility issue, I would be open to exploring the possibility of adding it via a PR.
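
For reference, here is a minimal sketch of the kind of change I have in mind. It is not taken from this repository: `CheckpointedStack`, its block layout, and the `use_checkpoint` flag are hypothetical names for illustration, and I am assuming the model can be treated as a stack of PyTorch blocks wrapped with `torch.utils.checkpoint.checkpoint`.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedStack(nn.Module):
    """Hypothetical residual stack; stands in for a stack of transformer blocks."""

    def __init__(self, dim=64, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
                for _ in range(depth)
            ]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training and torch.is_grad_enabled():
                # Do not keep this block's intermediate activations;
                # they are recomputed when backward() reaches the block.
                x = x + checkpoint(block, x, use_reentrant=False)
            else:
                x = x + block(x)
        return x


if __name__ == "__main__":
    torch.manual_seed(0)
    model = CheckpointedStack(dim=64, depth=4, use_checkpoint=True).train()
    batch = torch.randn(8, 64)
    loss = model(batch).sum()
    loss.backward()  # blocks are re-run here to rebuild their activations
    print("backward pass completed with checkpointing enabled")
```

With a flag like this, checkpointing could stay off by default and only be enabled when memory is the bottleneck. Gradients should match the non-checkpointed path, with the cost being roughly one extra forward pass per checkpointed block.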

SWivid / F5-TTS

Why is Gradient Checkpointing Not Implemented in Training? #399