Closed thies1006 closed 2 years ago
Apparently the script works when removing --checkpoint-activations.
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
What is your question?
How do I run training and validation of a translation model with FSDP?
Code
What have you tried?
The language model example runs fine. I then tried to merge the LM FSDP example with the ordinary translation example. The above script runs training fine (5 steps), but afterwards it gets stuck in validation after the first example. The GPUs and CPUs keep working.
Output:
(does not continue; waited for ~20 min)
What's your environment?
pip
: