Open SaulLu opened 2 years ago
Does #42 serve half of the purpose (saving the model)?
Indeed, your PR #42 is also really useful (it should be merged; I sent you a private message about this).
What I have in mind with this issue is rather to trigger the save after a certain amount of time, since jobs on JZ are limited to 20h. If I'm not mistaken, that is not included in your current PR #42, right?
As discussed a long time ago in a meeting, it would be really great to have a feature that saves the model and stops training after a certain time, since jobs on the JZ cluster are limited to 20 hours.
For example, the architecture and scaling working group added the `exit-duration-in-mins` argument to Megatron-DeepSpeed, the library used to run trainings.
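To make the request concrete, here is a minimal sketch of what a time-based exit could look like in a plain Python training loop. It is not the Megatron-DeepSpeed implementation and not the code of this repository: `TimeBudget`, `train`, `train_step` and `save_checkpoint` are hypothetical names chosen for illustration, and only the parameter name `exit_duration_in_mins` echoes the argument mentioned above.

```python
import time
from typing import Callable, Optional


class TimeBudget:
    """Tracks elapsed wall-clock time against an optional limit in minutes.

    Hypothetical helper for illustration; not part of any existing library.
    """

    def __init__(
        self,
        exit_duration_in_mins: Optional[float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # None means "no time limit".
        self.limit_s = (
            None if exit_duration_in_mins is None else exit_duration_in_mins * 60.0
        )
        self._clock = clock
        self._start = clock()

    def expired(self) -> bool:
        """Return True once the elapsed time has reached the limit."""
        if self.limit_s is None:
            return False
        return self._clock() - self._start >= self.limit_s


def train(
    num_steps: int,
    train_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    budget: TimeBudget,
) -> int:
    """Run up to num_steps steps and return the number of steps completed.

    After each step the time budget is checked; if it is exhausted, a
    checkpoint is saved and training stops early.
    """
    for step in range(1, num_steps + 1):
        train_step(step)
        if budget.expired():
            save_checkpoint(step)
            return step
    # Normal termination: save the final state as well.
    save_checkpoint(num_steps)
    return num_steps
```

With a 20h job limit, the budget would presumably be set somewhat below 1200 minutes (for instance `TimeBudget(exit_duration_in_mins=1140)`) so that the checkpoint has time to be written before the scheduler kills the job. In a multi-process run, the processes would also need to agree on the decision to stop, which this sketch does not cover.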
related: #37 (#42)