bigscience-workshop / metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.
Apache License 2.0

feat: save the model and stop training based on `exit-duration-in-mins` #67

Open SaulLu opened 3 years ago

SaulLu commented 3 years ago

As discussed in a meeting a long time ago, it would be really useful to have a feature that saves the model and stops training after a given amount of time, since jobs on the JZ cluster are limited to 20 hours.

For example, the architecture and scaling working group added the `exit-duration-in-mins` argument to Megatron-DeepSpeed, the library they use to run their trainings.
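To make the request concrete, here is a minimal sketch of a time-based exit, assuming a plain step-based training loop. It is not the Megatron-DeepSpeed implementation and not code from this repository: `DurationLimit`, `train`, `train_step` and `save_checkpoint` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Optional


class DurationLimit:
    """Compares elapsed wall-clock time against a limit given in minutes.

    Hypothetical helper: the name and signature are assumptions for this sketch.
    """

    def __init__(
        self,
        exit_duration_in_mins: Optional[float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # None means "no time limit".
        self.limit_secs = (
            None if exit_duration_in_mins is None else exit_duration_in_mins * 60.0
        )
        self.clock = clock
        self.start = clock()

    def exceeded(self) -> bool:
        """Return True once the elapsed time has reached the limit."""
        if self.limit_secs is None:
            return False
        return self.clock() - self.start >= self.limit_secs


def train(
    num_steps: int,
    train_step: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
    limit: DurationLimit,
) -> int:
    """Run up to num_steps steps; save and stop early if the limit is reached.

    Returns the number of the last completed step.
    """
    for step in range(1, num_steps + 1):
        train_step(step)
        # Check after each step so the checkpoint reflects a finished step.
        if limit.exceeded():
            save_checkpoint(step)
            return step
    save_checkpoint(num_steps)
    return num_steps
```

With a 20-hour job limit, the duration would be set somewhat below 20 hours (for example 19 hours, i.e. 1140 minutes) to leave time for writing the checkpoint. In a multi-process run, all processes would also need to agree on the decision to stop, which this sketch does not handle.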


related: #37 (#42)

shanyas10 commented 3 years ago

Does #42 serve half of the purpose (saving the model)?

SaulLu commented 3 years ago

Indeed, your PR #42 is also really useful (it should be merged; I'm sending you a private message about this).

What I have in mind with this issue is rather to trigger the save after a certain amount of time, since jobs on JZ are limited to 20h. If I'm not mistaken, that is not included in your current PR #42, right?