allenai / open-instruct

Apache License 2.0

Add support for resuming from checkpoint #176

Closed jacob-morrison closed 1 month ago

jacob-morrison commented 3 months ago

We should add support for resuming training partway through, e.g. if a job is preempted. I think most of the functionality is already there, but we'll need to 1) save checkpoints more often (saving only once per epoch will probably waste a lot of work, depending on how often we get preempted) and 2) handle --resume_from_checkpoint properly.

We'll need to handle resume_from_checkpoint correctly both at the start of training and when resuming partway through. That will most likely mean looking for an existing checkpoint manually before letting the automagic Hugging Face code resume training. We'll also need to avoid overwriting the output folder.
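
A rough sketch of what the manual checkpoint lookup could look like, to make the idea concrete. Everything here is an assumption for illustration: the helper names `find_latest_checkpoint` and `resolve_resume_path` are hypothetical, and the `step_<N>` folder naming is a guess at a checkpoint layout, not something confirmed in this thread. It only uses the standard library and does not call any Hugging Face API.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme for checkpoint folders, e.g. "step_500" (hypothetical).
_STEP_DIR = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the step_<N> subfolder with the largest N, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _STEP_DIR.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


def resolve_resume_path(output_dir: str, resume_from_checkpoint: Optional[str]) -> Optional[Path]:
    """Pick the checkpoint to resume from, refusing to clobber existing output."""
    latest = find_latest_checkpoint(output_dir)
    if latest is not None:
        # A preempted run left checkpoints behind: continue from the newest one.
        return latest
    if resume_from_checkpoint:
        # No checkpoints in the output folder: use the explicitly requested one.
        return Path(resume_from_checkpoint)
    root = Path(output_dir)
    if root.is_dir() and any(root.iterdir()):
        # Non-empty folder with nothing to resume from: stop rather than overwrite.
        raise FileExistsError(f"{output_dir} is not empty and has no checkpoint to resume from")
    return None  # Fresh run, nothing to resume.
```

The design choice in this sketch is that a checkpoint found in the output folder takes priority over the flag, so a restarted job with unchanged arguments would pick up where it left off. Whether that priority order is the right one is an open question.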

natolambert commented 2 months ago

Links to make this: