allenai / open-instruct

Apache License 2.0

Finetuning OLMo model hangs at end of training #123

Closed dwadden closed 5 months ago

dwadden commented 7 months ago

I tried to finetune an OLMo-7B model. According to the logs, training completed successfully and all intermediate checkpoints were saved, but the program seemed to hang while writing out the final model. See this Beaker job for an example. Hypothesis from @hamishivi is that there's some issue with saving to the top-level result directory specifically. Proposed fix: save the final checkpoint to a subdirectory instead and see if that resolves it. We can test this by running a job for about 100 steps and checking whether the final save succeeds. If this sounds reasonable, I can try to implement it.

dwadden commented 5 months ago

I tried fixing this by adding a flag --final_checkpoint_name that tells the script to dump the final checkpoint in {args.output_dir}/{args.final_checkpoint_name} rather than in args.output_dir. It doesn't seem to work; training still hangs. Seems like there's some other issue.
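
For readers following along, here is a minimal sketch of how a flag like this could resolve the final save location. It is not the actual open-instruct code: the helper names resolve_final_checkpoint_dir and build_parser are hypothetical, and only the flag name and the {args.output_dir}/{args.final_checkpoint_name} layout come from the comment above.

```python
import argparse
import os


def resolve_final_checkpoint_dir(output_dir, final_checkpoint_name=None):
    """Return the directory where the final checkpoint should be written.

    Hypothetical helper: if a name is given, save into a subdirectory of
    output_dir; otherwise fall back to output_dir itself.
    """
    if final_checkpoint_name:
        return os.path.join(output_dir, final_checkpoint_name)
    return output_dir


def build_parser():
    # Only the two arguments relevant to this issue are shown.
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    parser.add_argument("--final_checkpoint_name", default=None)
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    final_dir = resolve_final_checkpoint_dir(
        args.output_dir, args.final_checkpoint_name
    )
    # Create the target directory before the (omitted) save call.
    os.makedirs(final_dir, exist_ok=True)
    print(f"Final checkpoint would be written to: {final_dir}")
```

As noted above, changing the save location alone did not stop the hang, so this only illustrates the attempted workaround.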

hamishivi commented 5 months ago

I think the best thing to do is probably to launch a training run in a Beaker session and hit Ctrl-C during the hang to work out where it is getting stuck. My guess is some funky DeepSpeed issue.
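
If an interactive Ctrl-C is inconvenient, for example in a multi-process job, one possible alternative is Python's standard-library faulthandler module, which can dump every thread's stack on a signal or after a timeout. The sketch below is an assumption about how one might wire that in, not something used in this thread; install_hang_diagnostics is a hypothetical helper name.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=None):
    """Set up stack dumps so a hung process can report where it is stuck.

    Hypothetical helper: call once near the start of the training script.
    """
    # Dump tracebacks on fatal errors such as segfaults.
    faulthandler.enable(file=sys.stderr, all_threads=True)

    # On Unix, `kill -USR1 <pid>` then prints the stacks of all threads
    # without terminating the process.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

    # Optionally dump stacks every timeout_s seconds until cancelled.
    if timeout_s is not None:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=sys.stderr)


if __name__ == "__main__":
    install_hang_diagnostics()
    print("Hang diagnostics installed.")
```

With this in place, sending SIGUSR1 to each rank during the hang should show which call the process is blocked in, which is the same information the Ctrl-C traceback would give.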

dwadden commented 5 months ago

The PR fixes this; closing.