Closed. dwadden closed this issue 5 months ago.
I tried fixing this by adding a flag `--final_checkpoint_name` that tells the script to dump the final checkpoint in `{args.output_dir}/{args.final_checkpoint_name}` rather than directly in `args.output_dir`. It doesn't seem to work; training is still hanging, so there must be some other issue.
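For reference, a minimal sketch of what that flag change might look like. The `--final_checkpoint_name` flag is the one described above; the argparse wiring and the `final_checkpoint_dir` helper name are assumptions for illustration, not code from the repo:

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output_dir", required=True)
    # Hypothetical flag: subdirectory name for the final checkpoint.
    parser.add_argument("--final_checkpoint_name", default="final")
    return parser

def final_checkpoint_dir(args):
    # Save the final checkpoint under a subdirectory of output_dir
    # instead of dumping it directly into args.output_dir.
    return os.path.join(args.output_dir, args.final_checkpoint_name)

args = build_parser().parse_args(
    ["--output_dir", "results", "--final_checkpoint_name", "final"]
)
print(final_checkpoint_dir(args))
```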
I think the best next step is to run a training job in a Beaker session and Ctrl-C during the hang to work out where it is getting stuck. My guess is some funky DeepSpeed issue.
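As an alternative to Ctrl-C (which DeepSpeed's own signal handling can swallow), Python's built-in `faulthandler` can dump every thread's stack on demand. A minimal sketch; registering it in the training script would be a new addition, not something the repo already does:

```python
import faulthandler
import signal
import tempfile

# On Unix, register a handler so that `kill -USR1 <pid>` dumps every
# thread's stack to stderr; handy for locating a hang without killing
# the job.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# For demonstration, dump the stacks once to a temporary file
# (faulthandler needs a real file descriptor, not a StringIO).
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()
print(dump)
```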
The PR fixes this; closing.
I tried to finetune an OLMo-7B model. Training completed successfully according to the logs, and all intermediate checkpoints were saved successfully, but the program seemed to hang when writing the final model. See this Beaker job for an example. @hamishivi's hypothesis is that there's some issue with saving to the top-level result directory specifically. Proposed fix: save the final checkpoint to a subdirectory instead and see if that resolves it; this can be tested by running a job for ~100 steps and checking whether it saves successfully. If this sounds reasonable, I can try to implement it.