foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

feat: add save_model_dir flag where final checkpoint saved #291

Closed · anhuong closed this 3 months ago

anhuong commented 3 months ago

Description of the change

Related issue number

#217

How to verify the PR

Tested the following (a sketch of how the resulting directory layout can be checked follows the list):

  1. `save_strategy="no"` and `save_model_dir` set to a different path than `output_dir` --> verified it saves the final model and does not save any checkpoints in `output_dir`, only logs

  2. `save_total_limit=2` and `output_dir` set (aka `save_model_dir` not set) --> only checkpoints are saved, along with logs

  3. `save_strategy="no"` and `output_dir==save_model_dir` --> verified that logs and the model are saved to the path

  4. `save_strategy="epoch"` and `save_total_limit=2` and `output_dir==save_model_dir` --> checkpoint dirs, the model, and training logs are all written to the path

  5. `accelerate_launch.py`:

    • `save_total_limit=3` and `save_model_dir==output_dir` --> same as 4: checkpoints, training logs, and the model written to the path
    • `save_strategy="no"` and `save_model_dir==output_dir` --> same as 3: only the model and logs written to the path
    • `save_total_limit=1` and `output_dir` a subdir of `save_model_dir` --> `output_dir` with checkpoints and logs inside of `save_model_dir`
    • `save_total_limit=1` and `save_model_dir` a subdir of `output_dir` --> `output_dir` has checkpoints, logs, and a dir with the model
  6. `accelerate_launch`: Finally, I also verified that the lm_head removal continued to work as expected:

    • `save_total_limit=1`, `save_model_dir==output_dir`, granite-3b-code-base --> verified that with LoRA and fine-tuning the model was saved to the given path with lm_head removed, but the checkpoint did not have lm_head removed
    • `save_strategy="no"`, `save_model_dir==output_dir`, granite-3b-code-base --> verified lm_head removed and no additional checkpoints saved
    • `save_total_limit=1`, `output_dir` only, granite-3b-code-base (aka no `save_model_dir` given) --> verified lm_head removed from the final checkpoint
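As referenced above, here is a minimal sketch of how the expected layout for a run like scenario 1 can be checked. The paths are illustrative placeholders, and a single-file HF save format is assumed; these are not the exact assertions used in testing:

```python
import os

# Hedged sketch for scenario 1: save_strategy="no" with save_model_dir set to a
# different path than output_dir. Both paths below are hypothetical.
output_dir = "/tmp/out"          # training output_dir (placeholder)
save_model_dir = "/tmp/model"    # save_model_dir (placeholder)

# No intermediate checkpoint-* dirs should appear in output_dir, only logs.
assert not any(d.startswith("checkpoint-") for d in os.listdir(output_dir))

# The final model (config + weights) should land in save_model_dir.
assert os.path.exists(os.path.join(save_model_dir, "config.json"))
```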

Ran vLLM inference on LoRA and fine-tuned llama-13b-base models that were saved both in separate dirs and in the same dir. Fine-tuning gave a good "no complaint" inference result; LoRA gave poor results after tuning, but a marginal improvement over the base model.
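For reference, the inference smoke test was of this shape. This is a minimal sketch assuming vLLM's offline `LLM` API, with a placeholder model path and prompt (the LoRA case would additionally pass the adapter via vLLM's LoRA support):

```python
from vllm import LLM, SamplingParams

# Hedged sketch: load the tuned model from the directory written by
# save_model_dir and run one greedy generation. Path and prompt are placeholders.
llm = LLM(model="/tmp/model")
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["### Input:\nI am frustrated with my bill.\n\n### Response:"], params)
print(outputs[0].outputs[0].text)
```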

Was the PR tested

anhuong commented 3 months ago

Wanted to call out a note from the description: for lm_head removal, the code removes lm_head from the model in `save_model_dir` if it exists, and otherwise removes it from the final checkpoint. Is this the behavior we want, or do we only want to remove lm_head when `save_model_dir` is passed?
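For context, the removal boils down to dropping the `lm_head.weight` tensor from whichever saved artifact gets post-processed. A minimal sketch assuming a single-file safetensors checkpoint (the actual logic in `accelerate_launch.py` differs, e.g. it also has to handle sharded checkpoints):

```python
import os
from safetensors.torch import load_file, save_file

def strip_lm_head(model_dir: str) -> None:
    """Hypothetical helper: drop lm_head from a single-file safetensors save."""
    path = os.path.join(model_dir, "model.safetensors")
    state_dict = load_file(path)
    # Only rewrite the file if an lm_head tensor was actually present.
    if state_dict.pop("lm_head.weight", None) is not None:
        save_file(state_dict, path, metadata={"format": "pt"})
```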

anhuong commented 3 months ago

> Have you verified the saved model, using save(), still infers on vLLM?

Yes, this is covered in the description of the PR above:

> Ran vLLM inference on LoRA and fine-tuned llama-13b-base models that were saved both in separate dirs and in the same dir. Fine-tuning gave a good "no complaint" inference result; LoRA gave poor results after tuning, but a marginal improvement over the base model.