Closed · anhuong closed 3 months ago
Wanted to call out a note from the description: for lm_head removal, removes `lm_head` from `save_model_dir` if it exists, otherwise removes it from the final checkpoint. Is this the behavior we want, or do we only want to remove `lm_head` if `save_model_dir` is passed?
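For reference, a minimal sketch (not this PR's code) of what stripping `lm_head` from a saved checkpoint directory can look like; the helper name, file layout, and the `lm_head.weight` key are assumptions for llama-style safetensors shards:

```python
import os
from safetensors.torch import load_file, save_file

def remove_lm_head(checkpoint_dir: str) -> None:
    """Drop the lm_head weight from every safetensors shard under checkpoint_dir."""
    for fname in os.listdir(checkpoint_dir):
        if not fname.endswith(".safetensors"):
            continue
        path = os.path.join(checkpoint_dir, fname)
        state_dict = load_file(path)
        # "lm_head.weight" is the usual key for llama-style models.
        if "lm_head.weight" in state_dict:
            del state_dict["lm_head.weight"]
            save_file(state_dict, path)
    # Note: a sharded checkpoint's model.safetensors.index.json would also need
    # its weight_map entries updated; that step is omitted here.
```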
Have you verified that the saved model, using `save()`, still infers on vLLM?
Yes, this is described in the description of the PR above:
Ran vLLM inference on the LoRA and fine-tuned llama-13b-base models that were saved to separate and to the same directories. Fine tuning got a good "no complaint" inference result; LoRA got poor results after tuning, but a marginal improvement over the base model.
Description of the change
- Adds `save_model_dir` flag where the final checkpoint can be saved to using `trainer.save_model()`
- Adds `save()` to `sft_trainer`
- `output_dir` is reserved for checkpoint saving and training logs. This param is still required to pass in even if no checkpoints are saved using `save_strategy="no"`
- For lm_head removal, removes `lm_head` from `save_model_dir` if it exists, otherwise removes it from the final checkpoint

Related issue number
217
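To illustrate the split described above, here is a minimal sketch, not the PR's actual `save()` implementation: the `save_final_model` name and the paths are assumptions, and it only uses the standard `transformers` Trainer API.

```python
from transformers import Trainer, TrainingArguments

# output_dir keeps checkpoints and training logs; it is still required even
# when checkpointing is disabled with save_strategy="no".
training_args = TrainingArguments(
    output_dir="/tmp/run/checkpoints",
    save_strategy="no",
)

def save_final_model(trainer: Trainer, save_model_dir: str) -> None:
    """Write only the final model (and tokenizer) to save_model_dir."""
    trainer.save_model(save_model_dir)

# Typical flow (model/data setup omitted):
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
# save_final_model(trainer, "/tmp/run/final-model")
```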
How to verify the PR
Tested:

1. `save_strategy="no"` and `save_model_dir` set to a diff path than `output_dir` --> verified saves final model and does not save any checkpoints in `output_dir`, only logs
2. `save_total_limit=2` and `output_dir` set (aka `save_model_dir` not set) --> only checkpoints are saved with logs
3. `save_strategy="no"` and `output_dir==save_model_dir` --> verified that logs and model saved to path
4. `save_strategy="epoch"` and `save_total_limit=2` and `output_dir==save_model_dir` --> checkpoint dirs, model, and training logs are all written to path

`accelerate_launch.py`:

5. `save_total_limit=3` and `save_model_dir==output_dir` --> same as 4, checkpoints, training logs, and model outputted to path
6. `save_strategy="no"` and `save_model_dir==output_dir` --> same as 3, only model and logs outputted to path
7. `save_total_limit=1` and `output_dir` subdir of `save_model_dir` --> `output_dir` with checkpoints and logs inside of `save_model_dir`
8. `save_total_limit=1` and `save_model_dir` subdir of `output_dir` --> `output_dir` has checkpoints, logs, and dir with model

Finally, I also verified that the lm_head removal continued to work as expected:
Ran vLLM inference on the LoRA and fine-tuned llama-13b-base models that were saved to separate and to the same directories. Fine tuning got a good "no complaint" inference result; LoRA got poor results after tuning, but a marginal improvement over the base model.
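A minimal sketch of that kind of smoke test with vLLM's offline API; the model path and prompt are placeholders:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the directory written by trainer.save_model() (save_model_dir).
llm = LLM(model="/path/to/save_model_dir")
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["I am unhappy with my phone bill."], params)
print(outputs[0].outputs[0].text)

# For a LoRA adapter, vLLM would instead need LLM(..., enable_lora=True) plus a
# LoRARequest passed to generate(); that path is not shown here.
```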
Was the PR tested