foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

feat: add save_model_dir flag where final checkpoint saved #291

Closed · anhuong closed this 3 months ago

anhuong commented 3 months ago

Description of the change

Related issue number

#217

How to verify the PR

Tested the following (a sketch of how the resulting directory layout can be checked follows the list):

  1. `save_strategy="no"` and `save_model_dir` set to a different path than `output_dir` --> verified it saves the final model and does not save any checkpoints in `output_dir`, only logs

  2. `save_total_limit=2` and `output_dir` set (aka `save_model_dir` not set) --> only checkpoints are saved, along with logs

  3. `save_strategy="no"` and `output_dir==save_model_dir` --> verified that logs and the model are saved to the path

  4. `save_strategy="epoch"` and `save_total_limit=2` and `output_dir==save_model_dir` --> checkpoint dirs, the model, and training logs are all written to the path

  5. `accelerate_launch.py`:

    • `save_total_limit=3` and `save_model_dir==output_dir` --> same as 4: checkpoints, training logs, and the model written to the path
    • `save_strategy="no"` and `save_model_dir==output_dir` --> same as 3: only the model and logs written to the path
    • `save_total_limit=1` and `output_dir` a subdir of `save_model_dir` --> `output_dir` with checkpoints and logs inside of `save_model_dir`
    • `save_total_limit=1` and `save_model_dir` a subdir of `output_dir` --> `output_dir` has checkpoints, logs, and a dir with the model
  6. `accelerate_launch`: Finally, I also verified that the lm_head removal continued to work as expected:

    • `save_total_limit=1`, `save_model_dir==output_dir`, granite-3b-code-base --> verified that with LoRA and fine-tuning the model was saved to the given path with lm_head removed, but the checkpoint did not have lm_head removed
    • `save_strategy="no"`, `save_model_dir==output_dir`, granite-3b-code-base --> verified lm_head removed and no additional checkpoints saved
    • `save_total_limit=1`, `output_dir` only, granite-3b-code-base (aka no `save_model_dir` given) --> verified lm_head removed from the final checkpoint
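As referenced above, here is a minimal sketch of how the expected layout for a run like scenario 1 can be checked. The paths are illustrative placeholders, and a single-file HF save format is assumed; these are not the exact assertions used in testing:

```python
import os

# Hedged sketch for scenario 1: save_strategy="no" with save_model_dir set to a
# different path than output_dir. Both paths below are hypothetical.
output_dir = "/tmp/out"          # training output_dir (placeholder)
save_model_dir = "/tmp/model"    # save_model_dir (placeholder)

# No intermediate checkpoint-* dirs should appear in output_dir, only logs.
assert not any(d.startswith("checkpoint-") for d in os.listdir(output_dir))

# The final model (config + weights) should land in save_model_dir.
assert os.path.exists(os.path.join(save_model_dir, "config.json"))
```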

Ran vLLM inference on LoRA and fine-tuned llama-13b-base models that were saved both in separate dirs and in the same dir. Fine-tuning gave a good "no complaint" inference result; LoRA gave poor results after tuning, but a marginal improvement over the base model.
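For reference, the inference smoke test was of this shape. This is a minimal sketch assuming vLLM's offline `LLM` API, with a placeholder model path and prompt (the LoRA case would additionally pass the adapter via vLLM's LoRA support):

```python
from vllm import LLM, SamplingParams

# Hedged sketch: load the tuned model from the directory written by
# save_model_dir and run one greedy generation. Path and prompt are placeholders.
llm = LLM(model="/tmp/model")
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["### Input:\nI am frustrated with my bill.\n\n### Response:"], params)
print(outputs[0].outputs[0].text)
```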

Was the PR tested

anhuong commented 3 months ago

Wanted to call out a note from the description: for lm_head removal, the code removes lm_head from the model in `save_model_dir` if it exists, and otherwise removes it from the final checkpoint. Is this the behavior we want, or do we only want to remove lm_head when `save_model_dir` is passed?
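For context, the removal boils down to dropping the `lm_head.weight` tensor from whichever saved artifact gets post-processed. A minimal sketch assuming a single-file safetensors checkpoint (the actual logic in `accelerate_launch.py` differs, e.g. it also has to handle sharded checkpoints):

```python
import os
from safetensors.torch import load_file, save_file

def strip_lm_head(model_dir: str) -> None:
    """Hypothetical helper: drop lm_head from a single-file safetensors save."""
    path = os.path.join(model_dir, "model.safetensors")
    state_dict = load_file(path)
    # Only rewrite the file if an lm_head tensor was actually present.
    if state_dict.pop("lm_head.weight", None) is not None:
        save_file(state_dict, path, metadata={"format": "pt"})
```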

anhuong commented 3 months ago

> Have you verified the saved model, using save(), still infers on vLLM?

Yes, this is covered in the description of the PR above:

> Ran vLLM inference on LoRA and fine-tuned llama-13b-base models that were saved both in separate dirs and in the same dir. Fine-tuning gave a good "no complaint" inference result; LoRA gave poor results after tuning, but a marginal improvement over the base model.