foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

fix: crash when output directory doesn't exist #365

Closed. HarikrishnanBalagopal closed this 1 month ago

HarikrishnanBalagopal commented 1 month ago

Description of the change

Related issue number

How to verify the PR

Was the PR tested

github-actions[bot] commented 1 month ago

Thanks for making a pull request! 😃 One of the maintainers will review and advise on the next steps.

anhuong commented 1 month ago

@HarikrishnanBalagopal You created two PRs with slightly different changes. Should we look at this PR or PR #364? One references model_args.output_dir while the other references training_args.output_dir.

anhuong commented 1 month ago

Also, if you have done any multi-GPU testing, please let us know, as the unit tests run on CPU.

kmehant commented 1 month ago

@anhuong Please review https://github.com/foundation-model-stack/fms-hf-tuning/pull/364 Thank you.

HarikrishnanBalagopal commented 1 month ago

> @HarikrishnanBalagopal You created two PRs with slightly different changes. Should we look at this PR or PR #364? One references model_args.output_dir while the other references training_args.output_dir.

@anhuong This is the change required for the wca branch. PTAL at #364 for the required change in the main branch.

HarikrishnanBalagopal commented 1 month ago

> Also, if you have done any multi-GPU testing, please let us know, as the unit tests run on CPU.

Yes, I have tested the same command with 1, 4, and 8 GPUs multiple times to confirm that the race condition doesn't occur. Each process tries to create the output_dir and carries on if it already exists.
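
For readers following along, here is a minimal sketch of the "create it, and ignore it if it already exists" pattern described above. It is not the actual diff from this PR: the helper names `ensure_output_dir` and `simulate_workers` are hypothetical, and threads stand in for the per-GPU processes. It assumes the standard-library call `os.makedirs(..., exist_ok=True)` is an acceptable way to get this behavior.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def ensure_output_dir(output_dir: str) -> str:
    """Create output_dir (and any missing parents); no-op if it already exists.

    Hypothetical helper for illustration, not code from this repository.
    """
    # exist_ok=True suppresses FileExistsError when another worker created
    # the directory first, so no separate "does it exist?" check is needed.
    # A check-then-create sequence is where the race would come from.
    os.makedirs(output_dir, exist_ok=True)
    return output_dir


def simulate_workers(output_dir: str, num_workers: int = 8) -> list:
    """Call ensure_output_dir concurrently; threads stand in for GPU ranks."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(ensure_output_dir, [output_dir] * num_workers))
```

Calling `simulate_workers(path)` with a path that does not exist yet should leave the directory in place without any worker raising, which mirrors the 1, 4, and 8 GPU runs mentioned above.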