foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0

refactor code to preprocess datasets #259

Closed. Ssukriti closed this 4 months ago

Ssukriti commented 4 months ago

Description of the change

Adds utilities to format datasets into a single sequence. Retains the current functionality in main. Reorganizes code for modularity and to make it easier to add new dataset formats.

Utility functions:

  1. get_data_collator: returns the appropriate data collator
  2. format_dataset: formats a dataset into a single sequence using a template
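
As a rough illustration of the second utility, the sketch below shows one way a template could collapse a record's fields into a single training sequence. It is not the code from this PR: the function names `apply_template` and `format_to_single_sequence`, the `{{field}}` placeholder syntax, the `formatted_text` output field, and the `</s>` end-of-sequence token are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical placeholder syntax: {{field_name}}. Assumed for this sketch,
# not taken from the PR.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def apply_template(example: Dict[str, str], template: str, eos_token: str = "") -> str:
    """Fill every {{field}} placeholder from `example` and append `eos_token`."""

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in example:
            raise KeyError(f"template field '{key}' is missing from the example")
        return str(example[key])

    return PLACEHOLDER.sub(substitute, template) + eos_token


def format_to_single_sequence(
    examples: Iterable[Dict[str, str]],
    template: str,
    output_field: str = "formatted_text",  # hypothetical field name
    eos_token: str = "",
) -> List[Dict[str, str]]:
    """Return one record per example, holding only the single formatted sequence."""
    return [{output_field: apply_template(ex, template, eos_token)} for ex in examples]


# Usage: two-field records become one text field each.
records = [
    {"input": "2 + 2", "output": "4"},
    {"input": "Capital of France?", "output": "Paris"},
]
template = "### Input:\n{{input}}\n\n### Response:\n{{output}}"
formatted = format_to_single_sequence(records, template, eos_token="</s>")
print(formatted[0]["formatted_text"])
```

Raising on a missing field, rather than leaving the placeholder in the output, is a design choice for the sketch: it surfaces a mismatch between template and dataset before training starts.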

Related issue number

How to verify the PR

Unit tests passed

Was the PR tested