foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.
Apache License 2.0
feat: Enable JSON dataset compatibility #297

Closed: willmj closed this 3 months ago

willmj commented 3 months ago

Description of the change

JSON datasets are documented, but loading JSON files has been unreliable because most testing has focused on JSONL files. This PR addresses that.

Related issue

While the JSONL file format works, the JSON file format is currently broken for at least a few of the data formats we support (documented at https://github.com/foundation-model-stack/fms-hf-tuning/tree/main?tab=readme-ov-file#1-json-formats-with-a-single-s[…]-use-for-masking-on-completion ). We have documented JSON, but it doesn't actually work. Investigate why this is broken and fix it.
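
For readers unfamiliar with the difference between the two formats, here is a minimal sketch using only the Python standard library. It is not the code from this PR or from fms-hf-tuning: `load_records` is a hypothetical helper, and it assumes a `.json` file holds a single top-level array of records while a `.jsonl` file holds one JSON object per line.

```python
import json
from pathlib import Path


def load_records(path):
    """Load a list of records from a .json or .jsonl file.

    Hypothetical helper for illustration only; not part of fms-hf-tuning.
    """
    file_path = Path(path)
    text = file_path.read_text(encoding="utf-8")

    if file_path.suffix == ".jsonl":
        # JSONL: each non-empty line is parsed as its own JSON document.
        return [json.loads(line) for line in text.splitlines() if line.strip()]

    # JSON: the whole file is one document. This sketch assumes it is a
    # top-level array of records and rejects anything else.
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array of records")
    return data
```

A loader that parses line by line fails on a typical pretty-printed JSON array, and a loader that parses the whole file at once fails on JSONL with more than one record, which is why both paths need their own tests.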

How to verify the PR

Run the unit tests added in test_sft_trainer.py and test_preprocessing_utils.py. If any new tests need to be added or removed, let me know!

Was the PR tested

anhuong commented 3 months ago

I think the final thing you need to update is the docs. The main place I see is here, since this references only using JSONL, but now this accepts JSON.