
foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.

Apache License 2.0


Data custom collator #260

Closed Ssukriti closed 4 months ago

Ssukriti commented 4 months ago

Description of the change

Adds support for JSON datasets containing input/output fields. The input is masked and the output is left unmasked, and the two are concatenated into a single sequence. This avoids DataCollatorForCompletionOnlyLM, which requires a response template. The response template is very error-prone because it depends on the template text being present in the data.
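The masking described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `build_masked_example` and the toy token ids are hypothetical, and appending the EOS token at the end of the output is an assumption. The label value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying it do not contribute to the loss.

```python
# Label value skipped by PyTorch's CrossEntropyLoss (its default ignore_index).
IGNORE_INDEX = -100


def build_masked_example(input_ids, output_ids, eos_token_id):
    """Concatenate input and output token ids into one sequence.

    Hypothetical helper: input positions are masked in the labels,
    output positions (plus an assumed trailing EOS) are kept.
    """
    full_ids = list(input_ids) + list(output_ids) + [eos_token_id]
    labels = [IGNORE_INDEX] * len(input_ids) + list(output_ids) + [eos_token_id]
    return {
        "input_ids": full_ids,
        "attention_mask": [1] * len(full_ids),
        "labels": labels,
    }


# Toy ids standing in for a tokenized input and output.
example = build_masked_example([5, 6, 7], [8, 9], eos_token_id=2)
```

Because the input/output boundary is known from the field lengths, nothing has to search the text for a response template.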

Refactored https://github.com/foundation-model-stack/fms-hf-tuning/pull/166/files slightly so that it can be extended to support pretokenized datasets that are masked.

Related issue number

How to verify the PR

Train with a JSON dataset containing input/output fields, and leave dataset_text_field and response_template blank.
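For verification, a small dataset in this shape might be written as below. This is a sketch under assumptions: the field names `input` and `output` are inferred from the description, the record contents and the file name `train.jsonl` are made up, and JSONL (one JSON object per line) is used because it is the format reported as working in this thread.

```python
import json
import os
import tempfile

# Made-up records; each one carries only an input and an output field.
records = [
    {"input": "Tweet text: the service was great. Label:", "output": " no complaint"},
    {"input": "Tweet text: my order never arrived. Label:", "output": " complaint"},
]

# Write one JSON object per line (JSONL).
path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to check that it round-trips.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```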

Was the PR tested

Quality test below
Unit tests

[ ] I have added >=1 unit test(s) for every new method I have added.
[ ] I have ensured all unit tests pass

Ssukriti commented 4 months ago

Some quality results using both collators are here: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/621#issuecomment-86629029

Ssukriti commented 4 months ago

@alex-jw-brooks the requested review changes are addressed. I also added the EOS token.

Attached some quality results above.

Remaining TODOs:

Update README
I think the JSON file format is broken (JSONL works); to be checked
Run more quality tests, if time permits, to have a further record

Ssukriti commented 4 months ago

The PR is ready to be reviewed @alex-jw-brooks @anhuong. I have documented only the JSONL format for the new input/output fields and will extend it to JSON in a following PR.

Also, current functionality is retained and we are just adding support for a new data format. I think it is safe to merge, as one quality test looked good. We can continue quality testing before announcing a release.

fabianlim commented 4 months ago

@Ssukriti I think this PR is not accurately named. Shouldn't the changes be to allow custom processing by passing in a Jinja template? Collation usually refers to the step where the examples come out of the sampler and are formed into a single tensor to be passed out of the dataloader.
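To illustrate collation in the sense used above, here is a minimal sketch that pads already-tokenized examples to a common length and stacks them into one batch. The function name `pad_collate` and the pad values are hypothetical, and plain nested lists stand in for the tensor a real collator would return.

```python
def pad_collate(examples, pad_token_id=0, label_pad_id=-100):
    """Pad each example to the longest length in the batch and stack them.

    Hypothetical helper: nested lists stand in for tensors here.
    """
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_pad = max_len - len(ex["input_ids"])
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padding gets 0.
        batch["attention_mask"].append([1] * len(ex["input_ids"]) + [0] * n_pad)
        # Padded label positions are ignored by the loss.
        batch["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return batch


batch = pad_collate([
    {"input_ids": [5, 6, 7, 8, 2], "labels": [-100, -100, -100, 8, 2]},
    {"input_ids": [5, 9, 2], "labels": [-100, 9, 2]},
])
```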