Ssukriti closed this 4 months ago
Some quality results using both collators are available here: https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/621#issuecomment-86629029
@alex-jw-brooks the requested review changes are addressed; I also added the EOS token.
Attached some quality results above.
Remaining TODOs:
The PR is ready to be reviewed @alex-jw-brooks @anhuong. I have documented only the JSONL format for the new input/output data and will extend it to JSON in a following PR.
Current functionality is also retained; we are only adding support for a new data format. I think it is safe to merge, since one quality test looked good. We can continue quality testing before announcing the release.
@Ssukriti I think this PR is not accurately named. Shouldn't the changes be about allowing custom processing by passing in a Jinja template? Collation usually refers to the step where the examples coming out of the sampler are formed into a single tensor to be passed out of the dataloader.
Description of the change
Adds support for JSON datasets containing input/output fields. The input is masked and the output is left unmasked, and the two are concatenated into a single sequence. This avoids `DataCollatorForCompletionOnlyLM`, which requires a response template; the response template is error-prone because it relies on the template being present in the text.
Also refactored https://github.com/foundation-model-stack/fms-hf-tuning/pull/166/files slightly so that it can be extended to support pretokenized datasets that are already masked.
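
To make the masking described above concrete, here is a minimal sketch of how one input/output pair could be turned into a single training sequence. It is not the code from this PR: the function name `build_masked_example`, its parameters, and the optional truncation are hypothetical, and it works on already-tokenized ID lists so that it stays independent of any tokenizer. The only convention it relies on is that a label of `-100` is ignored by the loss (the default `ignore_index` of PyTorch's `CrossEntropyLoss`).

```python
from typing import Dict, List, Optional

# Label value skipped by the loss; -100 is the default ignore_index
# of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def build_masked_example(
    input_ids: List[int],
    output_ids: List[int],
    eos_token_id: int,
    max_length: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Concatenate input and output tokens and mask the input in the labels.

    Hypothetical helper for illustration only.
    """
    # Append EOS to the output so the model learns where to stop.
    output_with_eos = list(output_ids) + [eos_token_id]

    # Single sequence: input tokens followed by output tokens.
    ids = list(input_ids) + output_with_eos

    # Input positions are masked; output positions keep their token IDs.
    labels = [IGNORE_INDEX] * len(input_ids) + output_with_eos

    # Optional right-truncation (an assumption, not taken from the PR).
    if max_length is not None:
        ids = ids[:max_length]
        labels = labels[:max_length]

    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "labels": labels,
    }


example = build_masked_example(input_ids=[5, 6, 7], output_ids=[8, 9], eos_token_id=2)
```

Because the boundary between input and output is known from the dataset fields, no response template has to be searched for in the tokenized text.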
Related issue number
How to verify the PR
Was the PR tested