Closed tleyden closed 5 months ago
Let's make entry identical to other methods, e.g. arcee.
Seems like most users will have a dataset of prompt
and completion
and unaware where they will have multi-turn chatML
Also seems a bit limited to only have first+second pieces...
Will obviously approve this nonetheless if we need it
Let's make entry identical to other methods, e.g. arcee.
Ok good call, I didn't notice the naming pattern!
Seems like most users will have a dataset of prompt and completion and unaware where they will have multi-turn chatML
Yes in that case, they would pass in "prompt_completion" as the csv_format
value (probably need to rename to data_format
) and it would expect flat prompt/completion columns.
I was planning to add that later in a follow-up PR as needed.
Also seems a bit limited to only have first+second pieces...
You mean missing qa + prompt_completion? If you can point to some HF datasets to test with that would be helpful.
This PR adds the ability to upload a hugging face dataset in ChatML format as QA pairs into Arcee.
It only supports single turn rows. If it is a multiturn row, with more than one user + assistant conversation, it will print a warning and discard the row.
For now it only supports ChatML where the ChatML is in the
messages
column.Testing
In a python shell, run this:
Output: