arcee-ai / arcee-python

The Arcee client for executing domain-adpated language model routines https://pypi.org/project/arcee-py/
https://www.arcee.ai
25 stars 5 forks source link

Upload HF dataset in chatml format as qa pairs #40

Closed tleyden closed 5 months ago

tleyden commented 5 months ago

This PR adds the ability to upload a hugging face dataset in ChatML format as QA pairs into Arcee.

It only supports single turn rows. If it is a multiturn row, with more than one user + assistant conversation, it will print a warning and discard the row.

For now it only supports ChatML where the ChatML is in the messages column.

Testing

In a python shell, run this:

import arcee
arcee.upload_hugging_face_dataset_qa_pairs("qa_set_name", "org/dataset", "train", "chatml")

Output:

Uploading 207865 QA pairs
Uploading 1000 QA pairs..
Uploaded 1000 QA pairs..
...
Jacobsolawetz commented 5 months ago

Let's make entry identical to other methods, e.g. arcee.

Seems like most users will have a dataset of prompt and completion and unaware where they will have multi-turn chatML

Also seems a bit limited to only have first+second pieces...

Will obviously approve this nonetheless if we need it

tleyden commented 5 months ago

Let's make entry identical to other methods, e.g. arcee.

Ok good call, I didn't notice the naming pattern!

Seems like most users will have a dataset of prompt and completion and unaware where they will have multi-turn chatML

Yes in that case, they would pass in "prompt_completion" as the csv_format value (probably need to rename to data_format) and it would expect flat prompt/completion columns.

I was planning to add that later in a follow-up PR as needed.

Also seems a bit limited to only have first+second pieces...

You mean missing qa + prompt_completion? If you can point to some HF datasets to test with that would be helpful.