Open JieDengsc opened 10 months ago
Hi there, thanks for your question~
- If you plan to train the model from scratch, yes, you only need to modify the Seed Data and the Unlabelled Data.
- Sorry, I didn't get it. Do you mean whether you should manually check the Seed Data for best quality? Yes, if you are using other datasets. Better seed data quality yields better instruction-following ability.
- You could use LLM's pre-training dataset as the unlabelled data. For example, Wanjuan 1.0, Wudao, and many other corpora.
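To make the unlabelled-data side concrete, here is a minimal sketch of turning a raw pre-training corpus into one-record-per-line JSONL. The field name `content`, the JSONL layout, and the length filter are my assumptions for illustration; check the repo's expected unlabelled-data schema before using this.

```python
import json

def corpus_to_unlabelled(paragraphs, min_chars=200):
    """Keep reasonably long paragraphs and wrap each as a JSONL record.

    Assumption: one JSON object per line with a "content" field; the
    actual field name expected by the training script may differ.
    """
    records = []
    for text in paragraphs:
        text = text.strip()
        if len(text) >= min_chars:
            # ensure_ascii=False keeps Chinese text readable in the file
            records.append(json.dumps({"content": text}, ensure_ascii=False))
    return records

paragraphs = ["short", "x" * 250]
lines = corpus_to_unlabelled(paragraphs)
print(len(lines))  # → 1 (only the long paragraph survives the filter)
```

The same filter can be applied to corpora like Wanjuan 1.0 or Wudao after splitting them into paragraphs.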
Thank you for your reply.
For point 2: I found "instruction_quality" and "response_quality" fields in your "seed.jsonl", but my own SFT data has neither. Does that affect my training?
In addition, I still don't understand "unlabelled_data". For example, if I have 3000 SFT records and want to generate more SFT data, does "unlabelled_data" refer to other SFT data, or to plain text data?
Thank you again
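If the missing quality fields are the only format difference, one possible workaround (a sketch under my own assumptions, not the repo's documented behavior) is to back-fill them with a neutral default before training:

```python
def add_quality_fields(record, default=1.0):
    """Back-fill the quality fields present in seed.jsonl but absent
    from plain SFT data.

    Assumption: the default score of 1.0 is a placeholder I chose;
    the repo does not document what value, if any, is appropriate.
    """
    record.setdefault("instruction_quality", default)
    record.setdefault("response_quality", default)
    return record

rec = {"instruction": "Translate to English", "response": "..."}
print(add_quality_fields(rec))
```

`setdefault` leaves records untouched if they already carry quality scores, so the same pass is safe to run over mixed data.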
Thanks for sharing!
I want to use my Chinese data for "Instruction Backtranslation". Do I need to modify only Seed Data and Unlabelled Data?
In addition, does the "quality" field in the Seed Data need to be annotated by me? Since the Seed Data is SFT data that has been manually checked, I don't know whether your Seed Data underwent any other processing. Also, how should I provide the Unlabelled Data?