Spico197 / Humback

🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.
https://arxiv.org/abs/2308.06259
Apache License 2.0
129 stars, 8 forks

The problem of “Instruction Backtranslation” #7

Open JieDengsc opened 10 months ago

JieDengsc commented 10 months ago

Thanks for sharing!

I want to use my Chinese data for "Instruction Backtranslation". Do I need to modify only Seed Data and Unlabelled Data?

In addition, does "quality" in Seed Data need to be labelled by hand? My Seed Data is SFT data that has already been manually checked, so I do not know whether your Seed Data went through any other processing. Also, how should I provide "Unlabelled Data"?

Spico197 commented 10 months ago

Hi there, thanks for your question~

  1. If you plan to train the model from scratch, yes, you need to modify only Seed and Unlabelled Data.
  2. Sorry, I didn't get it. Do you mean whether we should manually check the Seed Data for best quality? Yes, if you are using other datasets. Better seed data quality yields better instruction-following ability.
  3. You could use an LLM pre-training dataset as the unlabelled data, for example Wanjuan 1.0, Wudao, or one of many other corpora.

JieDengsc commented 10 months ago

Thank you for your reply.

Regarding point 2: I found "instruction_quality" and "response_quality" in your "seed.jsonl", but my own SFT data has neither field. Does that affect my training?

In addition, I still don't understand "unlabelled_data". For example, if I have 3000 SFT records and want to generate more SFT data, does "unlabelled_data" refer to other SFT data, or to plain-text data?

Thank you again

Spico197 commented 10 months ago

  1. It's OK to leave those fields blank or remove them. These columns are only used for filtering the OASST data in the pre-processing stage.
  2. Unlabelled data refers to raw texts. Instruction backtranslation treats these texts as responses and generates instructions for them to build new SFT data.
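
To make the two points above concrete, here is a minimal sketch of how custom data might be prepared. It is not taken from the repository: the helper names are hypothetical, and the use of a `"response"` key for unlabelled records is an assumption, so check the repo's own data files for the exact schema before relying on it. Only the `instruction_quality` and `response_quality` field names come from this thread.

```python
import json

# Optional quality columns mentioned in this thread; they can be dropped.
QUALITY_KEYS = ("instruction_quality", "response_quality")


def strip_quality_fields(record):
    """Return a copy of a seed record without the optional quality columns."""
    return {k: v for k, v in record.items() if k not in QUALITY_KEYS}


def texts_to_unlabelled(raw_text, min_chars=20):
    """Split raw text on blank lines and wrap each passage as a response-only record.

    The "response" key is an assumed field name, used here for illustration.
    Passages shorter than min_chars are skipped.
    """
    passages = (p.strip() for p in raw_text.split("\n\n"))
    return [{"response": p} for p in passages if len(p) >= min_chars]


def to_jsonl(records):
    """Serialise records as JSON Lines, keeping non-ASCII (e.g. Chinese) text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Each unlabelled record carries only a response; the backtranslation model is then expected to generate the matching instruction, which is why no instruction field is written here.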