An unofficial implementation of Self-Alignment with Instruction Backtranslation.
The proposed Humback is a novel framework that augments instruction data for supervised fine-tuning with high-quality, automatically curated pairs.
🚧 Currently, this repo is under construction and not finished.
Procedure (2 iterations):
We follow the original paper and use oasst1 to construct the seed data.
The processed data can be found here.
$ bash data/seed/download.sh
$ python data/seed/convert.py
# #data: 3286, #dump: 3200
# Instruction len: 149±266, Response len: 1184±799
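For reference, seed construction keeps high-quality first-turn pairs from oasst1. Below is a minimal sketch of the idea, assuming top-ranked English assistant replies paired with their parent prompts; the actual logic lives in `data/seed/convert.py`, so treat the filters as illustrative:

```python
# Sketch: filter oasst1 into (instruction, response) seed pairs by keeping
# rank-0 English assistant replies whose parent message is a user prompt.
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1", split="train")
by_id = {row["message_id"]: row for row in ds}

seed = []
for row in ds:
    if row["role"] != "assistant" or row["lang"] != "en" or row["rank"] != 0:
        continue
    parent = by_id.get(row["parent_id"])
    if parent is not None and parent["role"] == "prompter":
        seed.append({"instruction": parent["text"], "response": row["text"]})
```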
Since ClueWeb22 is not freely available, we sample texts from falcon-refinedweb instead.
The processed data can be found here.
$ python data/unlabelled/falcon_refinedweb.py
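A minimal sketch of the sampling idea, streaming the dataset so no full download is needed (the real script is `data/unlabelled/falcon_refinedweb.py`; the sample size and length filter here are assumptions):

```python
# Sketch: stream falcon-refinedweb and keep medium-length passages as the
# unlabelled pool. Thresholds and the sample size are illustrative.
import itertools
from datasets import load_dataset

stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
medium = (row["content"] for row in stream if 1_000 < len(row["content"]) < 8_000)
passages = list(itertools.islice(medium, 500_000))
```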
Item | Value |
---|---|
Foundation Model | meta-llama/Llama-2-7b-hf |
GPUs | 8 * A100 40GB |
Mixed Precision | bf16 |
Gradient Checkpointing | on |
ZeRO-Offload | Stage 2 |
Batch size | 32 |
Steps | 500 |
# The first Myx training takes about 30min (on the seed data)
$ bash scripts/train_backward_Myx.sh
The pre-trained $M_{yx}$ is available at Hugging Face.
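The backward model $M_{yx}$ learns to predict the instruction $x$ from the response $y$, so its training data is just the seed pairs with source and target swapped. A minimal sketch, assuming simple `instruction`/`response` JSONL fields (the actual formatting is handled inside `scripts/train_backward_Myx.sh`):

```python
# Sketch: turn forward seed pairs (x -> y) into backward pairs (y -> x)
# for M_yx training. The JSONL field names are assumptions.
import json

def to_backward(pair: dict) -> dict:
    return {"input": pair["response"], "output": pair["instruction"]}

with open("data/seed/seed.jsonl") as fin, open("backward.jsonl", "w") as fout:
    for line in fin:
        fout.write(json.dumps(to_backward(json.loads(line))) + "\n")
```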
The augmentation data is available at Hugging Face.
# Takes about 6:40:45 on the unlabelled data with 8*A100
$ bash scripts/self_aug.sh
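Self-augmentation feeds each unlabelled passage to $M_{yx}$ and decodes a candidate instruction for it. A minimal single-passage sketch with Hugging Face Transformers, where the model path and prompt template are assumptions (see `scripts/self_aug.sh` for the real batched version):

```python
# Sketch: back-translate one unlabelled passage into a candidate instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("outputs/Myx")  # hypothetical local path
model = AutoModelForCausalLM.from_pretrained("outputs/Myx", device_map="auto")

def backtranslate(passage: str) -> str:
    prompt = f"{passage}\n\nInstruction:"  # illustrative backward template
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```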
Hyperparameters are the same as for $M_{yx}$.
$ bash scripts/train_seed.sh
The pre-trained $M_{0}$ is available at Hugging Face (uploading).
The curated data is available at Hugging Face.
# 33:54:45 with 8*A100 on 482,963 samples
$ bash scripts/self_curation.sh
# scores: [('None', 217203), ('4', 119211), ('3', 102756), ('5', 21301), ('1', 13083), ('2', 9288), ('8', 19), ('0', 15), ('9', 14), ('7', 11), ('6', 9), ('10', 4), ('91', 3), ('83', 2), ('20', 2), ('14', 2), ('75', 2), ('92', 2), ('72', 1), ('93', 1), ('28', 1), ('19', 1), ('728', 1), ('17', 1), ('16', 1), ('100', 1), ('237', 1), ('13', 1), ('73', 1), ('38', 1), ('87', 1), ('94', 1), ('98', 1), ('64', 1), ('52', 1), ('27', 1), ('24', 1), ('762', 1), ('266', 1), ('225', 1), ('80', 1), ('267', 1), ('99', 1), ('90', 1), ('63', 1), ('97', 1), ('78', 1), ('40', 1), ('1986', 1), ('47', 1), ('66', 1), ('45', 1), ('10502', 1), ('21', 1)]
# Number of qualified results (scores=5): 21301/482963
# instruction len: 198 ± 351
# response len: 1601 ± 345
# ---------------------------------------
# v2 (Strict Curation Score Matching: add `$` to the matching regex):
# Scores: [('None', 322324), ('3', 71851), ('4', 53120), ('5', 16460), ('1', 11921), ('2', 7260), ('0', 10), ('7', 4), ('6', 3), ('19', 1), ('8', 1), ('16', 1), ('13', 1), ('10', 1), ('23', 1), ('9', 1), ('90', 1), ('92', 1), ('45', 1)]
# Number of qualified results (scores=5): 15521/482963
# instruction len: 124 ± 113
# response len: 1611 ± 345
# ---------------------------------------
$ cat outputs/m1/unlabelled_curated_data.jsonl data/seed/seed.jsonl > data/curated/m1.jsonl
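The score distributions above come from regex-parsing the rating in the curation model's completion; v2 anchors the pattern with `$`, so stray digits elsewhere in the completion can no longer produce bogus scores like `728` or `10502`. A minimal sketch of the two variants (the exact prompt and pattern live in the self-curation script, so both regexes here are illustrative):

```python
# Sketch: loose vs. strict parsing of a curation rating like "score: 5".
import re

LOOSE = re.compile(r"score:\s*(\d+)", re.IGNORECASE)
# v2: anchor at end of line so only a bare trailing number counts as the score.
STRICT = re.compile(r"score:\s*(\d+)\s*$", re.IGNORECASE | re.MULTILINE)

def parse_score(completion: str, strict: bool = True) -> str | None:
    m = (STRICT if strict else LOOSE).search(completion)
    return m.group(1) if m else None
```

Only pairs parsed as score 5 are kept, then merged with the seed data via the `cat` command above.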
Most hyperparameters are the same as for $M_{yx}$, except for the number of steps (the original Humback trains for 1600 steps on 512k samples).
# change the `--data_path` in `scripts/train_seed.sh`
$ bash scripts/train_seed.sh
Results for other models: HuggingFaceH4/open_llm_leaderboard.
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA |
---|---|---|---|---|---|
Llama-2-7b | 54.32 | 53.07 | 78.59 | 46.87 | 38.76 |
Llama-2-7b-chat | 56.34 | 52.90 | 78.55 | 48.32 | 45.57 |
Vicuna-7b-v1.3 | 55.62 | 50.43 | 76.92 | 48.14 | 47.01 |
Humback $M_{0}$ | 58.13 | 56.31 | 81.20 | 47.45 | 47.59 |
Humback $M_{1}$ | 54.65 | 52.99 | 78.57 | 45.48 | 41.54 |
Humback $M_{1,\text{w/o DiffSysPrompt,TemplateVicuna1.1}}$ | 55.85 | 52.82 | 78.53 | 45.86 | 46.21 |
Humback $M_{1,\text{w/o DiffSysPrompt,TemplateVicuna1.1,StrictCurationScoreMatching}}$ | 54.26 | 53.50 | 78.52 | 45.19 | 39.83 |
Humback $M_{1,\text{w/o DiffSysPrompt,TemplateVicuna1.1,StrictCurationScoreMatching,1200steps}}$ | 56.67 | 56.23 | 81.10 | 46.46 | 42.89 |
Humback $M_{1,\text{w/o DiffSysPrompt,TemplateVicuna1.1,StrictCurationScoreMatching,1800steps}}$ | 57.58 | 57.68 | 81.78 | 46.13 | 44.74 |
Humback $M_{1,\text{w/o DiffSysPrompt,TemplateVicuna1.1,StrictCurationScoreMatching,2400steps}}$ | 56.96 | 55.89 | 80.83 | 45.84 | 45.30 |
The results and the trend are not as good as in the original paper, but $M_{0}$ still outperforms vanilla Llama-2-7b. Specifically, Humback $M_{1}$ is worse than $M_{0}$, and the different system prompts do not seem helpful on these benchmarks. Note that although $M_{0}$ scores well here, it may not be good at generating high-quality, diverse responses on a wider range of tasks; further experiments (e.g. alpaca_eval with GPT-4 as the judge) are needed to verify the effectiveness of the reproduced Humback $M_{0}$.
Possible reasons are:
Since I don't have GPT-4 API keys, `chatgpt_fn` is used as the evaluator here (as introduced in alpaca_eval):
Model | win_rate | standard_error | n_total | avg_length |
---|---|---|---|---|
gpt4 | 73.79 | 1.54 | 805 | 1365 |
claude | 70.37 | 1.60 | 805 | 1082 |
chatgpt | 66.09 | 1.66 | 805 | 811 |
wizardlm-13b | 65.16 | 1.67 | 805 | 985 |
vicuna-13b | 64.10 | 1.69 | 805 | 1037 |
guanaco-65b | 62.36 | 1.71 | 805 | 1249 |
oasst-rlhf-llama-33b | 62.05 | 1.71 | 805 | 1079 |
alpaca-farm-ppo-human | 60.25 | 1.72 | 805 | 803 |
falcon-40b-instruct | 56.52 | 1.74 | 805 | 662 |
text_davinci_003 | 50.00 | 0.00 | 805 | 307 |
alpaca-7b | 45.22 | 1.74 | 805 | 396 |
HumbackM0 | 32.30 | 1.65 | 805 | 548 |
text_davinci_001 | 28.07 | 1.56 | 805 | 296 |
HumbackM1 | 23.35 | 1.49 | 805 | 1522 |
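To reproduce this evaluation, the generations can be scored through the alpaca_eval CLI; a minimal sketch, where the outputs path is a hypothetical placeholder:

```python
# Sketch: score a file of model generations with alpaca_eval's chatgpt_fn
# annotator (pip install alpaca-eval; needs an OpenAI API key in the env).
import subprocess

subprocess.run(
    [
        "alpaca_eval",
        "--model_outputs", "outputs/m0_alpaca_eval.json",  # hypothetical path
        "--annotators_config", "chatgpt_fn",
    ],
    check=True,
)
```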
🔥 Further discussion is very welcome.
@misc{li2023selfalignment,
title={Self-Alignment with Instruction Backtranslation},
author={Xian Li and Ping Yu and Chunting Zhou and Timo Schick and Luke Zettlemoyer and Omer Levy and Jason Weston and Mike Lewis},
year={2023},
eprint={2308.06259},
archivePrefix={arXiv},
primaryClass={cs.CL}
}