Jason Wei∗, Maarten Bosma∗, Vincent Y. Zhao∗, Kelvin Guu∗, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le
Google Research
Abstract
This paper explores a simple method for improving the zero-shot learning abilities of language models.
We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
Introduction
We take a pretrained language model of 137B parameters and perform instruction tuning—finetuning the model on a mixture of more than 60 NLP datasets expressed via natural language instructions. We refer to this resulting model as FLAN, for Finetuned Language Net.
FLAN’s zero-shot also outperforms 175B-parameter GPT-3’s zero-shot on 20 of 25 datasets that we evaluate, and even outperforms GPT-3’s few-shot by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.
The motivation of instruction tuning is to improve the ability of language models to respond to NLP instructions. The idea is that by using supervision to teach an LM to perform tasks described via instructions, the LM will learn to follow instructions and do so even for unseen tasks.
To evaluate performance on unseen tasks, we group datasets into clusters by task type and hold out each task cluster for evaluation while instruction tuning on all remaining clusters.
TASKS & TEMPLATES
We aggregate 62 text datasets that are publicly available on TensorFlow Datasets, including both language understanding and language generation tasks, into a single mixture.
While most of the ten templates describe the original task, to increase diversity, for each dataset we also include up to three templates that "turn the task around" (e.g., for sentiment classification we include templates asking to generate a movie review).
You could ask "what is the sentiment of this movie review?", but by also asking "generate a movie review matching this sentiment!" they train generation ability too! (Feels like getting double use out of the templates.)
We then instruction tune a pretrained language model on the mixture of all datasets, with examples in each dataset formatted via a randomly selected instruction template for that dataset.
So the template is selected at random.
Q) Is there no analysis of which templates work better than others?
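A rough sketch of what this per-dataset template sampling might look like; the template strings here are hypothetical, not the paper's actual templates:

```python
import random

# Hypothetical templates for a sentiment dataset. Most describe the original
# task; up to three per dataset "turn the task around", e.g. generating a
# review from a sentiment label instead of classifying one.
SENTIMENT_TEMPLATES = [
    "Movie review: {review}\nIs this review positive or negative?",
    "What is the sentiment of the following review?\n{review}",
    "Generate a movie review that is {label}.",  # reversed task
]

def format_example(example: dict) -> str:
    # Each training example is rendered with a template selected at random
    # from its dataset's template set.
    template = random.choice(SENTIMENT_TEMPLATES)
    return template.format(**example)

print(format_example({"review": "A delightful, surprising film.", "label": "positive"}))
```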
EVALUATION SPLITS
To evaluate zero-shot FLAN on c task clusters, we instruction tune c models, where each model holds out a different task cluster for evaluation.
To evaluate on c task clusters, instruction tune c models, each trained on every cluster except one held-out cluster, then evaluate on the held-out one.
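A sketch of this leave-one-cluster-out protocol; the cluster names are illustrative:

```python
# Leave-one-cluster-out evaluation: to evaluate zero-shot performance on
# c task clusters, instruction tune c models, each holding out a different
# cluster for evaluation.
clusters = ["nli", "reading_comprehension", "closed_book_qa", "translation",
            "commonsense", "coreference", "sentiment"]

for held_out in clusters:
    train_clusters = [c for c in clusters if c != held_out]
    # instruction tune one model on every dataset in train_clusters,
    # then evaluate it zero-shot on the datasets in held_out
    print(f"model for {held_out!r}: tuned on {train_clusters}")
```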
CLASSIFICATION WITH OPTIONS
For classification tasks, prior work (Brown et al., 2020) used a rank classification approach where, for example, only two outputs ("yes" and "no") are considered and the higher probability one is taken as the model's prediction.
Because of the distribution over answers! It looks logically sound, but it's imperfect!
While logically sound, this is imperfect in that the probability mass for answers may have an undesired distribution among ways of saying each answer (e.g., many alternative ways of saying "yes" may lower the probability mass assigned to "yes").
An OPTIONS suffix is added.
Therefore, we include an options suffix, in which we append the token OPTIONS to the end of a classification task along with a list of the output classes for that task.
This makes the model aware of which choices are desired when responding to classification tasks.
So they use a trick to stop the distribution problem from producing weird answers! Plausible.
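A minimal sketch of how such a suffix might be appended; the exact rendering of the class list (one class per line, dashes) is an assumption:

```python
def add_options_suffix(prompt: str, classes: list[str]) -> str:
    # Append the OPTIONS token plus the task's output classes so the model
    # knows which answers are desired for this classification task.
    options = "\n".join(f"- {c}" for c in classes)
    return f"{prompt}\nOPTIONS:\n{options}"

print(add_options_suffix(
    "Premise: The dog runs. Hypothesis: An animal is moving. Entailment?",
    ["yes", "it is not possible to tell", "no"],
))
```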
TRAINING DETAILS
Model architecture and pretraining
We use LaMDA-PT, a dense left-to-right, decoder-only transformer language model of 137B parameters.
It is pretrained on a collection of web documents (including those with computer code), dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32k vocabulary using the SentencePiece library.
Around 10% of the pretraining data was non-English. Note that LaMDA-PT only has language model pretraining (c.f. LaMDA, which was finetuned for dialog).
LaMDA-PT is just the LM; LaMDA is the dialog model! With ~10% non-English data, maybe we should include some too?
Instruction tuning procedure
FLAN is the instruction-tuned version of LaMDA-PT. So FLAN itself was built on a really strong backbone model..?!
Our instruction tuning pipeline mixes all datasets and randomly samples from each dataset.
To balance the different sizes of datasets, we limit the number of training examples per dataset to 30k and follow the examples-proportional mixing scheme (Raffel et al., 2020) with a mixing rate maximum of 3k.
Q) I was curious about this. It's not that the number of examples within a dataset is capped at 30k... rather it's the sampling weight that gets capped. Let's look at this.
In this mixing scheme, a mixing rate maximum of 3,000 means that a dataset does not receive additional sampling weight for examples in excess of 3,000.
Still not fully clear to me. Does it sample with duplicates? Or does it just adjust the weights appropriately..?
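A small sketch of what this mixing scheme computes, assuming made-up dataset sizes. This also answers the question above: under examples-proportional mixing, a dataset's sampling probability is proportional to min(e_d, K), so the cap limits sampling weight rather than deduplicating or truncating data:

```python
# Examples-proportional mixing with a mixing-rate maximum K (Raffel et al., 2020):
# dataset d with e_d examples is sampled with probability
#   min(e_d, K) / sum over d' of min(e_d', K).
# A dataset is not cut down to K examples; it just stops gaining sampling
# weight beyond K. Dataset sizes below are made up for illustration.
K = 3_000  # mixing rate maximum

dataset_sizes = {"anli_r1": 17_000, "copa": 400, "wmt14_enfr": 30_000}  # after the 30k cap

weights = {name: min(size, K) for name, size in dataset_sizes.items()}
total = sum(weights.values())
mixing_probs = {name: w / total for name, w in weights.items()}
print(mixing_probs)  # anli_r1 ~0.469, copa 0.0625, wmt14_enfr ~0.469
```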
hparams
We finetune all models for 30k gradient steps with a batch size of 8,192 tokens using the Adafactor optimizer (Shazeer & Stern, 2018) with a learning rate of 3e-5.
Q) Interesting that the batch size is given in tokens. So are they tuning on the instruction-reformatted datasets the same way you'd tune an LM?
The input and target sequence lengths used in finetuning are 1024 and 256, respectively.
We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using a special EOS token.
The EOS token is used to separate input from target! Remember this.
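A minimal sketch of packing under these assumptions; the token ids are toy values, and real pipelines also mask attention across example boundaries:

```python
# Packing (Raffel et al., 2020): concatenate several tokenized
# (input, target) examples into sequences of at most max_len tokens,
# with a special EOS token separating inputs from targets.
EOS = 1                # hypothetical EOS token id
MAX_LEN = 1024 + 256   # finetuning input length + target length

def pack(examples, max_len=MAX_LEN, eos=EOS):
    """examples: iterable of (input_ids, target_ids) pairs of token-id lists."""
    packed, current = [], []
    for inp, tgt in examples:
        seq = inp + [eos] + tgt + [eos]
        if current and len(current) + len(seq) > max_len:
            packed.append(current)  # current sequence is full; start a new one
            current = []
        current.extend(seq)
    if current:
        packed.append(current)
    return packed

print(pack([([5, 6, 7], [8, 9]), ([10, 11], [12])], max_len=16))
```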
This instruction tuning takes around 60 hours on a TPUv3 with 128 cores.
60 hours; that took quite a while..?!
We use the final checkpoint trained for 30k steps.
RESULTS
Overall, we observe that instruction tuning is very effective on tasks naturally verbalized as instructions (e.g., NLI, QA, translation, struct-to-text) and is less effective on tasks directly formulated as language modeling, where instructions would be largely redundant (e.g., commonsense reasoning and coreference resolution tasks that are formatted as finishing an incomplete sentence or paragraph).
So tasks where instructions are redundant don't benefit much.
Looking at the results, models like BERT and T5 still do quite well on those.
ABLATION STUDIES & FURTHER ANALYSIS
NUMBER OF INSTRUCTION TUNING CLUSTERS
This analysis is nice.
We examine how performance is affected by the number of clusters and tasks used in instruction tuning.
Evaluation setup
We hold out NLI, closed-book QA, and commonsense reasoning as evaluation clusters.
Results
We show results for one to seven instruction tuning clusters, where clusters are added in decreasing order of number of tasks per cluster. Performance on the held-out clusters improves as more clusters are added, implying that performance may further improve with even more clusters added to instruction tuning.
This seems a bit different from the T0 results; I should double check.
SCALING LAWS
The interpretation: small models have low capacity, so instruction tuning uses all of that capacity on learning the tuning tasks themselves, which is why they come out lower on unseen tasks.
ROLE OF INSTRUCTIONS
Check whether the model's ability comes from the instructions or was already there.
We explore the role of instructions during finetuning, as one possibility is that performance gains come entirely from multi-task finetuning and the model could perform just as well without instructions.
We hence consider two finetuning setups without instructions.
[1] In a no template setup, only inputs and outputs were given to the model (e.g., for translation the input would be “The dog runs.” and the output would be “Le chien court.”).
[2] In a dataset name setup, each input is prepended with the name of the task and dataset (e.g., for translation to French, the input would be “[Translation: WMT’14 to French] The dog runs.”).
We compare these two ablations to FLAN's finetuning procedure, which used natural instructions (e.g., "Please translate this sentence to French: 'The dog runs.'").
FLAN performs best, but prepending the dataset name also does better than I expected?
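The three finetuning formats side by side, using the paper's translation example (the Python rendering is mine):

```python
# The three finetuning formats compared in the ablation, for one example.
source, target = "The dog runs.", "Le chien court."

setups = {
    "no_template":  source,                                       # [1] raw input -> output
    "dataset_name": f"[Translation: WMT'14 to French] {source}",  # [2] task/dataset name prefix
    "flan":         f"Please translate this sentence to French: '{source}'",
}

for name, model_input in setups.items():
    print(f"{name}: {model_input!r} -> {target!r}")
```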
INSTRUCTIONS WITH FEW-SHOT EXEMPLARS
Adding few-shot exemplars improves things a bit further.
INSTRUCTION TUNING FACILITATES PROMPT TUNING
Prompt tuning results are also better, but isn't that just because the model has already seen the task once? Ah, reading further, no; so it really is doing well.
DISCUSSION
Our instruction-tuned model, FLAN, improves performance over the untuned model and surpasses zero-shot GPT-3 on the majority of tasks we evaluate.
Ablation studies reveal that performance on unseen tasks improves with the number of instruction tuning task clusters, and, interestingly, that performance improvements from instruction tuning emerge only with sufficient model scale.
Moreover, instruction tuning can be combined with other prompting methods such as few-shot prompting and prompt tuning.
CONCLUSIONS
This paper has explored a simple method for improving the ability of language models at scale to perform zero-shot tasks based purely on instructions. Our instruction-tuned model, FLAN, compares favorably against GPT-3 and signals the potential ability for language models at scale to follow instructions.
Appendix
Hparams for multiple FLAN outputs
Multiple FLAN outputs are generated via random sampling with a temperature of 0.9 and top k of 40.
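A small sketch of temperature plus top-k sampling with those values (numpy-based; not the authors' decoding code):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.9, top_k: int = 40) -> int:
    # Scale logits by the temperature, keep only the top-k, renormalize, sample.
    scaled = logits / temperature
    top_idx = np.argsort(scaled)[-top_k:]          # indices of the k largest logits
    top_logits = scaled[top_idx]
    probs = np.exp(top_logits - top_logits.max())  # stable softmax over the top-k
    probs /= probs.sum()
    return int(np.random.choice(top_idx, p=probs))

vocab_logits = np.random.randn(32_000)  # toy logits over a 32k vocabulary
print(sample_next_token(vocab_logits))
```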
examples
The templates usually end with one of several commands(?) like change, recommend, generate, suggest, make up, or answer in [language], so the example can be solved that way.