Q) Can zero-shot generalization instead be directly induced by explicit multitask learning?
we develop a system for easily mapping any natural language task into a human-readable prompted form.
convert a large set of supervised datasets, each with multiple prompts with diverse wording
fine-tune a pre-trained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multi-task mixture covering a wide variety of tasks
The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16× its size.
Introduction
In this paper, we focus on explicitly training language models in a supervised and massively multi-task fashion
Our approach uses a training mixture consisting of a large set of different tasks specified in natural language prompts
Our goal is to induce a model to better generalize to held-out tasks without requiring massive scale, as well as being more robust to the wording choices of the prompts
To convert a large set of natural language tasks into prompted form, we use a simple templating language for structured datasets.
develop an interface for collecting prompts from public contributors, which yielded a large multitask mixture with multiple prompts per dataset
train a variant of the T5 encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on a subset of the tasks (each with multiple datasets) and then evaluate tasks and prompts that the model was not trained on.
Our experiments study two questions.
First, does multitask prompted training improve generalization to held-out tasks?
find that multitask training enables zero-shot task generalization: the model matches or exceeds the performance of GPT-3 on held-out tasks
Second, does training on a wider range of prompts improve robustness to prompt wording?
find that training on more prompts per dataset consistently improves the median and decreases the variability of performance on held-out tasks
MEASURING GENERALIZATION TO HELD-OUT TASKS
This yields 12 tasks and 62 datasets with publicly contributed prompts in our training and evaluation mixtures (Figure 2) as of writing
To test zero-shot, evaluate on tasks the model was never trained on
To test zero-shot generalization, we hold out all constituent datasets of four tasks: natural language inference (NLI), coreference resolution, sentence completion, and word sense disambiguation
We choose NLI as a held-out task because humans also zero-shot generalize to it: most humans are never explicitly trained to classify whether a premise sentence entails or contradicts a hypothesis sentence, yet they find the task intuitive without training
we also hold out coreference resolution and word sense disambiguation. We further hold out sentence completion because it may be too similar to NLI (Appendix D.2 discusses this in detail)
Lastly, we further evaluate on a subset of the datasets from BIG-bench
A UNIFIED PROMPT FORMAT
To facilitate writing a large collection of prompts, we develop a templating language and an application that make it easy to convert diverse datasets into prompts
define a prompt as consisting of an input template and a target template, along with a collection of associated meta-data
The templates are functions mapping a data example into natural language for the input and target sequences
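For illustration, here is a minimal sketch of an input/target template pair for an NLI-style example, written as Jinja templates (the field names and wording are made up; P3's real templates also carry metadata such as answer choices):

```python
from jinja2 import Template

# A prompt pairs an input template with a target template.
# Both are functions from a raw data example to natural-language text.
input_template = Template(
    "{{ premise }}\nQuestion: Does the previous passage imply that "
    '"{{ hypothesis }}"? Yes, no, or maybe?'
)
target_template = Template("{{ ['Yes', 'Maybe', 'No'][label] }}")

example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "A musician is performing.",
    "label": 0,  # 0 = entailment in many NLI datasets
}

print(input_template.render(**example))   # natural-language input sequence
print(target_template.render(**example))  # -> "Yes"
```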
36 contributors affiliated with 24 institutions in 8 countries participated. Since our goal was to train a model to be robust to prompt format, and since the question of what makes a prompt effective remains unresolved (Webson and Pavlick, 2021; Logan et al., 2021; Zhao et al., 2021), we encouraged contributors to be open in their style and create a diverse set of prompts. The main annotation guideline was that prompts needed to be grammatical and understandable by a fluent English speaker with no prior experience of the tasks. Additionally, prompts that required explicit counting or numerical indexing were removed in favor of natural language variants. For example, instead of predicting indices of a span extracting answers from a passage, the model is expected to copy the span’s text instead. With these minimal constraints, prompt writers were encouraged to use both formal and creative prompts and various orderings of the data.
We collected prompts for English datasets, excluding ones that included potentially harmful content or non-natural language such as programming languages. We refer to this collection as the Public Pool of Prompts (P3). As of writing, P3 contains 2073 prompts for 177 datasets (11.7 prompts per dataset on average)
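The P3 prompts are distributed through the promptsource library; a usage sketch along the lines of its documentation (the dataset choice and the exact API names are from memory and may differ by version):

```python
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

# One example from a dataset that has contributed prompts.
example = load_dataset("ag_news", split="train")[0]

# Fetch the publicly contributed prompts for this dataset.
ag_news_prompts = DatasetTemplates("ag_news")
print(ag_news_prompts.all_template_names)  # names of all prompts for ag_news

# Applying a template renders the (input, target) pair for the example.
prompt = ag_news_prompts[ag_news_prompts.all_template_names[0]]
result = prompt.apply(example)
print("INPUT: ", result[0])
print("TARGET:", result[1])
```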
EXPERIMENTAL SETUP
Model
fine-tune a pretrained model on our multi-task training mixture of natural language prompted datasets
Our model uses an encoder-decoder architecture with input text fed to the encoder and target text produced by the decoder
we use Lester et al. (2021)’s LM-adapted T5 model (referred to as T5+LM), produced by training T5 on 100B additional tokens from C4 on a standard language modeling objective.
Training
Our main model, T0, is trained on the multitask mixture detailed in Section 3 and Table 5.
T0 variants are all initialized from the 11B-parameter version of T5+LM
Meanwhile, T0+ is the same model with identical hyperparameters except trained on a mixture that adds GPT-3’s evaluation datasets.
Lastly, T0++ further adds SuperGLUE (Wang et al., 2019a) to the training mixture (except RTE and CB), which leaves NLI and the BIG-bench tasks as the only held-out tasks.
We assemble our multitask training mixture by combining and shuffling all examples from all training datasets.
This is equivalent to sampling from each dataset in proportion to the number of examples in the dataset.
However, the number of examples in each of our training datasets varies by two orders of magnitude. We therefore follow the strategy used in Raffel et al. (2020) and, for sampling purposes, treat any dataset with over 500,000 examples as having 500,000 / num_templates examples, where num_templates is the number of templates created for that dataset.
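A minimal sketch of that capping rule and the resulting proportional sampling (the dataset names and sizes below are made up):

```python
import random

# Hypothetical dataset sizes: name -> (num_examples, num_templates).
datasets = {
    "small_qa":   (8_000, 7),
    "medium_nli": (400_000, 10),
    "huge_web":   (5_000_000, 12),
}

CAP = 500_000

def effective_size(num_examples: int, num_templates: int) -> float:
    """Size used for proportional sampling: datasets with more than 500,000
    examples are treated as having 500,000 / num_templates examples."""
    if num_examples > CAP:
        return CAP / num_templates
    return num_examples

weights = {name: effective_size(n, t) for name, (n, t) in datasets.items()}
total = sum(weights.values())

# Pick the dataset that contributes the next training example.
names = list(weights)
next_dataset = random.choices(names, weights=[weights[n] for n in names], k=1)[0]
print({n: round(weights[n] / total, 3) for n in names}, next_dataset)
```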
Hparams
truncate input and target sequences to 1024 and 256 tokens
use packing to combine multiple training examples into a single sequence to reach the maximum sequence length (see the sketch after this list)
(Wait, really? Did I understand that right? Let me double-check: page 6 of the paper!)
use a batch size of 1024 sequences (corresponding to 2^20 total input tokens per batch)
Adafactor optimizer
use a learning rate of 1e-3 and a dropout rate of 0.1.
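A minimal sketch of what packing does (greedy concatenation of tokenized examples into full-length sequences; illustrative only, not T5's exact packing implementation):

```python
from typing import List

MAX_LEN = 1024  # maximum input sequence length used for T0

def pack(tokenized_examples: List[List[int]], max_len: int = MAX_LEN) -> List[List[int]]:
    """Greedily concatenate tokenized examples into sequences of at most
    max_len tokens, so each training sequence is (nearly) full."""
    packed, current = [], []
    for tokens in tokenized_examples:
        tokens = tokens[:max_len]  # individual examples are truncated first
        if len(current) + len(tokens) > max_len:
            packed.append(current)
            current = []
        current.extend(tokens)
    if current:
        packed.append(current)
    return packed

# Toy usage with max_len=8:
print(pack([[1, 2, 3], [4, 5, 6, 7], [8, 9]], max_len=8))
# -> [[1, 2, 3, 4, 5, 6, 7], [8, 9]]
```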
Evaluation
evaluate zero-shot generalization on 11 datasets in 4 held-out traditional NLP tasks: natural language inference, coreference, word sense disambiguation, and sentence completion, as well as 14 novel tasks from BIG-bench (§3)
For tasks that involve choosing the correct completion from several options (e.g. multiple choice question answering)
use rank classification to evaluate our model: we compute the log-likelihood of each of the target options under the fine-tuned model and select the option with the highest log-likelihood as the prediction
i.e., they pick the option with the highest log-likelihood among the choices (not by concatenating everything together -> we actually do concatenate in our own setup, but that's not what this part is about!)
For simplicity, we do not apply length normalization to the log-likelihoods of the target options.
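A hedged sketch of rank classification with the transformers library; the checkpoint name and the NLI-style prompt below are assumptions for illustration (T0 checkpoints such as bigscience/T0_3B are published on the Hugging Face Hub, and any T5-style seq2seq model loads the same way):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/T0_3B"  # assumed checkpoint; swap in any seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

prompt = ("A man is playing a guitar on stage.\n"
          'Question: Does the previous passage imply that "A musician is performing."? '
          "Yes, no, or maybe?")
options = ["Yes", "Maybe", "No"]

@torch.no_grad()
def option_log_likelihood(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` given `prompt` (no length normalization)."""
    enc = tokenizer(prompt, return_tensors="pt")
    labels = tokenizer(option, return_tensors="pt").input_ids
    logits = model(**enc, labels=labels).logits            # (1, target_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    token_scores = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return token_scores.sum().item()

scores = {opt: option_log_likelihood(prompt, opt) for opt in options}
prediction = max(scores, key=scores.get)  # rank classification: highest log-likelihood wins
print(scores, prediction)
```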
We do not perform prompt selection by comparing the performance of different prompts on the validation split
Perez et al. (2021) highlights how such a strategy leaks information from the evaluation splits, which makes the evaluation not “true” zero-shot.
In other words, selecting prompts would be cheating and no longer a true zero-shot evaluation
For a given dataset, we report the median performance across all prompts for this dataset along with their interquartile range (Q3 - Q1) to measure the model’s robustness to the wording of the prompts.
They report the median so that the spread across prompts reflects robustness to wording
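For concreteness, the reported statistics per dataset reduce to the following (the per-prompt accuracies are made up):

```python
import numpy as np

# Made-up accuracies of one model across all prompts for a single held-out dataset.
acc_per_prompt = np.array([0.52, 0.61, 0.58, 0.47, 0.63, 0.55])

median = np.median(acc_per_prompt)
q1, q3 = np.percentile(acc_per_prompt, [25, 75])
iqr = q3 - q1  # spread across prompt wordings, used as the robustness measure

print(f"median={median:.3f}, IQR={iqr:.3f}")
```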
RESULTS
GENERALIZATION TO HELD-OUT TASKS
PROMPT ROBUSTNESS
Our second research question is whether training on a wider range of prompts improves robustness to the wording of the prompts
We conduct two ablation experiments on the effects of the average number of prompts per dataset (p) and the number of datasets (d) used during training.
i.e., comparing performance as the average number of prompts per dataset and the number of datasets contributing prompts are varied!
Effect of More Prompts per Dataset
we fix d and compare T0 to models with a varying number of prompts per dataset.
T0 was trained on some prompts that do not map onto the dataset’s original task, for example “given an answer, generate a plausible question”. Including these prompts results in p being 8.03 on average (which corresponds to our main T0 model).
We compare T0 to models where p = 1 (one randomly chosen original-task prompt per dataset), p = 5.7 on average (all original-task prompts for all datasets), and p = 0 (corresponding to T5+LM without any prompted training)
More prompts per dataset is better!
Effect of Prompts from More Datasets
we fix p = all available prompts and increase d from 39 to 49 to 55 (T0, T0+, T0++, respectively. See Section 5 for details.)
Figure 7 shows that the median performance of all 5 held-out datasets increases as d increases from 39 to 49. However, the spread only decreases for 1 out of 5 datasets. For some datasets (e.g., ANLI), this is an artifact of the fact that some prompts always perform poorly, so that when other prompts improve, the spread is stretched larger
Some prompts may just be bad, but overall things improve as the count grows?
Yet in some cases that doesn't hold
Although further investigation is needed, it appears that increasing d does not consistently make the model more robust to the wording of prompts.
Adding more datasets does not necessarily increase robustness to wording!
Increasing the number of prompts helps, but since adding datasets doesn't clearly improve robustness, how far to scale the number of training datasets remains an open question
Comparing T0's and GPT-3 (davinci)'s robustness
T0 could be more robust to prompt formulation than GPT-3.
DISCUSSION
Concurrent to our work, Wei et al. (2021) proposes FLAN, which shares largely the same method of enabling zero-shot generalization through multitask prompted training.
With a mixture of datasets similar to ours, they train multiple decoder-only language models, each with a single held-out task
Similar concurrent work exists (FLAN)
CONCLUSION
demonstrate that multitask prompted training can enable strong zero-shot generalization abilities in language models
This approach provides an effective alternative to unsupervised language model pretraining, often enabling our T0 model to outperform models many times its size
Note
Author: V. Sanh (Hugging Face), 2021