This repository contains code accompanying our preprint paper "CrossFit :weight_lifting:: A Few-shot Learning Challenge for Cross-task Generalization in NLP" (Paper).
CrossFit :weight_lifting: is a task setup that aims to build few-shot learners that generalize across diverse NLP tasks. For example, we explore whether models trained on non-classification tasks become good few-shot learners for classification tasks, and whether models trained on non-MRC QA tasks become good few-shot learners for MRC QA tasks.
NLP Few-shot Gym :sweat_drops: is a repository of 160 different NLP tasks that we gathered from existing open-access datasets. We manually created a two-level task ontology to analyze cross-task generalization in different settings.
[:memo: Update 2022-04-15] The code is based on old versions of transformers and torch. If you need to develop on top of newer versions, please refer to this issue.
[:memo: Update 2022-04-15] We found some formatting issues during dataset processing. Several tasks may be affected. If you have used the previous version, please update the data files using the latest code.
# Create a new conda environment (optional)
conda create -n crossfit python=3.6.9
conda activate crossfit
# For building the NLP Few-shot Gym
pip install datasets==1.4.0 py7zr wget
# For reproducing the baseline methods
pip install torch==1.1.0 higher==0.2.1 scikit-learn==0.24.1 scipy==1.4.1 rouge==1.0.0
pip install git+https://github.com/huggingface/transformers.git@7b75aa9fa55bee577e2c7403301ed31103125a35
The following code will automatically prepare the data using :hugs: huggingface datasets, reconstruct the few-shot train/dev sets we sampled, and verify the files with MD5Sum. The processing will take roughly 3 hours.
cd tasks
# Construct the gym
# --n_proc=10 means the tasks will be processed in parallel with 10 subprocesses.
python _build_gym.py --build --n_proc=10
# Verify with MD5Sum
python _build_gym.py --verify
If the processing is successful, the verification script will output [Success] All files are consistent.
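The checksum verification described above can be illustrated with a minimal sketch. This is not the code in _build_gym.py; the function names and the `{relative_path: md5}` mapping are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(data_dir, expected):
    """Compare files under data_dir against a {relative_path: md5} mapping.

    `expected` is a hypothetical manifest; the real script may store its
    reference checksums differently. Returns the relative paths that are
    missing or whose checksum differs.
    """
    data_dir = Path(data_dir)
    mismatches = []
    for rel_path, expected_md5 in expected.items():
        file_path = data_dir / rel_path
        if not file_path.is_file() or md5_of_file(file_path) != expected_md5:
            mismatches.append(rel_path)
    return mismatches
```

An empty list from `find_mismatches` corresponds to the case where all files are consistent; any returned path points to a task whose script may need to be re-run.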
If the processing for any individual task goes wrong (e.g., some datasets are hosted on Google Drive and are subject to a daily download quota), you can retry later by running the individual scripts.
# For example, if you want to construct glue_sst2
cd tasks
python glue_sst2.py
Disclaimer: We use publicly-available datasets from :hugs: huggingface datasets to construct the few-shot gym. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use them. It is your responsibility to determine whether you have permission to use the dataset under the dataset's license. If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included, please contact us!
:smiley: Please check ./example_scripts for more examples!
Here we take BoolQ as an example. There are five different samples of train/dev sets for BoolQ in the directory data/boolq/. For each sample, we do a grid search over learning rate (1e-5, 2e-5, 5e-5) and batch size (2, 4, 8).
This script will not save the final model; however, the results will be logged in a csv file in --output_dir.
python tune_hps_singletask.py \
--task_dir data/boolq/ \
--do_train \
--do_predict \
--learning_rate_list 1e-5 2e-5 5e-5 \
--bsz_list 2 4 8 \
--total_steps 1000 \
--eval_period 100 \
--warmup_steps 100 \
--model facebook/bart-base \
--output_dir models/singletask-boolq \
--predict_batch_size 32
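The grid search over learning rate and batch size can be pictured with the following minimal sketch. It is not taken from tune_hps_singletask.py: the `evaluate` callable is hypothetical, and selecting the configuration with the highest dev score is an assumption about how the logged results would be used.

```python
from itertools import product

# Values from the command above.
LEARNING_RATES = [1e-5, 2e-5, 5e-5]
BATCH_SIZES = [2, 4, 8]


def grid_search(evaluate, learning_rates=LEARNING_RATES, batch_sizes=BATCH_SIZES):
    """Try every (learning rate, batch size) pair and return the best one.

    `evaluate` is a hypothetical callable that trains with the given
    hyperparameters and returns a dev-set score (higher is better).
    """
    results = []
    for lr, bsz in product(learning_rates, batch_sizes):
        results.append({"lr": lr, "bsz": bsz, "dev_score": evaluate(lr, bsz)})
    # Assumption: the best configuration is the one with the highest dev score.
    best = max(results, key=lambda row: row["dev_score"])
    return best, results
```

With three learning rates and three batch sizes, each train/dev sample yields nine runs, so the five BoolQ samples amount to 45 runs in total.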
Notes:
--checkpoint $CHECKPOINT
data/boolq/
./example_scripts
Upstream learning refers to the stage between general pre-training and downstream fine-tuning. In this stage we allow access to a set of training tasks, and we test the few-shot learning ability on a set of test tasks after this stage. Please check Table 1 in our preprint paper for more details about task partitions.
We include two upstream learning methods: multi-task learning and MAML (model-agnostic meta-learning). We are currently working on first-order meta-learning algorithms (First-order MAML and Reptile)!
Notes:
--custom_tasks_splits
The --total_steps values in the scripts above are pre-computed so that the learning rate decreases to zero linearly during learning. We also pre-compute --warmup_steps to be 6% of the total steps.
Here we provide the checkpoints after upstream learning.
Task Partition | Multi-task | Meta-learn |
---|---|---|
1. Random | multi-task-random-bart-base.pt | meta-learn-random-bart-base.pt |
:smiley: Please stay tuned for more checkpoints!
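As an aside on the earlier note about --total_steps and --warmup_steps in the upstream learning scripts, the relationship can be sketched as follows. This is a minimal illustration, not the repository's scheduler: the function names are hypothetical, and the linear shape of the warmup phase is an assumption (the notes only state that the learning rate decreases to zero linearly and that warmup is 6% of the total steps).

```python
def warmup_steps_for(total_steps, warmup_ratio=0.06):
    """Number of warmup steps when warmup is a fixed fraction of training."""
    return round(total_steps * warmup_ratio)


def linear_schedule(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step` with linear warmup, then linear decay to zero.

    Assumption: the warmup phase rises linearly from 0 to peak_lr.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)
```

Under this sketch, the learning rate reaches zero exactly at the last step only if --total_steps matches the number of updates actually performed, which is why the value has to be computed ahead of time.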
./example_scripts/finetune_a_list_of_tasks.sh will help you fine-tune a list of tasks sequentially, given a certain model initialization.
./example_scripts/collect_results.py will read each results.csv file in a given directory, then compute the mean and standard deviation of dev/test performance.
We thank the authors and crowd-workers of all resources used in our study! This work would not have been possible without your efforts. We thank the :hugs: huggingface datasets team for making datasets more accessible. Our code is modified from shmsw25/bart-closed-book-qa, thanks to the authors!
If you find bugs in our code, encounter problems when running the code, or have suggestions for the CrossFit project, please submit an issue, or reach out to Qinyuan (qinyuany@usc.edu) and Bill (yuchen.lin@usc.edu)!
If you use our code in your study or find our paper useful, please cite us with the bibkey ye-etal-2021-crossfit in the official ACL Anthology, or use the following BibTeX: