The toolkit for image classification in the benchmark: Evaluation of Language-augmented Visual Task-level Transfer [ELEVATER].
Please follow the steps below to use this codebase to reproduce the results in the paper, and onboard your own checkpoints & methods.
Our code base is developed and tested with PyTorch 1.7.0, TorchVision 0.8.0, CUDA 11.0, and Python 3.7.
conda create -n elevater python=3.7 -y
conda activate elevater
conda install pytorch==1.7.0 torchvision==0.8.0 cudatoolkit=11.0 -c pytorch
pip install -r requirements.txt
pip install -e .
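After installation, a quick check like the one below can confirm that the environment matches the versions listed above. It is a minimal sketch and not part of the toolkit: the helper names `matches` and `installed_versions` are hypothetical, and it assumes each package exposes a `__version__` attribute.

```python
import importlib
import sys

# Versions stated in this README.
EXPECTED = {"python": "3.7", "torch": "1.7.0", "torchvision": "0.8.0"}


def matches(installed, expected):
    """True if `installed` equals `expected` or extends it (e.g. 1.7.0+cu110, 3.7.12)."""
    base = installed.split("+")[0]  # drop local build tags such as +cu110
    return base == expected or base.startswith(expected + ".")


def installed_versions():
    """Collect the running Python version and the package versions, if importable."""
    versions = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in ("torch", "torchvision"):
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", None)
        except ImportError:
            versions[name] = None  # package is not installed
    return versions


if __name__ == "__main__":
    for name, found in installed_versions().items():
        ok = found is not None and matches(found, EXPECTED[name])
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: found {found}, expected {EXPECTED[name]} [{status}]")
```

A mismatch does not necessarily mean the toolkit will fail; it only signals that the environment differs from the tested one.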
We support the downstream evaluation of image classification on 20 datasets: Caltech101, CIFAR10, CIFAR100, Country211, DTD, EuroSat, FER2013, FGVCAircraft, Food101, GTSRB, HatefulMemes, KittiDistance, MNIST, Flowers102, OxfordPets, PatchCamelyon, SST2, RESISC45, StanfordCars, and VOC2007. Our toolkit also supports ImageNet-1K evaluation, whose result is shown as a reference on the leaderboard.
To evaluate on these datasets, our toolkit automatically downloads them once with vision-datasets and stores them locally for future use. You do NOT need to explicitly download any datasets. However, if you are interested in downloading all data before running experiments, please refer to [Data Download].
The ELEVATER benchmark supports three types of evaluation: zeroshot, linear probe, and finetuning. All three are integrated into a unified launch script, run.sh. By specifying different arguments, you may enable different settings, including:
num_shots=5: the number of images in few-shot learning; default=5. Use {5, 20, 50} for few-shot, and -1 for full-shot.
random_seed=0: specifies the subset of dataset samples used in few-shot; default=0. We consider [0,1,2] in our benchmark.
init_head_with_text_encoder=True: whether or not to initialize the linear head with the proposed language-augmented method, e.g., the text encoder output.
merge_encoder_and_proj=False: whether or not to merge the encoder projection and the linear head.
use_wordnet_hierachy=False: whether or not WordNet hierarchy knowledge is used.
use_wordnet_definition=False: whether or not WordNet definition knowledge is used.
use_wiktionary_definition=False: whether or not Wiktionary definition knowledge is used.
use_gpt3=False: whether or not GPT3 knowledge is used.
use_gpt3_count=0: the number of GPT3 knowledge items used: [1,2,3,4,5].
To run the benchmark toolkit, please refer to the instructions in run.sh and modify accordingly. By default, ./run.sh will run the zeroshot evaluation of the CLIP ViT/B-32 checkpoint on the Caltech-101 dataset.
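To give some intuition for init_head_with_text_encoder, the sketch below shows one plausible reading of the option: the rows of the linear head are set to the normalized text features of the class names, so that before any training the head reproduces the zero-shot cosine-similarity scores. This is a pure-Python illustration under that assumption, not the toolkit's implementation; all function names are hypothetical, and details such as a learned logit scale are omitted.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def init_head_from_text(text_features):
    """Build linear-head parameters: one weight row per class, zero bias."""
    weights = [l2_normalize(row) for row in text_features]
    bias = [0.0] * len(weights)
    return weights, bias


def linear_head(image_feature, weights, bias):
    """Plain linear layer: logits = W @ x + b."""
    return [sum(w * x for w, x in zip(row, image_feature)) + b
            for row, b in zip(weights, bias)]


def zero_shot_logits(image_feature, text_features):
    """Cosine similarity between one image feature and each class text feature."""
    img = l2_normalize(image_feature)
    return [sum(t * x for t, x in zip(l2_normalize(row), img))
            for row in text_features]


if __name__ == "__main__":
    text = [[1.0, 0.0], [1.0, 1.0]]   # toy text features for two classes
    image = [3.0, 4.0]                # toy image feature
    weights, bias = init_head_from_text(text)
    print(linear_head(l2_normalize(image), weights, bias))
    print(zero_shot_logits(image, text))
```

Under this reading, few-shot training starts from the zero-shot solution rather than from a randomly initialized head.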
You may need to launch multiple experiments in batch, as the ELEVATER benchmark contains 20 datasets. We provide an example script, run_multi.sh, where you can specify different configurations directly from the command line without modifying the shell script.
DATASET=caltech101 \
OUTPUT_DIR=./output/experiment \
bash run_multi.sh
You can refer to run_multi.sh to add other customizable configurations; dataset and output_dir are examples.
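Launching one run per dataset can also be scripted. The sketch below sets the DATASET and OUTPUT_DIR environment variables shown above and calls `bash run_multi.sh` once per dataset. It is only an illustration: the helper names are hypothetical, and the dataset list contains only the two identifiers that appear in this README (`caltech101`, `cifar10`); the identifiers for the remaining datasets are not given here and would need to be looked up in the code base.

```python
import os
import subprocess

# Only identifiers that appear in this README; extend after checking the code base.
DATASETS = ["caltech101", "cifar10"]


def build_env(dataset, output_dir, base_env=None):
    """Return a copy of the environment with DATASET and OUTPUT_DIR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DATASET"] = dataset
    env["OUTPUT_DIR"] = output_dir
    return env


def launch_all(datasets, output_dir, dry_run=True):
    """Run (or, with dry_run=True, only list) one run_multi.sh call per dataset."""
    commands = []
    for dataset in datasets:
        commands.append(
            "DATASET={} OUTPUT_DIR={} bash run_multi.sh".format(dataset, output_dir)
        )
        if not dry_run:
            # check=True stops the batch at the first failing run.
            subprocess.run(["bash", "run_multi.sh"],
                           env=build_env(dataset, output_dir), check=True)
    return commands


if __name__ == "__main__":
    for command in launch_all(DATASETS, "./output/experiment"):
        print(command)
```

With `dry_run=True` (the default) nothing is executed, which makes it easy to inspect the planned commands first.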
Our implementation and prompts are from the OpenAI repo: [Notebook] [Prompt].
For zero-shot evaluation, we support both the model from the CLIP repo and customized models.
To evaluate a customized model for zeroshot evaluation, you need to:
Put your model in vision_benchmark/models, and register it in vision_benchmark/models/__init__.py.
Name the model file with the prefix clip_; see the example vision_benchmark/models/clip_example.py.
Define encode_image(), which will be used to extract image features.
Define encode_text(), which will be used to extract text features.
Define get_zeroshot_model(config), which is used to create the model.
Provide a model configuration file; see resources/model/clip_example.yaml.
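The interface named above (encode_image(), encode_text(), and get_zeroshot_model(config)) can be illustrated with a toy, pure-Python stand-in. This is a sketch of the contract only: a real model would typically be a `torch.nn.Module` that returns tensors, the structure of `config` is defined by the toolkit and ignored here, and `ToyColorClip`, `cosine`, and `zero_shot_predict` are hypothetical names. For the authoritative template, see vision_benchmark/models/clip_example.py.

```python
import math


class ToyColorClip:
    """Toy stand-in for a CLIP-style model with the two required encoders."""

    # Hypothetical "text encoder": a fixed embedding per known class name.
    _TEXT_TABLE = {
        "red": [1.0, 0.0, 0.0],
        "green": [0.0, 1.0, 0.0],
        "blue": [0.0, 0.0, 1.0],
    }

    def encode_image(self, images):
        # Each toy "image" is an (R, G, B) mean color, used directly as its feature.
        return [[float(c) for c in image] for image in images]

    def encode_text(self, texts):
        return [list(self._TEXT_TABLE[text]) for text in texts]


def get_zeroshot_model(config):
    # The toolkit passes its own config object; this toy ignores it.
    return ToyColorClip()


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_predict(model, images, class_names):
    """Assign each image the class whose text feature is most similar."""
    text_features = model.encode_text(class_names)
    predictions = []
    for feature in model.encode_image(images):
        scores = [cosine(feature, text) for text in text_features]
        predictions.append(class_names[scores.index(max(scores))])
    return predictions


if __name__ == "__main__":
    model = get_zeroshot_model(config=None)
    print(zero_shot_predict(model, [[0.9, 0.1, 0.0]], ["red", "green", "blue"]))
```

The point of the two encoders is that image and text features live in the same space, so classification reduces to a similarity lookup against the class-name embeddings.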
We use automatic hyperparameter tuning for linear probe and finetuning evaluation. For details, please refer to Appendix Sec. D of our paper.
Models evaluated here can be models from:
To evaluate a customized model, you need to:
Put your model in vision_benchmark/models, and register it in vision_benchmark/models/__init__.py.
Name the model file with the prefix cls_; see the example vision_benchmark/models/cls_example.py.
Define forward_features(), which will be used to extract features.
Define get_cls_model(config), which is used to create the model.
Provide a model configuration file; see resources/model/example.yaml.
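To show how forward_features() and get_cls_model(config) fit into a linear probe, here is a toy, pure-Python sketch: a frozen "backbone" produces features, and only a linear classifier on top is trained. Everything in it is an assumption for illustration. `ToyBackbone`, `train_linear_probe`, and `predict` are hypothetical names, a real model would be a `torch.nn.Module` returning tensors, and the simple perceptron update stands in for the toolkit's actual training and hyperparameter tuning.

```python
class ToyBackbone:
    """Toy frozen feature extractor exposing forward_features()."""

    def forward_features(self, images):
        # Each toy "image" is a flat list of pixel values; feature = [mean, max].
        return [[sum(img) / len(img), max(img)] for img in images]


def get_cls_model(config):
    # The toolkit passes its own config object; this toy ignores it.
    return ToyBackbone()


def scores(x, weights, bias):
    """Linear classifier: one score per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def predict(x, weights, bias):
    s = scores(x, weights, bias)
    return max(range(len(s)), key=lambda k: s[k])


def train_linear_probe(features, labels, num_classes, epochs=20, lr=1.0):
    """Multi-class perceptron on frozen features; the backbone is never updated."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = predict(x, weights, bias)
            if pred != y:
                # Move the true class toward x and the wrongly predicted class away.
                for i, v in enumerate(x):
                    weights[y][i] += lr * v
                    weights[pred][i] -= lr * v
                bias[y] += lr
                bias[pred] -= lr
    return weights, bias


if __name__ == "__main__":
    images = [[0.0, 0.2], [0.1, 0.3], [0.8, 1.0], [0.7, 0.9]]  # dark vs. bright
    labels = [0, 0, 1, 1]
    feats = get_cls_model(config=None).forward_features(images)
    weights, bias = train_linear_probe(feats, labels, num_classes=2)
    print([predict(x, weights, bias) for x in feats])
```

Finetuning differs from this picture in that the backbone parameters are also updated, rather than only the linear head.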
Leaderboard submissions are supported via EvalAI. Please first generate the prediction files locally, and then submit the results to EvalAI. Details are documented below.
You need to evaluate and generate prediction files for all 20 datasets before submitting to the leaderboard. However, to test that the pipeline is working correctly, you can submit partial evaluation results. The partially evaluated results can be found at the link under the "Result file" column. You may also optionally make them appear on the leaderboard, but the "Average Score" will not be computed because the results are incomplete.
To generate the prediction files, follow the steps below:
Verify that prediction file submission is supported. Prediction file generation is only supported after commit 2c7a53c3. Please make sure that your local copy of our code base is up-to-date.
Generate prediction files for all datasets separately. Please make sure to modify the output folder accordingly so that the 20 prediction files for the same configuration appear within the same folder.
# Modify these two lines accordingly in run.sh
DATASET=caltech101 \
OUTPUT_DIR=./output/exp_1_submit \
bash run_multi.sh
Combine all prediction files into a single zip file for submission. Assume /path_to_predictions contains all 20 JSON prediction files (60 files [20 datasets * 3 seeds] for few-shot experiments). The combined prediction file will be located at /path_to_predictions/all_predictions.zip.
python commands/prepare_submit.py \
--combine_path /path_to_predictions
Please check out the format illustration and examples for prediction files in submission_file_readme.md.
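Before combining, it can help to confirm that the prediction folder holds the expected number of JSON files (20 for a single configuration, 60 for few-shot with 3 seeds). The check below is a small sketch that is independent of commands/prepare_submit.py: the function names are hypothetical, and it only counts `*.json` files without validating their contents.

```python
from pathlib import Path

NUM_DATASETS = 20  # number of datasets in the benchmark


def expected_file_count(few_shot, num_seeds=3):
    """20 files for one configuration, 20 * num_seeds for few-shot runs."""
    return NUM_DATASETS * (num_seeds if few_shot else 1)


def check_prediction_folder(folder, few_shot):
    """Return (is_complete, found, expected) for the JSON files in `folder`."""
    found = len(list(Path(folder).glob("*.json")))
    expected = expected_file_count(few_shot)
    return found == expected, found, expected


if __name__ == "__main__":
    complete, found, expected = check_prediction_folder("/path_to_predictions", few_shot=False)
    print(f"found {found} of {expected} prediction files; complete={complete}")
```

An incomplete folder can still be submitted to test the pipeline, but the "Average Score" will not be computed for it.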
Navigate to the Leaderboard tab to view all baseline results and results from the community.
To extract GPT3 knowledge, modify these three lines in run_gpt3.sh accordingly, and then run sh run_gpt3.sh:
OUTPUT_DIR=./output/exp_1_extract_knowledge # the path where the generated GPT3 knowledge is saved
apikey=XXXX # Please use your GPT3 API key
ds='cifar10'
Please cite our paper as below if you use the ELEVATER benchmark or our toolkit.
@article{li2022elevater,
title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},
author={Li, Chunyuan and Liu, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Lee, Yong Jae and Hu, Houdong and Liu, Zicheng and Gao, Jianfeng},
journal={Neural Information Processing Systems},
year={2022}
}