
Prompt, Generate, then Cache

Official implementation of 'Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners'.

The paper has been accepted by CVPR 2023 🔥.

News

Introduction

We propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms, including CLIP, DINO, DALL-E, and GPT-3, for better few-shot learning. Specifically, CaFo works by 'Prompt, Generate, then Cache'. We leverage GPT-3 to prompt CLIP with rich linguistic semantics and generate synthetic images via DALL-E to expand the few-shot training data. Then, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo fully unleashes the potential of different pre-training methods and unifies them to achieve state-of-the-art performance for few-shot classification.
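For readers unfamiliar with the cache-model formulation (inherited from Tip-Adapter), below is a minimal PyTorch sketch of how zero-shot CLIP logits can be blended with cache logits built from few-shot features. The tensor names and the exact fusion of the CLIP and DINO branches are illustrative assumptions, not the code in this repo:

```python
import torch

def cache_logits(test_feat, cache_keys, cache_values, alpha, beta):
    # test_feat:    [N, D]   L2-normalized test features
    # cache_keys:   [D, KC]  L2-normalized few-shot training features (K shots x C classes)
    # cache_values: [KC, C]  one-hot labels of the cached training samples
    affinity = test_feat @ cache_keys                 # [N, KC] cosine similarities
    weights = (-(beta - beta * affinity)).exp()       # sharpened, non-negative weights
    return alpha * (weights @ cache_values)           # [N, C] cache-based class logits

# Illustrative blend (assumed form): zero-shot CLIP logits plus two cache branches,
# one over CLIP features and one over DINO features, sharing the one-hot values.
# clip_logits = 100. * test_clip_feat @ clip_text_weights
# logits = clip_logits \
#     + cache_logits(test_clip_feat, clip_keys, values, alpha1, beta1) \
#     + cache_logits(test_dino_feat, dino_keys, values, alpha2, beta2)
```

The exponential sharpening makes the cache behave like a soft nearest-neighbor classifier: beta controls how peaked the similarity weighting is, and alpha weighs the cache branch against the zero-shot prediction.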

Requirements

Installation

Create a conda environment and install dependencies:

```bash
git clone https://github.com/ZrrSkywalker/CaFo.git
cd CaFo

conda create -n cafo python=3.7
conda activate cafo

pip install -r requirements.txt

# Install the matching versions of torch and torchvision
conda install pytorch torchvision cudatoolkit
```
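After installation, a quick sanity check (a minimal sketch, assuming a CUDA-capable machine) confirms that the installed torch build can see the GPU:

```python
import torch
import torchvision

# Print the installed versions and verify GPU visibility.
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())  # should print True on a GPU machine
```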

Dataset

Please follow DATASET.md to download the official ImageNet and the other 10 datasets.

Foundation Models

Get Started

Configs

The running configurations for different [dataset] with [k] shots can be modified in configs/[dataset]/[k]shot.yaml, including the visual encoders and hyperparameters. We have provided the configurations for reproducing the results in the paper. You can edit search_scale, search_step, init_beta, and init_alpha for fine-grained tuning and better results, as illustrated by the sketch below.
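As a rough illustration of what search_scale and search_step control, the tuning stage sweeps alpha and beta over a grid and keeps the pair with the best validation accuracy. This is a hypothetical sketch of that loop; search_hp and eval_logits_fn are placeholder names, not functions from this repo:

```python
# Sketch of an (alpha, beta) grid search driven by
# search_scale = [s_alpha, s_beta] and search_step = [n_alpha, n_beta].
def search_hp(val_labels, eval_logits_fn, search_scale, search_step):
    best_acc, best_alpha, best_beta = 0.0, None, None
    alphas = [search_scale[0] * (i + 1) / search_step[0] for i in range(search_step[0])]
    betas  = [search_scale[1] * (i + 1) / search_step[1] for i in range(search_step[1])]
    for alpha in alphas:
        for beta in betas:
            logits = eval_logits_fn(alpha, beta)  # blended val logits for this (alpha, beta)
            acc = (logits.argmax(-1) == val_labels).float().mean().item()
            if acc > best_acc:
                best_acc, best_alpha, best_beta = acc, alpha, beta
    return best_alpha, best_beta, best_acc
```

A larger search_scale widens the range swept around the initial values, while a larger search_step makes the grid finer at the cost of more evaluation passes.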

Note that load_cache and load_pre_feat default to False for the first run, which stores the cache model and the val/test features in configs/[dataset]/. For subsequent runs, they can be set to True for faster hyperparameter tuning.

For the Caltech101 dataset, the configs for Stable Diffusion's images and ChatGPT's prompts are in configs/sd_caltech101 and configs/chat_caltech101, respectively.

Running

For 16-shot ImageNet dataset:

```bash
CUDA_VISIBLE_DEVICES=0 python main_imagenet.py --config configs/imagenet/16shot.yaml
```

For other 10 datasets:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/[dataset]/16shot.yaml
```

Numerical Results

We provide CaFo's numerical results on 11 datasets, from 1 to 16 shots, in exp_Cafo.log. The results for Tip-Adapter and Tip-Adapter-F are in exp_Tip.log.

Acknowledgement

This repo benefits from Tip-Adapter, CLIP, DINO, DALL-E, and CuPL. Thanks for their wonderful work.

Citation

```bibtex
@article{zhang2023prompt,
  title={Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners},
  author={Renrui Zhang and Xiangfei Hu and Bohao Li and Siyuan Huang and Hanqiu Deng and Hongsheng Li and Yu Qiao and Peng Gao},
  journal={arXiv preprint arXiv:2303.02151},
  year={2023}
}
```

Contributors

Renrui Zhang, Xiangfei Hu, Bohao Li

Contact

If you have any questions about this project, please feel free to contact zhangrenrui@pjlab.org.cn and sjtuhxf@sjtu.edu.cn.