This repository contains the code to reproduce the experiments from the paper "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning". The paper explores the effect of leveraging pseudolabels to adapt vision-language models such as CLIP to downstream tasks in a unified way across prompt modalities, learning paradigms, and training strategies.
To set up the project environment, follow these steps:
Ensure that you have Python version 3.7.4 installed. You can check your Python version by running:

```bash
python --version
```
Clone the repository:

```bash
git clone https://github.com/BatsResearch/menghini-enhanceCLIPwithCLIP-code.git
```
Navigate to the root folder and execute the `setup.sh` script to install the required dependencies, including PyTorch. Note that we assume a CUDA-compatible installation of PyTorch, since GPUs are recommended for running the experiments. If you don't have access to GPUs, you can modify the script to remove the CUDA requirement.

```bash
cd menghini-enhanceCLIPwithCLIP-code/
bash setup.sh
```
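To verify that a CUDA-enabled PyTorch was installed (an assumption of the default setup script), you can run a quick check such as:

```bash
# Prints the installed PyTorch version and whether a CUDA GPU is usable.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```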
The experiments are conducted on the following six datasets: Flowers102, RESICS45, FGVC-Aircraft, MNIST, EuroSAT, and DTD, which we collectively refer to as FRAMED. We use the train and test splits provided in the paper "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models".
To access the FRAMED dataset, you can download it here. After downloading, unzip the folder to obtain the required data.
If you encounter any issues with the download or prefer an alternative method, you can instead download the datasets from the ELEVATER benchmark and rename the folders as follows (see the snippet after this list):

- `dtd/` to `DTD/`
- `eurosat_clip/` to `EuroSAT/`
- `fgvc-aircraft-2013b-variants102/` to `FGVCAircraft/`
- `oxford-flower-102/` to `Flowers102/`
- `mnist/` to `MNIST/`
- `resisc45_clip/` to `RESICS45/`
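For example, the renames above can be applied in one pass, assuming the downloaded folders sit in the current directory:

```bash
mv dtd/ DTD/
mv eurosat_clip/ EuroSAT/
mv fgvc-aircraft-2013b-variants102/ FGVCAircraft/
mv oxford-flower-102/ Flowers102/
mv mnist/ MNIST/
mv resisc45_clip/ RESICS45/
```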
After renaming, make sure each folder contains the corresponding class-name file:

- `DTD/` should contain the `class_names.txt` file
- `EuroSAT/` should contain the `class_names.txt` file
- `FGVCAircraft/` should contain the `labels.txt` file
- `Flowers102/` should contain the `class_names.txt` file
- `MNIST/` should contain the `labels.txt` file
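A quick way to confirm the layout is to check that each of these files is where it is expected; a minimal sketch, assuming the renamed folders are in the current directory:

```bash
# Reports OK or MISSING for each expected class-name / label file.
for f in DTD/class_names.txt EuroSAT/class_names.txt FGVCAircraft/labels.txt \
         Flowers102/class_names.txt MNIST/labels.txt; do
  [ -f "$f" ] && echo "OK       $f" || echo "MISSING  $f"
done
```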
Before running the experiments, create the following folders to save prompts, pseudolabels, and results:
```bash
mkdir pseudolabels
mkdir logs
mkdir trained_prompts
mkdir evaluation
```
We organized the code such that for each learning paradigm (semi-supervised learning, SSL; unsupervised learning, UL; transductive zero-shot learning, TRZSL) we can run any combination of prompt modality and training strategy.

To run zero-shot CLIP:

```bash
bash scripts/run_clip.sh
```

To run the prompt-tuning baselines, for SSL:

```bash
bash scripts/run_prompts_ssl.sh
```

For TRZSL:

```bash
bash scripts/run_prompts_trzsl.sh
```

(Prompt tuning without pseudolabels is not applicable in UL, where no labeled data is available, hence there is no UL script here.)
To execute the training strategies employing pseudolabels across prompt modalities, run the following.

For SSL:

```bash
bash scripts/run_pseudolabels_ssl.sh
```

For UL:

```bash
bash scripts/run_pseudolabels_ul.sh
```

For TRZSL:

```bash
bash scripts/run_pseudolabels_trzsl.sh
```
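At the core of these strategies, CLIP's own zero-shot predictions on unlabeled data serve as pseudolabels, keeping the most confident examples per class. The scripts above handle this end to end; purely as an illustration (this is not the repository's implementation), here is a minimal sketch of per-class top-K pseudolabeling with OpenAI's `clip` package, where the class names, image paths, and `K` are hypothetical placeholders:

```python
# Illustrative sketch only, not the repository's implementation.
# Assumes OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["daisy", "rose", "sunflower"]        # placeholder classes
image_paths = ["img0.jpg", "img1.jpg", "img2.jpg"]  # placeholder unlabeled images
K = 1                                               # pseudolabels kept per class

# Encode one textual prompt per class.
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Score every unlabeled image against every class with CLIP's zero-shot head.
rows = []
with torch.no_grad():
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0).to(device)
        img_feat = model.encode_image(img)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        rows.append((100.0 * img_feat @ text_feat.T).softmax(dim=-1).squeeze(0))
probs = torch.stack(rows)  # shape: (num_images, num_classes)

# Keep the K most confident images per class as its pseudolabeled examples.
pseudolabels = {
    c: [image_paths[i] for i in probs[:, j].topk(K).indices.tolist()]
    for j, c in enumerate(class_names)
}
print(pseudolabels)
```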
Logs of the runs are saved in `logs/`.

The folder `pseudolabels/` gathers the pseudolabels used for each prompt modality, learning paradigm, and training strategy. For iterative methods, we store them at each iteration.

In `trained_prompts/`, we save the prompts used to make predictions. For iterative methods, we save the prompts at each iteration.

The predictions of each method are saved in `evaluation/`.
[1] Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021

[2] Learning to Prompt for Vision-Language Models, Zhou et al., 2021

[3] Visual Prompt Tuning, Jia et al., 2022

[4] Unified Vision and Language Prompt Learning, Zang et al., 2022
The tables below report the results of the experiments for each prompt modality. GRIP is the iterative pseudolabeling strategy proposed in the paper; CLIP [1], CoOp [2], VPT [3], and UPT [4] are the baselines.
**Textual prompts**

| | Flowers102 | | | RESICS45 | | | FGVCAircraft | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 63.7 | 63.7 | 63.4 | 54.5 | 54.5 | 54.5 | 17.6 | 17.6 | 17.9 |
| CoOp [2] | 76.8 | - | 63.2 | 58.5 | - | 63.4 | 14.9 | - | 21.7 |
| GRIP | 83.6 | 69.8 | 86.3 | 74.1 | 70.6 | 81.1 | 17.0 | 15.2 | 26.1 |
| | MNIST | | | EuroSAT | | | DTD | | |
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 25.1 | 25.1 | 20.8 | 32.9 | 32.9 | 30.5 | 43.2 | 43.2 | 43.4 |
| CoOp [2] | 56.4 | - | 21.2 | 59.5 | - | 49.7 | 37.1 | - | 46.3 |
| GRIP | 71.8 | 67.9 | 74.1 | 58.7 | 57.2 | 92.3 | 56.1 | 46.1 | 65.3 |
**Visual prompts**

| | Flowers102 | | | RESICS45 | | | FGVCAircraft | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 63.7 | 63.7 | 63.4 | 54.5 | 54.5 | 54.5 | 17.6 | 17.6 | 17.9 |
| VPT [3] | 63.7 | - | 64.7 | 60.8 | - | 67.1 | 17.8 | - | 26.7 |
| GRIP | 67.9 | 63.1 | 77.2 | 71.2 | 68.4 | 82.2 | 19.4 | 17.5 | 26.4 |
| | MNIST | | | EuroSAT | | | DTD | | |
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 25.1 | 25.1 | 20.8 | 32.9 | 32.9 | 30.5 | 43.2 | 43.2 | 43.4 |
| VPT [3] | 42.5 | - | 25.5 | 47.1 | - | 62.2 | 36.4 | - | 44.2 |
| GRIP | 69.7 | 68.0 | 69.5 | 63.5 | 63.7 | 97.0 | 54.6 | 50.5 | 62.8 |
**Multimodal prompts**

| | Flowers102 | | | RESICS45 | | | FGVCAircraft | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 63.7 | 63.7 | 63.4 | 54.5 | 54.5 | 54.5 | 17.6 | 17.6 | 17.9 |
| UPT [4] | 68.0 | - | 61.1 | 62.8 | - | 58.8 | 11.1 | - | 15.9 |
| GRIP | 74.6 | 64.8 | 82.0 | 73.7 | 69.4 | 82.2 | 17.4 | 14.7 | 17.9 |
| | MNIST | | | EuroSAT | | | DTD | | |
| Method | SSL | UL | TRZSL | SSL | UL | TRZSL | SSL | UL | TRZSL |
| CLIP [1] | 25.1 | 25.1 | 20.8 | 32.9 | 32.9 | 30.5 | 43.2 | 43.2 | 43.4 |
| UPT [4] | 64.4 | - | 63.6 | 68.9 | - | 60.4 | 43.7 | - | 36.9 |
| GRIP | 65.9 | 68.2 | 73.8 | 60.4 | 61.5 | 95.5 | 54.1 | 47.4 | 64.4 |
If you find this work helpful, please consider citing the following paper:
```bibtex
@inproceedings{
menghini2023enhancing,
title={Enhancing {CLIP} with {CLIP}: Exploring Pseudolabeling for Limited-Label Prompt Tuning},
author={Cristina Menghini and Andrew Delworth and Stephen Bach},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=2b9aY2NgXE}
}
```