In this work, we present a simple post-training approach for CLIP models that largely overcomes their visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, using only images (without corresponding text). We demonstrate that DIVA substantially improves CLIP's performance on the challenging MMVP-VLM benchmark, which assesses fine-grained visual abilities (e.g., 3-7% ↑), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that DIVA preserves CLIP's strong zero-shot capabilities.
Given an image, the CLIP model encodes its visual features, which form the main part of the condition. The generative diffusion model then predicts the added noise, taking the noisy image and the condition as input. We optimize CLIP's representation by maximizing the image likelihood with the diffusion loss, i.e., via generative feedback.
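The objective described above can be written in the standard noise-prediction form used for diffusion models. This is a sketch under assumptions, not the exact formulation from the paper: the symbols are introduced here for illustration, with $c_\theta$ standing for the CLIP-derived condition (parameters $\theta$), $\epsilon_\phi$ for the diffusion model's noise predictor, and $\bar{\alpha}_t$ for the noise schedule. It also assumes that only $\theta$ is updated.

```latex
% Forward (noising) process at timestep t
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Diffusion loss, minimized with respect to the CLIP parameters \theta
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, \epsilon,\, t}
\left[ \left\| \epsilon - \epsilon_\phi\big(x_t,\, t,\, c_\theta(x_0)\big) \right\|_2^2 \right]
```

Minimizing this loss is the usual surrogate for maximizing the image likelihood, so the gradient that reaches $\theta$ through the condition plays the role of the generative feedback.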
Clone this repository and install the required packages:
git clone https://github.com/baaivision/DIVA.git
cd DIVA
mkdir -p outputs logs datasets pretrained_weights/CLIP pretrained_weights/SD
conda create -n diva python=3.9
conda activate diva
pip install -r requirements.txt
Core packages:
For data preparation, please refer to image2dataset and MMVP for the training and evaluation data used in this work. After collecting the corresponding datasets, put them directly into the dataset/ folder.
For pre-trained weights, please refer to OpenAI ViT-L-14/224&336, MetaCLIP ViT-L/H-14, SigLIP ViT-SO-14/224, SigLIP ViT-SO-14/384, DFN ViT-H-14/224, DFN ViT-H-14/378 and SD-2-1-base to obtain the weights of the discriminative CLIP models and of the diffusion model that provides generative feedback. After downloading the weights, move them to pretrained_weights/CLIP/ and pretrained_weights/SD/, respectively.
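Before moving on, it can help to confirm that the folders are in place and populated. The following is a minimal sketch, not part of the repository: the helper name `missing_or_empty` is hypothetical, and the directory names are taken from the `mkdir -p` command in the installation step. Adjust `EXPECTED_DIRS` if your training scripts read from a differently named folder.

```python
from pathlib import Path

# Directories created by the `mkdir -p` step in the installation instructions.
EXPECTED_DIRS = (
    "outputs",
    "logs",
    "datasets",
    "pretrained_weights/CLIP",
    "pretrained_weights/SD",
)

# Directories that should contain files before training starts (assumption).
MUST_BE_POPULATED = ("datasets", "pretrained_weights/CLIP", "pretrained_weights/SD")


def missing_or_empty(root="."):
    """Return {relative_dir: reason} for directories that are absent or empty."""
    problems = {}
    for rel in EXPECTED_DIRS:
        path = Path(root) / rel
        if not path.is_dir():
            problems[rel] = "missing"
        elif rel in MUST_BE_POPULATED and not any(path.iterdir()):
            problems[rel] = "empty"
    return problems


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for rel, reason in missing_or_empty().items():
        print(f"{rel}: {reason}")
```

The check only looks at whether folders exist and contain something. It does not validate checkpoint names or dataset formats.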
To support DIVA's condition design, some source code in the installed CLIP and OpenCLIP packages needs to be modified.
For OpenAI CLIP, replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/clip/model.py with that of our provided condition/OpenAICLIP_for_clip_model.py.
For MetaCLIP and DFN, replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/open_clip/transformer.py with that of our provided condition/MetaCLIP_for_openclip_transformer.py or condition/DFN_for_openclip_transformer.py, respectively.
For SigLIP, replace the content of Your Conda Installation Path/anaconda3/envs/diva/lib/python3.9/site-packages/timm/models/vision_transformer.py with that of our provided condition/SigLIP_for_timm_models_visiontransformer.py.
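The replacements above overwrite files inside installed packages, so keeping a copy of the original is a sensible precaution. The sketch below is one way to do that with the Python standard library. The helper names `locate_module_file` and `replace_with_backup` and the `.orig` suffix are illustrative assumptions, not part of the repository.

```python
import importlib.util
import shutil
from pathlib import Path


def locate_module_file(module_name):
    """Return the source file of an importable module, or None if unavailable."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:  # raised when the parent package is not installed
        return None
    if spec is None or not spec.has_location:
        return None
    return Path(spec.origin)


def replace_with_backup(provided, target):
    """Copy `provided` over `target`, saving the original as `<target>.orig` once."""
    provided, target = Path(provided), Path(target)
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():  # never overwrite the pristine copy
        shutil.copy2(target, backup)
    shutil.copy2(provided, target)
    return backup


# Example for OpenAI CLIP, run from the repository root inside the `diva` env:
#   target = locate_module_file("clip.model")
#   if target is not None:
#       replace_with_backup("condition/OpenAICLIP_for_clip_model.py", target)
```

Because the MetaCLIP and DFN files both replace the same open_clip/transformer.py, only one of them can be in place at a time. The saved `.orig` copy makes it easier to restore the original or to switch between the two.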
After completing the preparation steps above, you can start training DIVA with one of the following commands:
# For OpenAICLIP
bash DIVA_for_OpenAICLIP.sh
# For MetaCLIP
bash DIVA_for_MetaCLIP.sh
# For SigLIP
bash DIVA_for_SigLIP.sh
# For DFN
bash DIVA_for_DFN.sh
Method | Image Size | Params (M) | Average Score |
---|---|---|---|
OpenAI ViT-L-14 | 224² | 427.6 | 25.9 (+6.6) |
OpenAI ViT-L-14 | 336² | 427.9 | 25.2 (+5.2) |
MetaCLIP ViT-L-14 | 224² | 427.6 | 27.4 (+3.7) |
MetaCLIP ViT-H-14 | 224² | 986.1 | 31.9 (+6.7) |
SigLIP ViT-SO-14 | 224² | 877.4 | 40.7 (+2.9) |
SigLIP ViT-SO-14 | 384² | 878.0 | 38.5 (+1.5) |
DFN ViT-H-14 | 224² | 986.1 | 43.7 (+4.4) |
DFN ViT-H-14 | 378² | 986.7 | 37.8 (+3.0) |
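The score cells combine two numbers. Assuming the value in parentheses is the gain over the corresponding original model (an interpretation consistent with the 3-7% improvement quoted above, but not stated in the table itself), the implied baseline score can be recovered by subtraction. The helper `split_score` below is a hypothetical illustration of that arithmetic.

```python
import re

# Score cells copied from the table above.
ROWS = [
    ("OpenAI ViT-L-14 @224", "25.9 (+6.6)"),
    ("OpenAI ViT-L-14 @336", "25.2 (+5.2)"),
    ("MetaCLIP ViT-L-14 @224", "27.4 (+3.7)"),
    ("MetaCLIP ViT-H-14 @224", "31.9 (+6.7)"),
    ("SigLIP ViT-SO-14 @224", "40.7 (+2.9)"),
    ("SigLIP ViT-SO-14 @384", "38.5 (+1.5)"),
    ("DFN ViT-H-14 @224", "43.7 (+4.4)"),
    ("DFN ViT-H-14 @378", "37.8 (+3.0)"),
]

_CELL = re.compile(r"^\s*([0-9.]+)\s*\(\+([0-9.]+)\)\s*$")


def split_score(cell):
    """Split a cell like '25.9 (+6.6)' into (score, gain, implied_baseline)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    score, gain = float(match.group(1)), float(match.group(2))
    # Round to one decimal to avoid floating-point noise in the difference.
    return score, gain, round(score - gain, 1)


if __name__ == "__main__":
    for name, cell in ROWS:
        score, gain, baseline = split_score(cell)
        print(f"{name:24s} baseline {baseline:5.1f} -> {score:5.1f} (+{gain})")
```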
Note that, because of randomness in the condition design during training and in the selection of local patch tokens during inference for OpenAI CLIP, the scores obtained on the MMVP-VLM benchmark with our provided OpenAI CLIP weights may differ from the results reported in our paper. If the scores do not meet expectations, we recommend repeating the run with several different random seeds.
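Trying several seeds can be organized as a small loop that records every score. The sketch below is generic and makes no claim about the repository's evaluation code: `best_over_seeds` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a stand-in that returns a pseudo-random number in place of a real MMVP-VLM evaluation.

```python
import random


def best_over_seeds(evaluate, seeds):
    """Run `evaluate(seed)` per seed; return (best_seed, best_score, all_scores)."""
    scores = {}
    for seed in seeds:
        # This seeds only Python's RNG. A real run would also need to seed
        # the RNGs of whatever numerical framework the evaluation uses.
        random.seed(seed)
        scores[seed] = evaluate(seed)
    best_seed = max(scores, key=scores.get)
    return best_seed, scores[best_seed], scores


def toy_evaluate(seed):
    # Stand-in for a real evaluation call (assumption): a pseudo-random score.
    return round(random.uniform(20.0, 30.0), 1)


if __name__ == "__main__":
    seed, score, all_scores = best_over_seeds(toy_evaluate, [0, 1, 2, 3])
    print(f"best seed: {seed}, score: {score}")
    print(f"all scores: {all_scores}")
```

Keeping the full dictionary of scores, not just the maximum, makes the run-to-run spread visible when comparing against the table.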
DIVA is built upon the awesome Diffusion-TTA, MMVP, CLIP, OpenCLIP, and timm.
@article{wang2024diffusion,
title={Diffusion Feedback Helps CLIP See Better},
author={Wang, Wenxuan and Sun, Quan and Zhang, Fan and Tang, Yepeng and Liu, Jing and Wang, Xinlong},
journal={arXiv preprint arXiv:2407.20171},
year={2024}
}