astra-vision / ProLIP

Fine-tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia

Mohammad Fahes¹, Tuan-Hung Vu¹,², Andrei Bursuc¹,², Patrick Pérez³, Raoul de Charette¹

¹ Inria, Paris, France.

² valeo.ai, Paris, France.

³ Kyutai, Paris, France.

TL;DR: CLIP projects visual embeddings into the shared latent space using a linear projection layer. We show that simply fine-tuning this guy (:p) is a strong alternative to linear probing, prompt tuning, and CLIP adapters, and that it also performs well for test-time adaptation.

Stay tuned for the code!

Paper: https://arxiv.org/abs/2410.05270

ProLIP

We fine-tune the pretrained linear projection layer of the vision encoder with a regularization loss towards the pre-trained weights.
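
Until the official code is released, the objective described above can be illustrated with a rough NumPy sketch. This is not the authors' implementation. It assumes the regularizer is a squared Frobenius distance between the current and pretrained projection matrices, and that classification uses CLIP-style cosine-similarity logits against frozen text embeddings. The function names, `lam`, `scale`, and their default values are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def regularized_projection_loss(feats, W, W0, text_emb, labels,
                                lam=1.0, scale=100.0):
    """Cross-entropy on CLIP-style logits plus a pull toward pretrained weights.

    feats:    (N, d_in)     pre-projection visual features (frozen backbone)
    W, W0:    (d_in, d_out) current and pretrained projection matrices
    text_emb: (C, d_out)    unit-norm class text embeddings (frozen)
    labels:   (N,)          integer class ids
    lam:      regularization weight (hypothetical name and default)
    scale:    logit scale (hypothetical default)
    """
    img_emb = l2_normalize(feats @ W)              # project, then normalize
    logits = scale * img_emb @ text_emb.T          # cosine-similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()
    # Assumption: squared Frobenius distance to the pretrained projection.
    reg = lam * np.sum((W - W0) ** 2)
    return ce + reg, ce, reg


# Toy usage with random data standing in for real CLIP features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
W0 = rng.normal(size=(16, 4))
text_emb = l2_normalize(rng.normal(size=(3, 4)))
labels = rng.integers(0, 3, size=8)

total, ce, reg = regularized_projection_loss(feats, W0 + 0.1, W0, text_emb, labels)
```

In this sketch only `W` would be updated during few-shot training. The regularizer is zero at `W = W0` and grows as the projection drifts from its pretrained value, which is what keeps the fine-tuned layer close to the original CLIP weights.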

Citation

@article{fahes2024fine,
  title={Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia},
  author={Fahes, Mohammad and Vu, Tuan-Hung and Bursuc, Andrei and P{\'e}rez, Patrick and de Charette, Raoul},
  journal={arXiv preprint arXiv:2410.05270},
  year={2024}
}