Mohammad Fahes1, Tuan-Hung Vu1,2, Andrei Bursuc1,2, Patrick Pérez3, Raoul de Charette1
1 Inria, Paris, France.
2 valeo.ai, Paris, France.
3 Kyutai, Paris, France.
TL;DR: CLIP projects visual embeddings into the shared latent space using a linear projection layer. We show that simply fine-tuning this single layer is a strong alternative to linear probing, prompt tuning, and CLIP adapters, and it also performs well in test-time adaptation.
Stay tuned for the code!
Paper: https://arxiv.org/abs/2410.05270
We fine-tune the pretrained linear projection layer of the vision encoder, with a regularization loss that keeps the weights close to their pretrained values.
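Below is a minimal sketch of this idea using OpenAI's CLIP package: only `model.visual.proj` (the last visual projection) is trainable, and an L2 penalty pulls it toward its pretrained value. The few-shot cross-entropy objective over class text embeddings, the prompt template, `lambda_reg`, and the optimizer settings are illustrative assumptions, not the paper's exact recipe; see the paper for the actual loss and hyperparameters.

```python
import torch
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # train in fp32 for simplicity

# Freeze everything except the last visual projection layer.
for p in model.parameters():
    p.requires_grad_(False)
proj = model.visual.proj  # nn.Parameter of shape [width, embed_dim]
proj.requires_grad_(True)
proj_init = proj.detach().clone()  # pretrained weights, used for regularization

# Precompute frozen text embeddings for the class names (illustrative classes/prompt).
class_names = ["cat", "dog"]
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feat = F.normalize(model.encode_text(tokens), dim=-1)

optimizer = torch.optim.AdamW([proj], lr=1e-4)
lambda_reg = 1.0  # illustrative regularization weight

def training_step(images, labels):
    """images: preprocessed batch [B, 3, H, W]; labels: [B] class indices."""
    img_feat = F.normalize(model.encode_image(images), dim=-1)
    logits = model.logit_scale.exp() * img_feat @ text_feat.t()
    ce = F.cross_entropy(logits, labels)
    reg = ((proj - proj_init) ** 2).sum()  # stay close to the pretrained projection
    loss = ce + lambda_reg * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```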
@article{fahes2024fine,
  title={Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia},
  author={Fahes, Mohammad and Vu, Tuan-Hung and Bursuc, Andrei and P{\'e}rez, Patrick and de Charette, Raoul},
  journal={arXiv preprint arXiv:2410.05270},
  year={2024}
}