This project employs CLIP (paper: https://arxiv.org/pdf/2103.00020.pdf) as a backbone to perform image-text retrieval.
The publicly available CLIP weights do not perform well on Flickr30K and MSCOCO out of the box. The public model achieves:
|           | Image-to-Text |      |      | Text-to-Image |      |      |
|-----------|---------------|------|------|---------------|------|------|
| Dataset   | R@1           | R@5  | R@10 | R@1           | R@5  | R@10 |
| MSCOCO-1K | 26.1          | 64.6 | 81.2 | 48.0          | 77.5 | 88.2 |
| Flickr30k | 36.0          | 71.9 | 83.4 | 55.8          | 80.7 | 88.3 |
This project trains a non-linearity on top of CLIP features as a fine-tuning step to improve the learned representations. The added non-linear probe performs significantly better when fine-tuned on these datasets.
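As a rough illustration, such a probe can be thought of as a small MLP head applied to frozen CLIP embeddings. The sketch below is a minimal, assumption-laden version: the hidden size, activation, and `ViT-B/32` backbone are illustrative choices, not necessarily the exact configuration used in this repository.

```python
import torch
import torch.nn as nn
import clip

class NonLinearProbe(nn.Module):
    """Two-layer MLP on top of frozen CLIP embeddings.

    Hypothetical sketch: hidden size and activation are
    assumptions, not this repo's exact configuration.
    """
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        x = self.mlp(x)
        # Re-normalize so cosine similarity remains meaningful.
        return x / x.norm(dim=-1, keepdim=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image_probe = NonLinearProbe().to(device)
text_probe = NonLinearProbe().to(device)

# The CLIP backbone stays frozen; only the probes are trained.
for p in model.parameters():
    p.requires_grad_(False)
```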
Please follow the installation requirements from the official CLIP repository:
https://github.com/openai/CLIP
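At the time of writing, the official instructions amount to roughly the following (verify against the linked README in case they have changed):

$ pip install ftfy regex tqdm

$ pip install git+https://github.com/openai/CLIP.git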
The model requires two txt files listing the images and the captions to be used by the dataloader. To generate them, run:
$ python generate_data.py
Modify `data_path` in the dataloader accordingly.
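For reference, a minimal dataloader over two such files might look like the sketch below. The file names (`images.txt`, `captions.txt`) and the one-entry-per-line layout are assumptions for illustration; the actual files produced by generate_data.py may be organized differently.

```python
from torch.utils.data import Dataset
from PIL import Image
import clip

class RetrievalDataset(Dataset):
    """Hypothetical paired image/caption dataset.

    Assumes one image path per line in images.txt and the
    matching caption on the same line of captions.txt; the
    files written by generate_data.py may differ.
    """
    def __init__(self, data_path, preprocess):
        self.preprocess = preprocess
        with open(f"{data_path}/images.txt") as f:
            self.images = [line.strip() for line in f]
        with open(f"{data_path}/captions.txt") as f:
            self.captions = [line.strip() for line in f]

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        image = self.preprocess(Image.open(self.images[idx]).convert("RGB"))
        text = clip.tokenize(self.captions[idx], truncate=True).squeeze(0)
        return image, text
```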
To train on Flickr30K, run:
$ python train.py --data_name f30k --logger_name runs/clip_ft_f30k
To train on MSCOCO, run:
$ python train.py --data_name coco --logger_name runs/clip_ft_coco
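For context, the R@K numbers reported above measure how often the correct match appears among the top K retrieved items. Below is a minimal sketch of the image-to-text direction, assuming L2-normalized embeddings and a one-to-one image/caption pairing; note that Flickr30K and MSCOCO actually provide five captions per image, so the real evaluation needs an index mapping.

```python
import torch

def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-text recall@K for one-to-one image/caption pairs.

    Sketch only: the benchmark datasets pair each image with
    five captions, which this simplified version ignores.
    """
    # Cosine similarity (embeddings assumed L2-normalized).
    sims = image_emb @ text_emb.t()               # (N_img, N_txt)
    ranks = sims.argsort(dim=1, descending=True)  # best matches first
    targets = torch.arange(sims.size(0)).unsqueeze(1)
    return {k: (ranks[:, :k] == targets).any(dim=1).float().mean().item()
            for k in ks}
```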