CLIP-based Image-Text Matching

This project employs CLIP (paper: https://arxiv.org/pdf/2103.00020.pdf) as a backbone to perform image-text retrieval.

The publicly available CLIP weights do not yield good retrieval results on Flickr30K and MSCOCO out of the box. The public model achieves:

              Image-to-Text        Text-to-Image
Dataset       R@1   R@5   R@10     R@1   R@5   R@10
MSCOCO-1K     26.1  64.6  81.2     48.0  77.5  88.2
Flickr30k     36.0  71.9  83.4     55.8  80.7  88.3
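
Numbers like these are computed by embedding every image and caption with the public model and ranking by cosine similarity. A minimal sketch, assuming precomputed batches and a single ground-truth caption per image (both datasets actually provide five captions per image, so the real evaluation is more involved):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # public weights

@torch.no_grad()
def encode(images, captions):
    # images: a batch already transformed with `preprocess`; captions: list of str
    img = model.encode_image(images.to(device)).float()
    txt = model.encode_text(clip.tokenize(captions).to(device)).float()
    # L2-normalize so dot products are cosine similarities
    return img / img.norm(dim=-1, keepdim=True), txt / txt.norm(dim=-1, keepdim=True)

def recall_at_k(img_feats, txt_feats, k):
    sims = img_feats @ txt_feats.t()                # (N_img, N_txt) similarities
    topk = sims.topk(k, dim=1).indices              # top-k caption indices per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    return 100.0 * (topk == targets).any(dim=1).float().mean().item()  # I2T R@k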

This project trains a non-linearity on top of the CLIP features as a fine-tuning step to improve the learned representations. The added non-linear probe performs significantly better on these datasets after fine-tuning.
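
A minimal sketch of such a probe, with illustrative layer sizes (the repository's actual head may differ):

import torch
import torch.nn as nn

class NonLinearProbe(nn.Module):
    # Hypothetical non-linear head applied to CLIP embeddings;
    # dim=512 matches the ViT-B/32 output but is an assumption here.
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats):
        out = self.mlp(feats)
        # Keep outputs on the unit sphere so retrieval stays cosine-based.
        return out / out.norm(dim=-1, keepdim=True)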

Install

Please follow the installation requirements from the official CLIP repository:

https://github.com/openai/CLIP
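
At the time of writing, the CLIP README boils down to installing PyTorch and then running the following (check the repository above for the current instructions):

$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git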

Generate Data

This model requires generating two .txt files that list the images and the captions to be used by the model.

Run:

$ python generate_data.py

Train

Modify data_path in the dataloader to point to your local copy of the datasets.
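
The change is a one-line edit along these lines (the exact file and variable location depend on the repository's dataloader; the path below is illustrative):

# inside the dataloader module
data_path = "/path/to/your/datasets"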

To train on Flickr30K, run:

$ python train.py --data_name f30k --logger_name runs/clip_ft_f30k

To train on MSCOCO, run:

$ python train.py --data_name coco --logger_name runs/clip_ft_coco
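
If the logger writes TensorBoard summaries, as the runs/ directory convention suggests (an assumption, not confirmed by this README), training can be monitored with:

$ tensorboard --logdir runs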

License

Apache License 2.0