TensorFlow implementation of Deep Cross-Modal Projection Learning for Image-Text Matching, accepted at ECCV 2018.
We propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss for learning discriminative image-text embeddings.
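To give a rough feel for the CMPM idea, here is a minimal NumPy sketch (not the repository's TensorFlow code; function and variable names are illustrative): each image embedding is projected onto the normalized text embeddings, the projection scalars are softmax-normalized over the batch to form a predicted matching distribution, and a KL divergence aligns it with the normalized ground-truth matching distribution.

```python
import numpy as np

def cmpm_loss(image_emb, text_emb, labels, eps=1e-8):
    """Sketch of the cross-modal projection matching (CMPM) loss.

    image_emb: (B, D) image embeddings
    text_emb:  (B, D) text embeddings
    labels:    (B,) identity ids; an image-text pair matches if ids are equal
    eps:       small constant for numerical stability (illustrative value)
    """
    # Project each image embedding onto each L2-normalized text embedding.
    text_norm = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    proj = image_emb @ text_norm.T                      # (B, B) projection scalars

    # Softmax over candidates -> predicted matching distribution p.
    p = np.exp(proj - proj.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    # Normalized ground-truth matching distribution q.
    y = (labels[:, None] == labels[None, :]).astype(float)
    q = y / y.sum(axis=1, keepdims=True)

    # KL(p || q), averaged over the batch.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))
```

In the paper the loss is applied symmetrically (image-to-text and text-to-image), and CMPC additionally classifies the projected features with a norm-softmax classifier; the TensorFlow code in this repository is the authoritative implementation.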
Please download the Flickr30k Dataset (about 4.4 GB)
Please download the JSON Annotations
Convert the Flickr30k image-text data into TFRecords (about 15 GB)
cd builddata && sh scripts/format_and_convert_flickr.sh 0
Please download the pretrained ResNet-v1-152 checkpoint
Train CMPM with ResNet-152 + Bi-LSTM on Flickr30k
sh scripts/train_flickr_cmpm.sh 0
Train CMPM + CMPC with ResNet-152 + Bi-LSTM on Flickr30k
sh scripts/train_flickr_cmpm_cmpc.sh 0
Test the trained model on Flickr30k
sh scripts/test_flickr_cmpm.sh 0
We also provide the code for MSCOCO and CUHK-PEDES, which follow preparation, training, and testing procedures similar to those for Flickr30k.
Be careful with disk space (MSCOCO may require about 20.1 GB for images and 77.6 GB for TFRecords).
If you find CMPL useful in your research, please kindly cite our paper:
@inproceedings{ying2018CMPM,
  author = {Ying Zhang and Huchuan Lu},
  title = {Deep Cross-Modal Projection Learning for Image-Text Matching},
  booktitle = {ECCV},
  year = {2018}
}
If you have any questions, please feel free to contact zydl0907@mail.dlut.edu.cn