This repository contains the reference code for the ACM MM 2022 paper "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning".
Paper Access | MSCOCO Leaderboard (TeamName:CMG) | Baidu Disk
To run this code, download the pre-trained vision backbones, the raw MSCOCO images and the annotations, and create the following directories:
mkdir -p "$DataPath/coco_caption/"
mkdir -p "$DataPath/resume_model/"
mkdir -p "$DataPath/saved_models/"
mkdir -p PTSN/saved_transformer_models/
Pre-trained vision backbones:
Please download SwinT-B/16_22k_224x224 (password: swin) and put it in $DataPath/resume_model/. Other backbones (e.g. SwinT-L 384x384) can be downloaded from their official links.
Raw data:
Please download train2014.zip, val2014.zip and test2014.zip, then unzip them and put the extracted files in $DataPath/coco_caption/IMAGE_COCO/.
Annotations:
Please download the annotations and put them in $DataPath/coco_caption/annotations/.
Other data:
Please download trained_models (password: ptsn). It includes word_embeds.pth, hyper_protos.pth, the trained checkpoints and the training logs. Put word_embeds.pth and hyper_protos.pth in PTSN/, and put the checkpoints in $DataPath/saved_models/.
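Before running anything, it can help to confirm that the layout described above is in place. The following is a minimal sketch, not part of the released code: `check_layout` is a hypothetical helper name, and it only checks the directories and the two .pth files named in this README (it does not verify backbone or checkpoint file names, which depend on what you downloaded).

```shell
#!/bin/sh
# check_layout DATA_PATH PTSN_DIR   (hypothetical helper, not shipped with PTSN)
# Prints every expected path that is missing; returns non-zero if any is missing.
check_layout() {
    data_path=$1
    ptsn_dir=$2
    missing=0

    # Directories named in the setup instructions.
    for d in \
        "$data_path/coco_caption/IMAGE_COCO" \
        "$data_path/coco_caption/annotations" \
        "$data_path/resume_model" \
        "$data_path/saved_models" \
        "$ptsn_dir/saved_transformer_models"
    do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d"
            missing=1
        fi
    done

    # Files that the instructions place directly in PTSN/.
    for f in "$ptsn_dir/word_embeds.pth" "$ptsn_dir/hyper_protos.pth"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f"
            missing=1
        fi
    done

    return "$missing"
}
```

For example, from the repository root: check_layout "$DataPath" ./PTSN && echo "layout looks complete".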
To reproduce the results of our paper, do the following two steps:
Replace /path/to/data in ./test_ptsn.sh with $DataPath.
Run the commands below:
cd ./PTSN
sh test_ptsn.sh
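The path substitution in the first step can also be scripted instead of edited by hand. This is a sketch under the assumption that the placeholder appears in the script literally as /path/to/data; `set_data_path` is a hypothetical helper name, not something provided by this repository.

```shell
#!/bin/sh
# set_data_path SCRIPT DATA_PATH   (hypothetical helper, not shipped with PTSN)
# Replaces every literal "/path/to/data" in SCRIPT with DATA_PATH, in place.
set_data_path() {
    script=$1
    data_path=$2
    tmp_file="$script.tmp"

    # "|" is used as the sed delimiter so that slashes in the path need no
    # escaping; DATA_PATH itself must not contain "|", "&" or "\".
    # Writing back with cat keeps the original file's permissions.
    sed "s|/path/to/data|$data_path|g" "$script" > "$tmp_file" \
        && cat "$tmp_file" > "$script" \
        && rm -f "$tmp_file"
}
```

For example: set_data_path ./PTSN/test_ptsn.sh "$DataPath". Going through a temporary file avoids `sed -i`, whose syntax differs between GNU and BSD sed.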
To train a Swin-B version of our PTSN model, run the commands below:
cd ./PTSN
sh train_ptsn.sh
Note that training this model takes around 50 hours on four V100 GPUs.
To cite our paper, please use the following BibTeX entry:
@inproceedings{PTSN,
author = {Pengpeng Zeng and
Jinkuan Zhu and
Jingkuan Song and
Lianli Gao},
title = {Progressive Tree-Structured Prototype Network for End-to-End Image
Captioning},
booktitle = {ACM MM},
pages = {5210--5218},
year = {2022},
}