NovaMind-Z / PTSN

Repository for PTSN, an end-to-end image captioning method (ACM MM 2022).

PTSN: Progressive Tree-Structured Prototype Network

This repository contains the reference code for the ACM MM 2022 paper "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning".


Paper Access | MSCOCO Leaderboard (TeamName:CMG) | Baidu Disk

Environment setup

Data preparation

To run this code, first download the pre-trained vision backbones, the raw MSCOCO images, and the annotations. Create the working directories as follows:

mkdir -p $DataPath/coco_caption/
mkdir -p $DataPath/resume_model/
mkdir -p $DataPath/saved_models/
mkdir -p PTSN/saved_transformer_models/
  1. Pre-trained vision backbones:

    Download SwinT-B/16_22k_224x224 (password: swin) and put it in $DataPath/resume_model/. Other backbones (e.g. SwinT-L 384x384) can be downloaded from their official links.

  2. Raw data:

    Download train2014.zip, val2014.zip and test2014.zip, then unzip them and put the extracted files in $DataPath/coco_caption/IMAGE_COCO/.

  3. Annotations:

    Download the annotations and put them in $DataPath/coco_caption/annotations/.

  4. Other data:

    Download trained_models (password: ptsn). It includes word_embeds.pth, hyper_protos.pth, trained checkpoints and training logs. Put word_embeds.pth and hyper_protos.pth in PTSN/, and put the checkpoints in $DataPath/saved_models/.
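
Before running anything, it can help to confirm that the downloads ended up where the steps above say they should. The following is a minimal sketch, not part of the repository: `check_layout` is a hypothetical helper, and it only checks the directories and the two .pth files named in this section (it does not validate their contents or look for specific backbone or checkpoint file names).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PTSN): report which of the paths
# named in the data preparation steps are missing.
# Usage: check_layout DATA_PATH REPO_DIR
check_layout() {
    data_path="$1"
    repo_dir="$2"
    missing=0

    # Directories expected under the data root.
    for d in coco_caption/IMAGE_COCO coco_caption/annotations resume_model saved_models; do
        if [ ! -d "$data_path/$d" ]; then
            echo "missing directory: $data_path/$d"
            missing=1
        fi
    done

    # Files expected inside the PTSN source directory.
    for f in word_embeds.pth hyper_protos.pth; do
        if [ ! -f "$repo_dir/$f" ]; then
            echo "missing file: $repo_dir/$f"
            missing=1
        fi
    done

    # Returns 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

For example, `check_layout "$DataPath" ./PTSN` prints one line per missing path and returns a non-zero status if anything is absent.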

Inference procedure

To reproduce the results of our paper, do the following two steps:

  1. Replace /path/to/data in ./test_ptsn.sh with $DataPath.

  2. Run the commands below:

    cd ./PTSN
    sh test_ptsn.sh
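
The path substitution in step 1 can also be scripted instead of edited by hand. The sketch below assumes the placeholder appears in the script literally as `/path/to/data`; `set_data_path` is a hypothetical helper, and it will misbehave if the data path itself contains `|` or `&`, which are special to `sed` here.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PTSN): replace every literal
# "/path/to/data" in a script with the given data path.
# Usage: set_data_path SCRIPT DATA_PATH
set_data_path() {
    script="$1"
    data_path="$2"
    tmp_file="$(mktemp)"

    # Write the substituted script to a temporary file first, so the
    # original is left untouched if sed fails.
    sed "s|/path/to/data|$data_path|g" "$script" > "$tmp_file" || return 1

    # Copy the contents back rather than moving the file, which keeps the
    # original file's permissions.
    cat "$tmp_file" > "$script" && rm -f "$tmp_file"
}
```

For example, `set_data_path ./test_ptsn.sh "$DataPath"` rewrites the inference script in place; the same call works for any other script that uses the same placeholder.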

Training procedure

To train a Swin-B version of our PTSN model, do the following two steps:

  1. Replace /path/to/data in ./train_ptsn.sh with $DataPath.
  2. Run the commands below:
    cd ./PTSN
    sh train_ptsn.sh

    Note that training this model takes around 50 hours on 4 V100 GPUs.
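
Because a run of this length is likely to outlive a terminal session, one option is to start it detached and send its output to a log file. This is only a sketch using the standard `nohup` utility; `run_detached` and the log file name are hypothetical, and the repository's own scripts may already handle logging differently.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with PTSN): start a script in the
# background, immune to hangups, with stdout and stderr captured in a log.
# Usage: run_detached SCRIPT LOG_FILE
run_detached() {
    script="$1"
    log_file="$2"

    # nohup keeps the job alive after the terminal closes; both output
    # streams go to the log file.
    nohup sh "$script" > "$log_file" 2>&1 &

    # Print the background job's process ID so it can be monitored later.
    echo "$!"
}
```

For example, from inside ./PTSN, `run_detached train_ptsn.sh train.log` starts training and prints the process ID; `tail -f train.log` then follows the progress.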

Citation

To cite our paper, please use the following BibTeX entry:

@inproceedings{PTSN,
  author    = {Pengpeng Zeng and
               Jinkuan Zhu and
               Jingkuan Song and
               Lianli Gao},
  title     = {Progressive Tree-Structured Prototype Network for End-to-End Image
               Captioning},
  booktitle = {ACM MM},
  pages     = {5210--5218},
  year      = {2022},
}