This repository includes the PyTorch code for our paper "Comprehensive Image Captioning via Scene Graph Decomposition" in ECCV 2020.
Python and PyTorch can be installed with Anaconda by running:
conda create --name ENV_NAME python=3
source activate ENV_NAME
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
where ENV_NAME and the cudatoolkit version can be set to match your own setup.
For the other dependencies, run
pip install -r requirements.txt
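After installation, a quick check can confirm that PyTorch is importable and whether it sees any GPUs. The snippet below is a minimal sketch and is not part of this repository; the function name `describe_environment` is hypothetical, and it uses only standard PyTorch calls.

```python
import importlib.util
import sys


def describe_environment() -> dict:
    """Collect basic facts about the active Python/PyTorch environment."""
    info = {
        "python": sys.version.split()[0],
        "torch": None,            # stays None if PyTorch is not installed
        "cuda_available": False,
        "gpu_count": 0,
    }
    # Only import torch if it can be found, so the check itself never crashes.
    if importlib.util.find_spec("torch") is None:
        return info
    import torch

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as False on a GPU machine, the cudatoolkit version chosen above may not match the installed driver.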
See DATA.md for instructions on downloading the data.
To train our image captioning models, run the script
bash train.sh MODEL_TYPE
by replacing MODEL_TYPE with one of [Sub_GC_MRNN, Sub_GC_Kar, Full_GC_Kar, Sub_GC_Flickr, Sub_GC_Sup_Flickr]. MODEL_TYPE specifies the dataset, the data split and the model used for training. See details below.
COCO Caption Dataset
Sub_GC_MRNN: train a sub-graph captioning model on M-RNN split (Table 2 in our paper)
Sub_GC_Kar: train a sub-graph captioning model on Karpathy split (Table 3 in our paper)
Full_GC_Kar: train a full-graph captioning model on Karpathy split (Table 3 in our paper)
Flickr30K Dataset
Sub_GC_Flickr: train a sub-graph captioning model (Table 4 & 5 in our paper)
Sub_GC_Sup_Flickr: train a supervised sub-graph captioning model (Table 5 in our paper)
You can set CUDA_VISIBLE_DEVICES in train.sh to specify which GPUs are used for model training (e.g., the default script uses 2 GPUs).
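The training options above can be summarised as a small lookup table, which also helps catch a mistyped MODEL_TYPE before a long job is launched. This is an illustrative sketch only: `TRAIN_MODEL_TYPES` and `train_command` are hypothetical names, and the table simply restates the list above.

```python
# MODEL_TYPE -> (dataset, split, model), restating the training list above.
# The split is None where the list above does not name one.
TRAIN_MODEL_TYPES = {
    "Sub_GC_MRNN": ("COCO Caption", "M-RNN", "sub-graph"),
    "Sub_GC_Kar": ("COCO Caption", "Karpathy", "sub-graph"),
    "Full_GC_Kar": ("COCO Caption", "Karpathy", "full-graph"),
    "Sub_GC_Flickr": ("Flickr30K", None, "sub-graph"),
    "Sub_GC_Sup_Flickr": ("Flickr30K", None, "supervised sub-graph"),
}


def train_command(model_type: str) -> list:
    """Return the argv for `bash train.sh MODEL_TYPE`, rejecting unknown types."""
    if model_type not in TRAIN_MODEL_TYPES:
        valid = ", ".join(TRAIN_MODEL_TYPES)
        raise ValueError(f"unknown MODEL_TYPE {model_type!r}; expected one of: {valid}")
    return ["bash", "train.sh", model_type]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root.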
The evaluation is divided into 2 steps: first generate captions, then evaluate them.
To generate captions, run the script
bash test.sh MODEL_TYPE
by replacing MODEL_TYPE with one of [Sub_GC_MRNN, Sub_GC_S_MRNN, Sub_GC_Kar, Full_GC_Kar, Sub_GC_Flickr, Sub_GC_Flickr_GRD, Sub_GC_Flickr_CTL, Sub_GC_Sup_Flickr_CTL]. MODEL_TYPE specifies the dataset, the data split and the model used for sentence generation. See details below.
COCO Caption Dataset
Sub_GC_MRNN: use the sub-graph captioning model (Sub-GC) on M-RNN split (Table 2 in our paper)
Sub_GC_S_MRNN: use Sub-GC with top-k sampling (Sub-GC-S) on M-RNN split (Table 2 in our paper)
Sub_GC_Kar: use the sub-graph captioning model (Sub-GC) on Karpathy split (Table 3 in our paper)
Full_GC_Kar: use the full-graph captioning model (Full-GC) on Karpathy split (Table 3 in our paper)
Flickr30K Dataset
Sub_GC_Flickr: use Sub-GC for top-1 caption accuracy evaluation (Table 4 in our paper)
Sub_GC_Flickr_GRD: use Sub-GC for grounding evaluation (Table 4 in our paper)
Sub_GC_Flickr_CTL: use Sub-GC for controllability evaluation (Table 5 in our paper)
Sub_GC_Sup_Flickr_CTL: use Sub-GC (Sup.) for controllability evaluation (Table 5 in our paper)
The inference results will be saved in a captions_*.npy file in the same folder as the model checkpoint (e.g., pretrained/sub_gc_MRNN). In the following instructions, $CAPTION_FILE denotes the name of the generated captions_*.npy file.
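Because the exact file name behind captions_*.npy depends on the run, it can be convenient to look it up rather than type it. The sketch below assumes only the naming pattern and folder layout described above; `find_caption_files` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path
from typing import List


def find_caption_files(checkpoint_dir) -> List[Path]:
    """Return captions_*.npy files in the checkpoint folder, newest first."""
    folder = Path(checkpoint_dir)
    # The pattern must match the whole file name, so ctl_captions_*.npy
    # (the controllability output) is not picked up here.
    matches = folder.glob("captions_*.npy")
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)
```

For example, `find_caption_files("pretrained/sub_gc_MRNN")[0].name` would give a candidate value for $CAPTION_FILE, provided at least one caption file has been generated.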
To compute the diversity scores, move the generated $CAPTION_FILE into the folder misc/diversity and run
cd misc/diversity
python diversity_score.py --input_file $CAPTION_FILE
To also evaluate mBLEU-4 (which takes much longer than the other metrics), run
cd misc/diversity
python diversity_score.py --input_file $CAPTION_FILE --evaluate_mB4
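The move-then-run pattern above can be scripted. The following is a rough sketch under the folder layout described here; `stage_for_diversity` is a hypothetical helper, and it only prepares the command rather than running it.

```python
import shutil
from pathlib import Path
from typing import List, Tuple


def stage_for_diversity(caption_file, repo_root=".", evaluate_mb4=False) -> Tuple[Path, List[str]]:
    """Move the caption file into misc/diversity and build the scoring command.

    Returns the working directory and the argv to run inside it.
    """
    target_dir = Path(repo_root) / "misc" / "diversity"
    if not target_dir.is_dir():
        raise FileNotFoundError(f"{target_dir} not found; is repo_root correct?")
    # shutil.move returns the final destination path of the moved file.
    staged = Path(shutil.move(str(caption_file), str(target_dir)))
    argv = ["python", "diversity_score.py", "--input_file", staged.name]
    if evaluate_mb4:
        argv.append("--evaluate_mB4")  # mBLEU-4 is much slower than the other metrics
    return target_dir, argv
```

The result can then be executed with `subprocess.run(argv, cwd=target_dir, check=True)`.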
In our paper, we report the top-1 accuracy of the best caption selected by sGPN+consensus. To reproduce the results, move the generated $CAPTION_FILE into the folder misc/consensus_reranking/hypotheses_mRNN and run:
cd misc/consensus_reranking
python cr_mRNN_demo.py --input_file $CAPTION_FILE --dataset coco --split MRNN --top_k 4
This will apply consensus reranking to the top 4 captions selected by our sGPN scores, as described in our paper. The --dataset and --split arguments specify the dataset (coco or flickr30k) and the split (MRNN or karpathy), respectively.
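To make the two-stage selection concrete, here is a toy sketch of the idea: keep the top-k candidates by score, then reorder them by agreement with a set of reference captions. It is not the code in cr_mRNN_demo.py. The Jaccard token overlap is a deliberately simple stand-in for whatever similarity that script uses, and the function names, the `(caption, score)` input format, and the `consensus_captions` argument are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two captions (stand-in similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rerank_top_k(
    candidates: Sequence[Tuple[str, float]],
    consensus_captions: Sequence[str],
    top_k: int = 4,
) -> List[str]:
    """Keep the top_k captions by score, then sort them by consensus similarity."""
    # Stage 1: select the top_k candidates by their (sGPN-like) score.
    top = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

    # Stage 2: rank the survivors by mean similarity to the reference captions.
    def consensus_score(caption: str) -> float:
        if not consensus_captions:
            return 0.0
        total = sum(jaccard(caption, ref) for ref in consensus_captions)
        return total / len(consensus_captions)

    ranked = sorted(top, key=lambda c: consensus_score(c[0]), reverse=True)
    return [caption for caption, _ in ranked]
```

In this sketch a caption with a lower score can still end up first, as long as it survives the top-k cut and agrees best with the references.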
If you want to evaluate the top-1 caption selected by our sGPN or the top-1 accuracy for Full-GC, set --only_sent_eval to 1 in test.sh and rerun the bash file. If you want to evaluate the oracle scores, which takes a few hours, set --only_sent_eval to 1 and add --orcle_num 1000 in test.sh, then rerun the bash file.
In our paper, we report the grounding scores of the best caption selected by sGPN+consensus. Reproducing the results requires 3 substeps:
1. Select the best caption by consensus reranking: use our sub-graph captioning model to generate captions (bash test.sh Sub_GC_Flickr_GRD), and apply consensus reranking to the top generated captions (see the instructions in the section on Top-1 Accuracy Evaluation). A file named consensus_rerank_ind.npy that contains the ranking indices will be generated in misc/consensus_reranking.
2. Collect the grounding results for the best caption: move consensus_rerank_ind.npy into the same folder as the model checkpoint (e.g., pretrained/sub_gc_flickr). Run bash test.sh Sub_GC_Flickr_GRD again; a file named grounding_file.json that contains the grounding results will be generated in the same folder as the model checkpoint.
3. Evaluate the grounding results: move grounding_file.json into misc/grounding and run cd misc/grounding; python grounding_score.py.
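Since the substeps above pass files between folders, it is easy to lose track of which one comes next. The sketch below reports progress purely from the file locations named above; `grounding_progress` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def grounding_progress(repo_root, checkpoint_dir) -> dict:
    """Report which grounding substeps appear complete, based on file locations."""
    repo, ckpt = Path(repo_root), Path(checkpoint_dir)
    rerank_src = repo / "misc" / "consensus_reranking" / "consensus_rerank_ind.npy"
    rerank_dst = ckpt / "consensus_rerank_ind.npy"
    ground_src = ckpt / "grounding_file.json"
    ground_dst = repo / "misc" / "grounding" / "grounding_file.json"
    return {
        # Substep 1 output exists (either not yet moved, or already moved).
        "rerank_indices_generated": rerank_src.is_file() or rerank_dst.is_file(),
        # Substep 2 precondition: indices sit next to the model checkpoint.
        "rerank_indices_in_checkpoint_dir": rerank_dst.is_file(),
        # Substep 2 output exists (either not yet moved, or already moved).
        "grounding_file_generated": ground_src.is_file() or ground_dst.is_file(),
        # Substep 3 precondition: grounding file sits in misc/grounding.
        "grounding_file_in_misc_grounding": ground_dst.is_file(),
    }
```

The first entry that is False indicates the substep to run next.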
This evaluation follows the implementation of the grounding evaluation repository, which measures grounding performance without beam search. Accordingly, we disable beam search for the grounding evaluation.
After running bash test.sh MODEL_TYPE with MODEL_TYPE set to Sub_GC_Flickr_CTL or Sub_GC_Sup_Flickr_CTL, an output file $CTL_CAPTION_FILE (e.g., ctl_captions_*.npy) will be generated in the same folder as the model checkpoint (e.g., pretrained/sub_gc_sup_flickr). This output file stores the predicted captions, which are ready for controllability evaluation.
To obtain the controllability scores, move that output file into the folder misc/controllability and run
cd misc/controllability
python controllability_score.py --input_file $CTL_CAPTION_FILE
This repository builds on Ruotian Luo's implementation for image captioning and on Graph-RCNN. Parts of the evaluation protocols were implemented based on several code repositories, including coco-caption, consensus reranking, grounding evaluation, and controllability evaluation.
If you are using our code, please consider citing our paper.
@inproceedings{zhong2020comprehensive,
title={Comprehensive Image Captioning via Scene Graph Decomposition},
author={Zhong, Yiwu and Wang, Liwei and Chen, Jianshu and Yu, Dong and Li, Yin},
booktitle={ECCV},
year={2020}
}