To reproduce the results reported in the paper, simply run
bash eval_flickr.sh
for Flickr30k-Entities and
bash eval_coco.sh
for MSCOCO.
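Both scripts wrap the repository's evaluation entry point. If you want to evaluate a particular checkpoint directly, something along the following lines may work; this is only a sketch — the flag names are assumed from the self-critical.pytorch codebase this repository builds on and the checkpoint paths are placeholders — so check eval_flickr.sh for the exact invocation used here.

```bash
# Hypothetical direct evaluation of a trained checkpoint.
# Flags assumed from self-critical.pytorch; paths are placeholders.
python eval.py \
  --model log/sc-ground-CE-scan-sup-0.1kl/model-best.pth \
  --infos_path log/sc-ground-CE-scan-sup-0.1kl/infos_sc-ground-CE-scan-sup-0.1kl-best.pkl \
  --dump_images 0 --num_images -1 --language_eval 1 --beam_size 1
```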
In the first training stage, run a command like:
python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1
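The grounding-specific options here are --att_supervise and --att_supervise_weight, which control the attention-supervision (KL) term added on top of the cross-entropy loss. If you want to probe other supervision weights, a sweep could look like the sketch below; the weight values are illustrative only (not the paper's setting), and the remaining flags are copied unchanged from the command above.

```bash
# Hypothetical sweep over the attention-supervision weight (example values only).
# COMMON collects the flags shared with the stage-1 command above.
COMMON=(--caption_model topdown --input_json data/flickrtalk.json
  --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att
  --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5
  --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0
  --scheduled_sampling_start 0 --save_checkpoint_every 1000 --val_images_use -1
  --max_epochs 30 --att_supervise True)

for w in 0.05 0.1 0.2; do
  python train.py --id "CE-scan-sup-${w}kl" "${COMMON[@]}" \
    --att_supervise_weight "$w" \
    --checkpoint_path "log/CE-scan-sup-${w}kl"
done
```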
In the second training stage, run a command like:
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
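The second stage resumes from the stage-1 weights via --start_from and writes its own checkpoints to the directory given by --checkpoint_path. A hypothetical pre-flight check (not part of the repo) makes that dependency explicit when the two stages are launched from a script:

```bash
# Hypothetical pre-flight check: stage 2 loads weights from the stage-1
# checkpoint directory via --start_from, so it must exist already.
STAGE1_DIR=log/CE-scan-sup-0.1kl
if [ ! -d "$STAGE1_DIR" ]; then
  echo "Missing $STAGE1_DIR -- run the first training stage first." >&2
  exit 1
fi
```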
If you find this code useful for your research, please cite:
@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}
This repository is built upon self-critical.pytorch, SCAN, and grounded-video-description. Thanks to the authors for releasing their code.