In this paper, we leverage the human perception process, which involves interaction between vision and language, to generate coherent paragraph descriptions of untrimmed videos. We propose vision-language (VL) features consisting of two modalities: (i) a vision modality that captures the global visual content of the entire scene and (ii) a language modality that extracts descriptions of scene elements, covering both human and non-human objects (e.g., animals, vehicles) and both visual and non-visual elements (e.g., relations, activities). Furthermore, we propose to train VLCap under a contrastive learning VL loss. Experiments and ablation studies on the ActivityNet Captions and YouCookII datasets show that VLCap outperforms existing state-of-the-art (SOTA) methods on both accuracy and diversity metrics.
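For readers unfamiliar with contrastive VL objectives, the sketch below shows a generic symmetric InfoNCE-style loss over paired vision and language embeddings. It is an illustration only and is not necessarily the formulation used in the paper or in this repository; the function name `contrastive_vl_loss` and the temperature value are assumptions made for the example.

```python
import math

def _normalize(v):
    # L2-normalize a vector; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_vl_loss(vision, language, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the repo's API).

    vision[i] and language[i] are assumed to describe the same video segment.
    """
    v = [_normalize(x) for x in vision]
    t = [_normalize(x) for x in language]
    # Cosine similarity of every vision/language pair, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # language-to-vision direction
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
```

The loss is small when each vision embedding is most similar to its own language embedding and grows when the pairs are mismatched.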
Clone this repository
git clone https://github.com/UARK-AICV/VLCAP.git
cd VLCAP
Prepare Conda environment
conda env create -f environment.yml
conda activate pytorch
Set up PYTHONPATH
Note that you need to run the following command each time you start a new session.
source setup.sh
Download the features from Google Drive (env feature and lang feature) and unzip them in data/anet:
mkdir -p data/anet && cd data/anet
unzip anet_c3d
unzip anet_clip_b16
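Before training, it can help to confirm that the features were extracted. The sketch below assumes each archive unpacks into a folder named after itself (`anet_c3d`, `anet_clip_b16`) under data/anet; that layout is an assumption, so adjust the names if your extracted folders differ. `missing_feature_dirs` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

def missing_feature_dirs(root, names=("anet_c3d", "anet_clip_b16")):
    """Return the expected feature folders that are absent or empty under root.

    The default folder names are assumed from the archive names.
    """
    root = Path(root)
    missing = []
    for name in names:
        folder = root / name
        # A folder counts as missing if it does not exist or contains nothing.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing

if __name__ == "__main__":
    # Run from the repository root; an empty list means both folders were found.
    print(missing_feature_dirs("data/anet"))
```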
To train our model on ActivityNet Captions or YouCook2:
bash scripts/train.sh [anet/yc2] [true/false]
Here you can specify the dataset (ActivityNet: anet, or YouCook2: yc2) and whether to use the proposed language feature (true/false).
The training log and model will be saved at results/anet_re_*.
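If you have several runs, the following sketch picks the most recently modified directory matching results/anet_re_*. `latest_run_dir` is a hypothetical helper, not part of this repository, and it assumes the newest modification time corresponds to the run you want.

```python
import glob
import os

def latest_run_dir(results_dir="results", pattern="anet_re_*"):
    """Return the most recently modified directory matching pattern, or None."""
    candidates = [
        p for p in glob.glob(os.path.join(results_dir, pattern))
        if os.path.isdir(p)
    ]
    if not candidates:
        return None
    # Newest modification time wins.
    return max(candidates, key=os.path.getmtime)

if __name__ == "__main__":
    run = latest_run_dir()
    # Print only the directory name, e.g. for use with the scripts in scripts/.
    print(os.path.basename(run) if run else "no run directory found")
```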
Once you have a trained model, you can follow the instructions below to generate captions.
Generate captions
bash scripts/translate_greedy.sh anet_re_* [val/test]
Replace anet_re_* with your own model directory name.
The generated captions are saved at results/anet_re_*/greedy_pred_[val/test].json
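To take a quick look at the generated file before evaluation, a sketch like the one below loads it and reports only its top-level shape. The internal schema of the prediction file is not documented here, so the code deliberately assumes nothing beyond valid JSON; `describe_predictions` is a hypothetical helper.

```python
import json

def describe_predictions(path):
    """Load a JSON file and report its top-level type and size, assuming no schema."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    summary = {"type": type(data).__name__}
    if isinstance(data, dict):
        summary["keys"] = sorted(data.keys())
    elif isinstance(data, list):
        summary["length"] = len(data)
    return summary
```

Call it with the path to your own prediction file, replacing anet_re_* with your model directory name.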
Evaluate generated captions
bash scripts/eval.sh anet [val/test] results/anet_re_*/greedy_pred_[val/test].json
The results should be comparable to those presented in Table 5 of the paper.
If you find this code useful for your research, please cite our papers:
@INPROCEEDINGS{kashu_vlcap,
author={Yamazaki, Kashu and Truong, Sang and Vo, Khoa and Kidd, Michael and Rainwater, Chase and Luu, Khoa and Le, Ngan},
booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
title={VLCAP: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning},
year={2022},
volume={},
number={},
pages={3656-3661},
doi={10.1109/ICIP46576.2022.9897766}}
@article{kashu_vltint,
title={VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning},
volume={37},
url={https://ojs.aaai.org/index.php/AAAI/article/view/25412},
DOI={10.1609/aaai.v37i3.25412},
number={3},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Yamazaki, Kashu and Vo, Khoa and Truong, Quang Sang and Raj, Bhiksha and Le, Ngan},
year={2023},
month={Jun.},
pages={3081-3090}
}
We acknowledge the following open-source projects on which our work is based: