This is the official repository of our paper "Align and Attend: Multimodal Summarization with Dual Contrastive Losses".
You can create the conda environment by running:
conda create -n a2summ python=3.8.13
conda activate a2summ
pip install torch==1.12.0+cu113 torchvision==0.13.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install tensorboard
pip install rouge-score==0.1.2
pip install scipy ortools h5py pyyaml
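As an optional sanity check, you can verify that PyTorch was installed correctly and that the GPU is visible:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"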
We evaluate A2Summ on two multimodal summarization with multimodal output datasets (CNN, Daily_Mail) and two standard video summarization datasets (SumMe, TVSum).
We also collected a large-scale multimodal summarization dataset, BLiSS, which consists of livestream videos and transcripts with annotated summaries.
Before running the code, please download the pre-processed datasets from the Google Drive link. Unzip them under the data/ folder and make sure the directory structure matches the layout below.
├── data
│   ├── BLiSS
│   │   ├── annotation
│   │   └── feature
│   ├── CNN
│   │   ├── annotation
│   │   └── feature
│   ├── Daily_Mail
│   │   ├── annotation
│   │   └── feature
│   ├── SumMe
│   │   ├── caption
│   │   ├── feature
│   │   └── splits.yml
│   └── TVSum
│       ├── caption
│       ├── feature
│       └── splits.yml
For the BLiSS dataset, due to copyright issues, we only provide the extracted video/thumbnail features instead of the original videos/thumbnails. If you need access to the original videos, please email me (bohe@umd.edu) for the public URLs of each video.
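For reference, a minimal sketch of inspecting the extracted features, assuming they are stored as HDF5 files (h5py is in the dependency list); the file name and key names below are hypothetical, so check the actual files under data/BLiSS/feature for the real layout:

import h5py

# Hypothetical file name; inspect the actual files under data/BLiSS/feature.
with h5py.File('data/BLiSS/feature/example.h5', 'r') as f:
    print(list(f.keys()))  # list top-level groups/datasets to discover the layout
    # feats = f['video_features'][...]  # load an array once the real key is known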
We train the model on a single GTX 1080 Ti GPU. To train the model on a given dataset, execute the following command:
python train.py --dataset ${dataset}
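Here ${dataset} presumably matches one of the dataset directory names shown above. For example, to train on SumMe:
python train.py --dataset SumMe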
To evaluate with our pre-trained models, first download the checkpoints into the "saved_model" directory and pass the path via the --checkpoint flag:
python train.py --dataset ${dataset} \
--test --checkpoint saved_model/${dataset}
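For example, substituting SumMe for ${dataset} (assuming the checkpoint is named after the dataset, as in the command above):
python train.py --dataset SumMe --test --checkpoint saved_model/SumMe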
If you find our code or our paper useful for your research, please star this repo and cite the following paper:
@inproceedings{he2023a2summ,
  title     = {Align and Attend: Multimodal Summarization with Dual Contrastive Losses},
  author    = {He, Bo and Wang, Jun and Qiu, Jielin and Bui, Trung and Shrivastava, Abhinav and Wang, Zhaowen},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}
Our code references the repositories below.