This is the implementation of our paper entitled "Action Knowledge for Video Captioning with Graph Neural Networks".
Our approach to video captioning leverages actions as edge features in a graph neural network (GNN), with objects represented as nodes. By integrating object-action relationships into the GNN, our method enriches the visual representation and generates more precise captions. We further improve performance by combining the proposed edge representation with a grid-based node representation: overlapping the grids lets the model capture more comprehensive information about the objects.
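As a rough sketch of this idea (not our actual training code; the layer structure, names, and dimensions are illustrative assumptions), a single message-passing step in which action edge features modulate the messages exchanged between object nodes could look like:

```python
import torch
import torch.nn as nn

class ActionEdgeLayer(nn.Module):
    """One GNN message-passing step: object nodes, action edge features."""

    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, node_dim), nn.ReLU())
        self.update_mlp = nn.Sequential(
            nn.Linear(2 * node_dim, node_dim), nn.ReLU())

    def forward(self, nodes, edges, adj):
        # nodes: (N, node_dim) object features
        # edges: (N, N, edge_dim) action features between object pairs
        # adj:   (N, N) binary adjacency mask
        n = nodes.size(0)
        src = nodes.unsqueeze(1).expand(n, n, -1)  # sender node i
        dst = nodes.unsqueeze(0).expand(n, n, -1)  # receiver node j
        msg = self.message_mlp(torch.cat([src, dst, edges], dim=-1))
        msg = (msg * adj.unsqueeze(-1)).sum(dim=0)  # aggregate per receiver
        return self.update_mlp(torch.cat([nodes, msg], dim=-1))
```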
Experiments on MSVD and MSR-VTT demonstrate that our method improves video captioning both quantitatively and qualitatively.
An illustration of our proposed action-graph model with overlapping grids is shown below:
Create the conda environment from the provided environment.yml file.
This conda environment was tested on NVIDIA A6000 and NVIDIA RTX 3090 GPUs.
The details of each dependency can be found in the environment.yml file.
conda env create -f environment.yml
conda activate action_graph_env
pip install git+https://github.com/Maluuba/nlg-eval.git@master
pip install pycocoevalcap
Install PyTorch following the instructions on this page: https://pytorch.org/get-started/locally
pip install opencv-python
pip install seaborn
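After installation, a quick sanity check (a minimal sketch; any equivalent script will do) confirms that PyTorch is importable and can see your GPU:

```python
import torch

# Confirm the environment is working and a GPU is visible.
print(torch.__version__, torch.cuda.is_available())
```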
├── dataset
│ ├── MSVD
│ │ ├── raw # put the 1970 raw videos in here
│ │ ├── captions
│ │ ├── raw-captions_mapped.pkl # mapping between video ids and captions
│ │ ├── train_list_mapping.txt
│ │ ├── val_list_mapping.txt
│ │ ├── test_list_mapping.txt
│ ├── MSRVTT
│ │ ├── raw # put the 10000 raw videos in here
│ │ ├── msrvtt.csv # list of video id in msrvtt dataset
│ │ ├── MSRVTT_data.json # metadata of msrvtt dataset, which includes video url, video id, and caption
MSR-VTT raw videos can be downloaded from this link. We provide the captions in the dataset/MSRVTT folder.
MSVD raw videos can be downloaded from this link. We provide the captions in the dataset/MSVD folder.
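To verify the caption mapping is in place, a minimal sketch (the exact value structure is an assumption; only the id-to-captions mapping is documented above):

```python
import pickle

# raw-captions_mapped.pkl maps each video id to its reference captions.
with open("dataset/MSVD/raw-captions_mapped.pkl", "rb") as f:
    captions = pickle.load(f)

# Peek at one entry (assumes each value is a list of captions).
video_id, refs = next(iter(captions.items()))
print(video_id, refs[0])
```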
├── model
│ ├── i3d
├── modules # Copy the files from CLIP4Clip repository: https://github.com/ArrowLuo/CLIP4Clip/tree/master/modules
├── pretrained
│ ├── [trained CLIP4Clip model].bin # Train your own PyTorch CLIP4Clip model
│ ├── rgb_imagenet.pt # Download I3D model: https://github.com/piergiaj/pytorch-i3d/blob/master/models/rgb_imagenet.pt
├── utility # Some helper functions to generate the features
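For example, loading the I3D backbone could look like the sketch below. The module path is an assumption (it presumes pytorch_i3d.py from https://github.com/piergiaj/pytorch-i3d is placed under model/i3d); InceptionI3d and rgb_imagenet.pt come from that repository, as referenced in the tree above.

```python
import torch
# InceptionI3d is defined in pytorch_i3d.py from the pytorch-i3d repo
# (assumed copied into model/i3d).
from model.i3d.pytorch_i3d import InceptionI3d

i3d = InceptionI3d(400, in_channels=3)  # Kinetics-400 classification head
i3d.load_state_dict(torch.load("pretrained/rgb_imagenet.pt"))
i3d.eval()  # extract action features without gradient updates
```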
Notes:
- Please make sure you have copied modules from CLIP4Clip https://github.com/ArrowLuo/CLIP4Clip/tree/master/modules into feature_extractor/modules.
- Please make sure you have downloaded rgb_imagenet.pt into feature_extractor/pretrained.
- Please change the args in each notebook as required, e.g., args.msvd = True for MSVD and args.msvd = False for MSR-VTT, as in the snippet below.
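For instance, inside a notebook (a minimal sketch; the notebooks define their own args object):

```python
from types import SimpleNamespace

# The notebooks expose a config object named `args`; toggle the dataset flag:
args = SimpleNamespace(msvd=True)  # True for MSVD, False for MSR-VTT
```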
Create the grid-based action graph (a sketch of the overlapping-grid idea is shown below).
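As a minimal sketch of how overlapping grid regions can be generated from a frame (the 4x4 layout and 1.5x enlargement factor are assumptions, not the paper's exact configuration):

```python
def overlapping_grid_boxes(height, width, rows=4, cols=4, scale=1.5):
    """Return rows*cols boxes [x1, y1, x2, y2]; each cell is enlarged by
    `scale` around its center so that neighbouring cells overlap."""
    boxes = []
    cell_h, cell_w = height / rows, width / cols
    for r in range(rows):
        for c in range(cols):
            cy, cx = (r + 0.5) * cell_h, (c + 0.5) * cell_w
            half_h, half_w = 0.5 * scale * cell_h, 0.5 * scale * cell_w
            boxes.append([max(0.0, cx - half_w), max(0.0, cy - half_h),
                          min(float(width), cx + half_w),
                          min(float(height), cy + half_h)])
    return boxes

# Example: 16 overlapping regions for a 224x224 frame.
print(overlapping_grid_boxes(224, 224)[:2])
```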
Next, set up the BERT vocabulary and the pretrained UniVL weight:
mkdir modules/bert-model
cd modules/bert-model/
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
mv bert-base-uncased-vocab.txt vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar -xvf bert-base-uncased.tar.gz
rm bert-base-uncased.tar.gz
cd ../../
mkdir -p ./weight
wget -P ./weight https://github.com/microsoft/UniVL/releases/download/v0/univl.pretrained.bin
Before training, open the shell scripts in the scripts folder and adjust the machine-specific parameters (e.g., batch size and number of GPUs) to your hardware.
Train on MSVD:
cd scripts/
./msvd_train_GNN.sh
Train on MSR-VTT:
cd scripts/
./msrvtt_train_GNN.sh
After training finishes, the best checkpoint across all epochs is automatically evaluated on the test set. To evaluate a checkpoint from a specific epoch instead, edit the training shell script: set INIT_MODEL_PATH to the location of the desired checkpoint and replace --do_train with --do_eval.
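For example (the variable and flags come from the scripts above; the checkpoint path is purely illustrative):

```bash
# Hypothetical edit inside scripts/msvd_train_GNN.sh:
INIT_MODEL_PATH=ckpts/msvd/pytorch_model.bin.10   # illustrative path
# ...then pass --do_eval instead of --do_train to the launch command.
```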
The comparison with existing methods and the ablation study of our method can be found in our paper.
Results on MSVD:

Method | CLIP Model | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|
Ours (Action + Object) | ViT-B/32 | 62.56 | 41.53 | 78.62 | 120.64 |
Ours (Action + Grid) | ViT-B/32 | 62.90 | 41.81 | 78.80 | 119.07 |
Ours (Action + Grid) | ViT-B/16 | 64.07 | 42.41 | 79.72 | 124.18 |
Results on MSR-VTT:

Method | CLIP Model | BLEU@4 | METEOR | ROUGE-L | CIDEr |
---|---|---|---|---|---|
Ours (Action + Object) | ViT-B/32 | 48.31 | 31.35 | 65.34 | 60.00 |
Ours (Action + Grid) | ViT-B/32 | 49.10 | 31.57 | 65.52 | 61.27 |
Ours (Action + Grid) | ViT-B/16 | 51.02 | 32.19 | 66.55 | 63.02 |
Our code is developed based on https://github.com/microsoft/UniVL, which in turn builds on https://github.com/huggingface/transformers/tree/v0.4.0 and https://github.com/antoine77340/howto100m.
If our paper helps your research, please cite it as follows:
@article{Hendria2023,
author = {W. F. Hendria and V. Velda and B. H. H. Putra and F. Adzaka and C. Jeong},
title = {Action Knowledge for Video Captioning with Graph Neural Networks},
journal = {J. King Saud Univ.-Comput. Inf. Sci.},
volume = {35},
number = {4},
pages = {50--62},
month = apr,
year = {2023}
}