VLG-Net: Video-Language Graph Matching Networks for Video Grounding

Introduction

Official repository for VLG-Net: Video-Language Graph Matching Networks for Video Grounding. [ICCVW Paper]

The paper was accepted to the first edition of the ICCV workshop: AI for Creative Video Editing and Understanding (CVEU).

(Figure: VLG-Net architecture overview.)

Installation

Clone the repository and move into the folder:

git clone https://github.com/Soldelli/VLG-Net.git
cd VLG-Net

Install the environment:

conda env create -f environment.yml

If installation fails, please follow the instructions in doc/environment.md.
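
If a manual setup is needed instead, the sketch below is a minimal fallback; the Python and package versions are assumptions, so copy the exact dependencies from environment.yml:

conda create -n vlg python=3.7
conda activate vlg
# Placeholder install of the main dependency; see environment.yml for the full pinned list
conda install pytorch torchvision -c pytorch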

Data

Download the following resources and extract each archive into the destination folder listed in the table below.

| Resource | Download Link | File Size | Destination Folder |
|---|---|---|---|
| StandfordCoreNLP-4.0.0 | link | ~0.5GB | ./datasets/ |
| TACoS | link | ~0.5GB | ./datasets/ |
| ActivityNet-Captions | link | ~29GB | ./datasets/ |
| DiDeMo | link | ~13GB | ./datasets/ |
| GCNeXt warmup | link | ~0.1GB | ./datasets/ |
| Pretrained Models | link | ~0.1GB | ./models/ |
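
For example, assuming each resource arrives as a .tar.gz archive (the file names below are hypothetical; match them to the actual downloads), extraction looks like:

mkdir -p datasets models
tar -xzf tacos.tar.gz -C ./datasets/
tar -xzf pretrained_models.tar.gz -C ./models/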


The folder structure should be as follows:

.
├── configs
│
├── datasets
│   ├── activitynet1.3
│   │    ├── annotations
│   │    └── features
│   ├── didemo
│   │    ├── annotations
│   │    └── features
│   ├── tacos
│   │    ├── annotations
│   │    └── features
│   ├── gcnext_warmup
│   └── standford-corenlp-4.0.0
│
├── doc
│
├── lib
│   ├── config
│   ├── data
│   ├── engine
│   ├── modeling
│   ├── structures
│   └── utils
│
├── models
│   ├── activitynet
│   └── tacos
│
├── outputs
│
└── scripts

Training

Copy and paste the following commands into the terminal.

Load environment:

conda activate vlg
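
Then launch training. The entry point and config name below are assumptions for illustration; check the repository root and the configs/ folder for the actual script and config files:

python train_net.py --config-file configs/tacos.yaml OUTPUT_DIR outputs/tacos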

Evaluation

For simplicity, we provide scripts that automatically run inference with the pretrained models. See the script details if you want to run inference with a different model.

Load environment:

conda activate vlg

Then run one of the following scripts to launch the evaluation.
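
For example (hypothetical script name; the actual files live in scripts/):

bash scripts/eval_tacos.sh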

Expected results:

After cleaning the code and fixing a couple of minor bugs, performance changed slightly with respect to the numbers reported in the paper. See the tables below.

| ActivityNet | Rank1@0.5 | Rank1@0.7 | Rank5@0.5 | Rank5@0.7 |
|---|---|---|---|---|
| Paper | 46.32 | 29.82 | 77.15 | 63.33 |
| Current | 46.32 | 29.79 | 77.19 | 63.36 |


| TACoS | Rank1@0.1 | Rank1@0.3 | Rank1@0.5 | Rank5@0.1 | Rank5@0.3 | Rank5@0.5 |
|---|---|---|---|---|---|---|
| Paper | 57.21 | 45.46 | 34.19 | 81.80 | 70.38 | 56.56 |
| Current | 57.16 | 45.56 | 34.14 | 81.48 | 70.13 | 56.34 |


Citation

If any part of our paper or code is helpful to your work, please cite it with:

@inproceedings{soldan2021vlg,
  title={VLG-Net: Video-language graph matching network for video grounding},
  author={Soldan, Mattia and Xu, Mengmeng and Qu, Sisi and Tegner, Jesper and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={3224--3234},
  year={2021}
}