JonghwanMun / LGI4temporalgrounding

Repository for the CVPR-20 paper "Local-Global Video-Text Interactions for Temporal Grounding"

Local-Global Video-Text Interactions for Temporal Grounding

PyTorch implementation of Local-Global Interaction (LGI) network for temporal grounding given a text query.

Jonghwan Mun, Minsu Cho, Bohyung Han

Figure: Overall architecture of our algorithm (LGI). Given a video and a text query, we encode them to obtain segment-level visual features as well as word-level and sentence-level textual features. We extract a set of semantic phrase features from the query using the Sequential Query Attention Network (SQAN). Then, we obtain semantics-aware segment features based on the extracted phrase features via local-global video-text interactions. Finally, we directly predict the time interval from the summarized video features using temporal attention. We train the model using a regression loss and two additional attention-related losses.
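To make the "summarized video features using temporal attention" step more concrete, here is a minimal sketch of attention-weighted pooling over segment features in plain Python. It is not the repository's implementation: the function names `softmax` and `attention_pool` are hypothetical, and the attention scores are passed in directly, whereas a real model would produce them with learned layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(segment_feats, scores):
    """Summarize T segment features as an attention-weighted sum.

    segment_feats: list of T feature vectors (lists of equal length D).
    scores: list of T unnormalized attention scores (hypothetical inputs).
    Returns the pooled D-dimensional vector and the attention weights.
    """
    weights = softmax(scores)
    dim = len(segment_feats[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, segment_feats))
        for d in range(dim)
    ]
    return pooled, weights

# Three segments with 2-D features and equal scores: the weights are uniform,
# so the pooled vector is the plain average of the segment features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(feats, [0.0, 0.0, 0.0])
```

In a full model, the pooled vector would then be fed to a regressor that outputs the start and end of the time interval.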

1. Dependencies

This repository is implemented in PyTorch and uses Anaconda to manage the environment.
Refer to "Setting environment with anaconda" or use the Docker image choco1916/envs:temporal_grounding.

2. Prepare data

Running scripts/prepare_data.sh downloads all data, including annotations, video features (I3D for Charades-STA, C3D for ActivityNet Captions), and pre-processed annotation information.

bash scripts/prepare_data.sh

3. Evaluating pre-trained models

| Dataset              | R@0.3 | R@0.5 | R@0.7 | mIoU  |
|----------------------|-------|-------|-------|-------|
| ActivityNet Captions | 58.48 | 41.65 | 24.10 | 41.48 |
| Charades-STA         | 72.18 | 59.17 | 35.32 | 50.93 |
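For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how R@t and mIoU are commonly computed for temporal grounding. It is not the repository's evaluation code: the function names are hypothetical, and counting a prediction as correct when IoU >= t (rather than strictly greater) is an assumption.

```python
def temporal_iou(pred, gt):
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Return R@t (percentage of predictions with IoU >= t) and mIoU (in %)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    results = {
        f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    results["mIoU"] = 100.0 * sum(ious) / n
    return results

# Toy example: one exact match, one partial overlap, one complete miss.
preds = [(0.0, 10.0), (5.0, 15.0), (0.0, 4.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (6.0, 10.0)]
metrics = evaluate(preds, gts)
```

Here the three IoUs are 1.0, 1/3, and 0, so two of the three predictions pass the 0.3 threshold and only one passes 0.5 and 0.7.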

4. Training models from scratch

By default, the code loads all data (~30GB for ActivityNet Captions and ~3GB for Charades-STA) into RAM for fast training. To disable this behavior, set in_memory to FALSE in the config file (config.yaml).
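The trade-off behind in_memory can be illustrated with a small sketch. This is not the repository's data loader: `FeatureStore`, `load_fn`, and `fake_load` are hypothetical names, and the "disk read" is simulated by a function that records each call.

```python
class FeatureStore:
    """Hypothetical feature cache: preload everything or load on demand."""

    def __init__(self, keys, load_fn, in_memory=True):
        self.load_fn = load_fn
        self.in_memory = in_memory
        # Preloading trades RAM for speed: each feature is read exactly once.
        self.cache = {k: load_fn(k) for k in keys} if in_memory else None

    def __getitem__(self, key):
        if self.in_memory:
            return self.cache[key]
        # Lazy mode keeps RAM usage low but reads from storage on every access.
        return self.load_fn(key)

calls = []

def fake_load(key):
    """Stand-in for reading one video's features from disk."""
    calls.append(key)
    return [0.0] * 4

# Eager mode: two reads at construction, none during the three accesses.
eager = FeatureStore(["v1", "v2"], fake_load, in_memory=True)
for _ in range(3):
    eager["v1"]
eager_reads = len(calls)

# Lazy mode: no reads at construction, one read per access.
lazy = FeatureStore(["v1", "v2"], fake_load, in_memory=False)
for _ in range(3):
    lazy["v1"]
lazy_reads = len(calls) - eager_reads
```

With features repeatedly accessed over many epochs, the eager mode avoids repeated reads at the cost of holding the whole dataset in memory.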

5. Visualization

Visualization requires the moviepy package as well as the raw videos.

# Path to directory for raw videos
ActivityNet Captions: data/anet/raw_videos/validation/
Charades-STA: data/charades/raw_videos/

Refer to visualization.ipynb
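Before opening the notebook, it can help to confirm that the raw-video directories listed above are in place. The snippet below is a small convenience sketch, not part of the repository; the helper name `missing_video_dirs` is hypothetical, and the paths are taken from the listing above, relative to the repository root.

```python
from pathlib import Path

# Raw-video directories expected by the visualization step.
RAW_VIDEO_DIRS = {
    "ActivityNet Captions": Path("data/anet/raw_videos/validation"),
    "Charades-STA": Path("data/charades/raw_videos"),
}

def missing_video_dirs(root="."):
    """Return the dataset names whose raw-video directory is absent under root."""
    return [
        name
        for name, rel in RAW_VIDEO_DIRS.items()
        if not (Path(root) / rel).is_dir()
    ]

if __name__ == "__main__":
    for name in missing_video_dirs():
        print(f"Missing raw videos for {name}: {RAW_VIDEO_DIRS[name]}")
```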

6. Citation

If you use this code in a publication, please cite our paper.

@inproceedings{mun2020LGI,
    title     = "{Local-Global Video-Text Interactions for Temporal Grounding}",
    author    = {Mun, Jonghwan and Cho, Minsu and Han, Bohyung},
    booktitle = {CVPR},
    year      = {2020}
}