
Official Implementation of SnAG (CVPR 2024)

SnAG: Scalable and Accurate Video Grounding (CVPR 2024)

Introduction

This code repo implements SnAG, a scalable and accurate model for long-form video grounding, i.e., localizing moments within an untrimmed long video based on text descriptions. SnAG features a minimalist, late-fusion design for scalable inference, while supporting video-centric sampling for scalable training. Without bells and whistles, SnAG achieves 44.86% R1@0.5 and 70.66% R5@0.5 on TACoS, outperforming the previous state of the art by 8.53 and 12.75 absolute percentage points, respectively. Further, SnAG demonstrates strong results on Ego4D-NLQ (13.57% mean R1 and 32.92% mean R5) and the more challenging MAD dataset (5.55% R1@0.5 and 13.75% R5@0.5). Our paper is accepted to CVPR 2024 and an arXiv version can be found at this link.
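
For intuition, below is a minimal sketch of what video-centric sampling means in practice: each training example pairs one video with a group of its text queries, so the video is loaded and encoded once and shared across many queries, rather than once per (video, query) pair as in text-centric sampling. This is an illustration only (the function and field names are ours), not the repository's actual data pipeline.

```python
# Illustrative sketch of video-centric sampling (assumed structure, not the
# repo's loader): group a video's queries together so the video features are
# processed once and shared by all queries in the group.
import random

def video_centric_batches(annotations, queries_per_video=8):
    """annotations: dict mapping video_id -> list of (query_text, (start_s, end_s))."""
    batches = []
    for video_id, pairs in annotations.items():
        pairs = list(pairs)
        random.shuffle(pairs)
        for i in range(0, len(pairs), queries_per_video):
            chunk = pairs[i:i + queries_per_video]
            batches.append({
                "video_id": video_id,                     # video encoded once per batch
                "queries": [q for q, _ in chunk],         # multiple queries share the video
                "segments": [seg for _, seg in chunk],    # ground-truth (start, end) in seconds
            })
    random.shuffle(batches)
    return batches
```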

Related projects:

ActionFormer: Localizing Moments of Actions with Transformers
Chenlin Zhang, Jianxin Wu, Yin Li
ECCV 2022

Visualization

We provide visualizations of localized moments in Ego4D-NLQ videos.

Note that the ground-truth moments are determined by human annotations and may therefore contain errors.

Changelog

Code Overview

The structure of this code repo is heavily inspired by ActionFormer. Some of the main components are

Installation

To Reproduce Our Results on TACoS

Download Features and Annotations

Details: The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of ~30 fps and a stride of 4 frames. This gives one feature vector per 4/30 ~= 0.1333 seconds. In practice, SnAG uses 4x-subsampled C3D features (i.e., the effective stride is 16 frames) for fair comparison with baselines.
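
As a sanity check on the numbers above, the snippet below converts a timestamp in seconds to an index on the raw and 4x-subsampled feature grids. This is a standalone illustration; the helper name is ours, not the repo's.

```python
# Feature-grid arithmetic for the TACoS C3D features described above
# (illustrative helper, not part of the repo).
fps = 30.0
raw_step = 4 / fps      # 4-frame stride -> one feature every ~0.133 s
eff_step = 16 / fps     # 4x subsampling -> effective 16-frame stride, ~0.533 s

def time_to_index(t_seconds, step):
    """Nearest feature index on a grid with the given step (seconds per feature)."""
    return round(t_seconds / step)

# A moment boundary at 12.8 s falls on raw feature ~96 and subsampled feature ~24.
print(time_to_index(12.8, raw_step), time_to_index(12.8, eff_step))
```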

Unpack Features and Annotations

Training and Evaluation

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for TACoS. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

| Method | R1@0.3 | R1@0.5 | R5@0.3 | R5@0.5 |
|--------|--------|--------|--------|--------|
| SnAG   | 55.51  | 45.14  | 81.58  | 70.31  |

To Reproduce Our Results on Ego4D-NLQ

Download Features and Annotations

Details: We use the official SlowFast features from here. They are extracted using the SlowFast model pretrained on Kinetics 400, given clips of 32 frames with a frame rate of 30 fps and a stride of 16 frames. This gives one feature vector per 16/30 ~= 0.533 seconds. The EgoVLP features are extracted using the EgoVLP model checkpoint, given clips of 32 frames with a frame rate of 30 fps and a stride of 8 frames. This gives one feature vector per 8/30 ~= 0.267 seconds. In practice, SnAG uses 2x-subsampled EgoVLP features (i.e., the effective stride is 16 frames) for fair comparison with baselines.
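
For reference, the 2x subsampling of EgoVLP features amounts to keeping every other 8-frame-stride feature, which lands on the same 16-frame (~0.533 s) grid as the SlowFast features. A minimal sketch, using a random stand-in array rather than real features:

```python
# 2x temporal subsampling of EgoVLP features (illustration only; the array
# below is a random stand-in for real features with an 8-frame stride).
# Keeping every 2nd feature yields an effective 16-frame stride, matching
# the SlowFast feature grid.
import numpy as np

egovlp = np.random.randn(120, 256).astype(np.float32)   # (T, D) placeholder features
egovlp_2x = egovlp[::2]                                   # (T // 2, D), 16-frame stride
print(egovlp.shape, "->", egovlp_2x.shape)                # (120, 256) -> (60, 256)
```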

Unpack Features and Annotations

Training and Evaluation

[Optional] Evaluating Our Pre-trained Model

We also provide pre-trained models for Ego4D-NLQ. The model using SlowFast + BERT features with all training logs can be downloaded from this Google Drive link. The model using EgoVLP features with all training logs can be downloaded from this Google Drive link. To evaluate a pre-trained model, please follow the steps listed below.

| Method | R1@0.3 | R1@0.5 | mean R1 | R5@0.3 | R5@0.5 | mean R5 |
|--------|--------|--------|---------|--------|--------|---------|
| SnAG (SlowFast + BERT) | 9.75  | 6.40  | 8.08  | 28.10 | 19.47 | 23.79 |
| SnAG (EgoVLP)          | 15.53 | 10.94 | 13.24 | 38.40 | 27.70 | 33.10 |

To Reproduce Our Results on MAD

Download Features and Annotations

Details: We use the official CLIP features from here. The features are extracted using CLIP ViT-L/14 with a frame rate of 5 fps. This gives one feature vector every 0.2 seconds.
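
To put the 5 fps feature rate in perspective, the quick arithmetic below (illustrative only) shows how long the input sequence gets for a feature-length movie, which is the regime SnAG's scalable inference targets.

```python
# Sequence-length arithmetic for MAD's CLIP features (illustrative only):
# 5 features per second, i.e. one every 0.2 s.
features_per_second = 5
movie_hours = 2.0
num_features = int(movie_hours * 3600 * features_per_second)
print(num_features)   # 36000 feature vectors for a 2-hour movie
```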

Unpack Features and Annotations

Training and Evaluation

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for MAD. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

| Method | R1@0.1 | R1@0.3 | R1@0.5 | R5@0.1 | R5@0.3 | R5@0.5 |
|--------|--------|--------|--------|--------|--------|--------|
| SnAG   | 10.35  | 8.51   | 5.47   | 24.40  | 20.30  | 13.41  |

To Reproduce Our Results on Charades-STA

Download Features and Annotations

Details: The C3D features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This gives one feature vector per 4/24 ~= 0.167 seconds. The I3D features are extracted using the I3D model pretrained on Kinetics 400, given clips of 16 frames with a frame rate of 24 fps and a stride of 4 frames. This gives one feature vector per 4/24 ~= 0.167 seconds.
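
The helper below (illustrative only, not part of the repo) turns the numbers above into a quick sanity check: with a 4-frame stride at 24 fps, the number of feature vectors for a video should be roughly duration * 24 / 4.

```python
# Sanity check for the Charades-STA feature grid described above (illustrative
# helper, not part of the repo): a 4-frame stride at 24 fps gives one feature
# every 4/24 ~= 0.167 s, so feature count ~= duration * 24 / 4.
def expected_num_features(duration_s, fps=24.0, stride_frames=4):
    return int(duration_s * fps / stride_frames)

print(expected_num_features(30.5))   # ~183 features for a 30.5 s video
```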

Unpack Features and Annotations

Training and Evaluation

[Optional] Evaluating Our Pre-trained Model

We also provide pre-trained models for Charades-STA. The model using C3D features with all training logs can be downloaded from this Google Drive link. The model using I3D features with all training logs can be downloaded from this Google Drive link. To evaluate a pre-trained model, please follow the steps listed below.

| Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
|--------|--------|--------|--------|--------|
| SnAG (C3D) | 51.75 | 33.33 | 90.83 | 65.56 |
| SnAG (I3D) | 65.19 | 46.32 | 93.04 | 73.12 |

To Reproduce Our Results on ActivityNet-Captions

Download Features and Annotations

Details: We use the official C3D features from here. The features are extracted using the C3D model pretrained on Sports1M, given clips of 16 frames and a stride of 8 frames. The frame rate is unknown. The feature dimension has been reduced from 4096 to 500 using PCA.
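
The released features already come PCA-reduced; the sketch below only illustrates the kind of 4096-to-500 reduction described above (assumed scikit-learn usage on random placeholder data, not the pipeline that produced the official features).

```python
# Illustrative 4096 -> 500 PCA reduction of C3D features (not the actual
# pipeline used to produce the released ActivityNet features).
import numpy as np
from sklearn.decomposition import PCA

c3d = np.random.randn(2000, 4096).astype(np.float32)   # placeholder for raw C3D features
reduced = PCA(n_components=500).fit_transform(c3d)
print(reduced.shape)   # (2000, 500)
```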

Unpack Features and Annotations

Training and Evaluation

[Optional] Evaluating Our Pre-trained Model

We also provide a pre-trained model for ActivityNet-Captions. The model with all training logs can be downloaded from this Google Drive link. To evaluate the pre-trained model, please follow the steps listed below.

| Method | R1@0.5 | R1@0.7 | R5@0.5 | R5@0.7 |
|--------|--------|--------|--------|--------|
| SnAG   | 47.44  | 29.89  | 82.60  | 63.29  |

Backup Links

Contact

Fangzhou Mu (fmu2@wisc.edu)

Reference

If you are using our code, please consider citing our paper.

@inproceedings{mu2024snag,
  title={{SnAG}: Scalable and Accurate Video Grounding},
  author={Mu, Fangzhou and Mo, Sicheng and Li, Yin},
  booktitle={CVPR},
  year={2024}
}