Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition [ICCV 2023]

Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

*Joint first authors

Abstract: Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably against the state-of-the-art transformer-based models for video recognition on three large-scale datasets (Kinetics-400, Kinetics-600, and SS-v2) at a lower computational cost.

News
Overview
Visualization
Environment Setup
Dataset Preparation
Model Zoo
Evaluation
Training
Citation
Acknowledgements

:rocket: News

(July 13, 2023)
- Training and evaluation code for Video-FocalNets, along with pretrained models, has been released.

Overview

Overall Architecture

(a) The overall architecture of Video-FocalNets: A four-stage architecture, with each stage comprising a patch embedding and a number of Video-FocalNet blocks. (b) Single Video-FocalNet block: Structured like a transformer block, but with self-attention replaced by Spatio-Temporal Focal Modulation.


The Spatio-Temporal Focal Modulation layer: A spatio-temporal focal modulation block that independently models the spatial and temporal information.

Comparison for Top-1 Accuracy vs GFlops/view on Kinetics-400.
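
As a rough guide to what the Spatio-Temporal Focal Modulation layer computes, the sketch below extends the focal modulation formulation of FocalNets with one modulator per branch. The symbols ($q$, $h_s$, $h_t$, $g$, $z$, $L$) follow the FocalNets notation and are assumptions here, since this README does not define them; the paper has the authoritative definition.

```latex
y_i = q(x_i)
      \odot h_s\!\left( \sum_{\ell=1}^{L+1} g_{i,s}^{\ell} \cdot z_{i,s}^{\ell} \right)
      \odot h_t\!\left( \sum_{\ell=1}^{L+1} g_{i,t}^{\ell} \cdot z_{i,t}^{\ell} \right)
```

Here $x_i$ is the input token at position $i$, $q$ is a query projection, $z^{\ell}$ are context features aggregated at focal level $\ell$ by convolutions (branch $s$ over the spatial dimensions, branch $t$ along the time axis), $g^{\ell}$ are the corresponding gates, and $h_s$, $h_t$ are projection layers. Context is aggregated first and only then interacts with the query through an element-wise product, which is the reversed order relative to self-attention that the abstract refers to.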

Visualization: First and Last layer Spatio-Temporal Modulator

Visualization Cutting Apple

Visualization Scuba Diving

Visualization Threading Needle

Visualization Walking the Dog

Visualization Water Skiing

Environment Setup

Please follow INSTALL.md for installation.

Dataset Preparation

Please follow DATA.md for data preparation.

Model Zoo

Kinetics-400

Model	Depth	Dim	Kernels	Top-1	Download
Video-FocalNet-T	[2,2,6,2]	96	[3,5]	79.8	ckpt
Video-FocalNet-S	[2,2,18,2]	96	[3,5]	81.4	ckpt
Video-FocalNet-B	[2,2,18,2]	128	[3,5]	83.6	ckpt
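
The columns of the table can be read as a small configuration record. The sketch below is illustrative only and is not the repository's config code. It assumes two conventions borrowed from FocalNets that this README does not state: that the channel width doubles at each of the four stages, and that "Kernels" lists the convolution kernel size at each focal level, derived as `focal_factor * level + focal_window`. The names `Variant`, `focal_kernels`, and `stage_dims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    depths: tuple   # number of Video-FocalNet blocks in each of the four stages
    dim: int        # channel width of the first stage
    kernels: tuple  # kernel sizes, as listed in the "Kernels" column


# Values copied from the Kinetics-400 table above.
VARIANTS = [
    Variant("Video-FocalNet-T", (2, 2, 6, 2), 96, (3, 5)),
    Variant("Video-FocalNet-S", (2, 2, 18, 2), 96, (3, 5)),
    Variant("Video-FocalNet-B", (2, 2, 18, 2), 128, (3, 5)),
]


def focal_kernels(focal_window=3, focal_levels=2, focal_factor=2):
    """Kernel size per focal level (assumed FocalNets convention)."""
    return [focal_factor * level + focal_window for level in range(focal_levels)]


def stage_dims(dim, num_stages=4):
    """Channel width per stage, assuming the width doubles at every stage."""
    return [dim * 2 ** stage for stage in range(num_stages)]


for variant in VARIANTS:
    print(variant.name, "blocks:", sum(variant.depths), "widths:", stage_dims(variant.dim))
```

Under these assumptions, the `[3,5]` entry corresponds to two focal levels with a base window of 3.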

Kinetics-600

Model	Depth	Dim	Kernels	Top-1	Download
Video-FocalNet-B	[2,2,18,2]	128	[3,5]	86.7	ckpt

Something-Something-v2

Model	Depth	Dim	Kernels	Top-1	Download
Video-FocalNet-B	[2,2,18,2]	128	[3,5]	71.1	ckpt

Diving-48

Model	Depth	Dim	Kernels	Top-1	Download
Video-FocalNet-B	[2,2,18,2]	128	[3,5]	90.8	ckpt

ActivityNet-v1.3

Model	Depth	Dim	Kernels	Top-1	Download
Video-FocalNet-B	[2,2,18,2]	128	[3,5]	89.8	ckpt

Evaluation

To evaluate pre-trained Video-FocalNets on your dataset:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use>  main.py  --eval \
--cfg <config-file> --resume <checkpoint> \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

For example, to evaluate Video-FocalNet-B on Kinetics-400 with a single GPU:

python -m torch.distributed.launch --nproc_per_node 1  main.py  --eval \
--cfg configs/kinetics400/video_focalnet_base.yaml --resume video-focalnet_base_k400.pth \
--opts DATA.NUM_FRAMES 8 DATA.BATCH_SIZE 8 TEST.NUM_CLIP 4 TEST.NUM_CROP 3 DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv
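
When evaluation runs are launched from a script, the same command can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper `build_eval_command` is hypothetical, and only the flags and option keys shown in the commands above are used. The `views_per_video` line assumes, as is common in video recognition codebases, that each video is scored on `TEST.NUM_CLIP` temporal clips times `TEST.NUM_CROP` spatial crops.

```python
import shlex


def build_eval_command(cfg, checkpoint, num_gpus=1, opts=None):
    """Return the evaluation command as an argument list (hypothetical helper)."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "main.py", "--eval",
        "--cfg", cfg,
        "--resume", checkpoint,
    ]
    if opts:
        # Everything after --opts is a flat sequence of KEY VALUE pairs.
        cmd.append("--opts")
        for key, value in opts.items():
            cmd += [key, str(value)]
    return cmd


opts = {
    "DATA.NUM_FRAMES": 8,
    "DATA.BATCH_SIZE": 8,
    "TEST.NUM_CLIP": 4,
    "TEST.NUM_CROP": 3,
    "DATA.ROOT": "path/to/root",
    "DATA.TRAIN_FILE": "train.csv",
    "DATA.VAL_FILE": "val.csv",
}

cmd = build_eval_command(
    "configs/kinetics400/video_focalnet_base.yaml",
    "video-focalnet_base_k400.pth",
    num_gpus=1,
    opts=opts,
)

# Assumed meaning of the two TEST.* keys: clips x crops views per video.
views_per_video = opts["TEST.NUM_CLIP"] * opts["TEST.NUM_CROP"]

print(shlex.join(cmd))
print("views per video:", views_per_video)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.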

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. In our experience and sanity checks, top-1 accuracy varies randomly by about ±0.3% when testing on different machines.
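
To check whether a reproduced number falls inside that variation, a small comparison against the Model Zoo values can help. This is a sketch for convenience, not a script shipped with the repository; the function name `matches_reported` and the default tolerance argument are assumptions, while the accuracies are copied from the tables in this README.

```python
# Top-1 accuracies copied from the Model Zoo tables in this README.
REPORTED_TOP1 = {
    ("Kinetics-400", "Video-FocalNet-T"): 79.8,
    ("Kinetics-400", "Video-FocalNet-S"): 81.4,
    ("Kinetics-400", "Video-FocalNet-B"): 83.6,
    ("Kinetics-600", "Video-FocalNet-B"): 86.7,
    ("Something-Something-v2", "Video-FocalNet-B"): 71.1,
    ("Diving-48", "Video-FocalNet-B"): 90.8,
    ("ActivityNet-v1.3", "Video-FocalNet-B"): 89.8,
}


def matches_reported(dataset, model, measured_top1, tolerance=0.3):
    """True if the measured top-1 is within `tolerance` of the reported value."""
    reported = REPORTED_TOP1[(dataset, model)]
    # The small epsilon guards against floating-point noise at the boundary.
    return abs(measured_top1 - reported) <= tolerance + 1e-9


print(matches_reported("Kinetics-400", "Video-FocalNet-B", 83.4))
print(matches_reported("Kinetics-400", "Video-FocalNet-B", 82.9))
```

A result outside the tolerance is more likely to indicate a data or configuration mismatch than machine-to-machine noise.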

Additionally, TRAIN.PRETRAINED_PATH can be set (either in the config file or in the bash script) to initialize the weights from a pretrained model. To initialize from ImageNet-1K weights, refer to the FocalNets repository and download FocalNet-T-SRF, FocalNet-S-SRF, or FocalNet-B-SRF to initialize Video-FocalNet-T, Video-FocalNet-S, or Video-FocalNet-B, respectively. Alternatively, one of the provided pretrained Video-FocalNet models can be used to initialize the weights.
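
The pairing between video variants and ImageNet-1K checkpoints can be kept in a lookup table so a launch script picks the right initialization. The sketch below is illustrative: `pretrained_opts` is a hypothetical helper, the checkpoint path is a placeholder for wherever the file was saved, and passing TRAIN.PRETRAINED_PATH through `--opts` like the DATA.* keys is an assumption.

```python
# Pairing stated in this README: video variant -> FocalNet ImageNet-1K model.
IMAGENET_INIT = {
    "Video-FocalNet-T": "FocalNet-T-SRF",
    "Video-FocalNet-S": "FocalNet-S-SRF",
    "Video-FocalNet-B": "FocalNet-B-SRF",
}


def pretrained_opts(checkpoint_path):
    """KEY VALUE pair to append after --opts (assumed to be accepted there)."""
    return ["TRAIN.PRETRAINED_PATH", checkpoint_path]


variant = "Video-FocalNet-B"
print(f"{variant} is initialized from {IMAGENET_INIT[variant]}")

# Placeholder path: replace with the location of the downloaded checkpoint.
extra_opts = pretrained_opts("path/to/downloaded_checkpoint.pth")
print(" ".join(extra_opts))
```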

Training

To train a Video-FocalNet on a video dataset from scratch, run:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus-to-use>  main.py \
--cfg <config-file> --batch-size <batch-size-per-gpu> --output <output-directory> \
--opts DATA.ROOT path/to/root DATA.TRAIN_FILE train.csv DATA.VAL_FILE val.csv

Alternatively, the DATA.ROOT, DATA.TRAIN_FILE, and DATA.VAL_FILE paths can be set directly in the config files provided in the configs directory. We also provide bash scripts to train Video-FocalNets on various datasets in the scripts directory.
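
Because `--batch-size` is specified per GPU, the global batch size depends on how many processes are launched. The sketch below derives the per-GPU value from a target global batch size and assembles the training command shown above. It assumes one process per GPU and no gradient accumulation; the helpers `per_gpu_batch_size` and `build_train_command`, the global batch size of 64, and the `output/k400_base` directory are illustrative choices, not recommended settings.

```python
def per_gpu_batch_size(global_batch_size, num_gpus):
    """Split a target global batch evenly across GPUs (no accumulation assumed)."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if global_batch_size % num_gpus != 0:
        raise ValueError("global batch size must be divisible by the GPU count")
    return global_batch_size // num_gpus


def build_train_command(cfg, output_dir, num_gpus, batch_size, data_opts):
    """Return the training command as an argument list (hypothetical helper)."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "main.py",
        "--cfg", cfg,
        "--batch-size", str(batch_size),
        "--output", output_dir,
        "--opts",
    ]
    for key, value in data_opts.items():
        cmd += [key, str(value)]
    return cmd


num_gpus = 8
batch_size = per_gpu_batch_size(64, num_gpus)  # 64 is an illustrative target

train_cmd = build_train_command(
    "configs/kinetics400/video_focalnet_base.yaml",
    "output/k400_base",  # placeholder output directory
    num_gpus,
    batch_size,
    {
        "DATA.ROOT": "path/to/root",
        "DATA.TRAIN_FILE": "train.csv",
        "DATA.VAL_FILE": "val.csv",
    },
)
print(" ".join(train_cmd))
```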

Citation

If you find our work, this repository, or the pretrained models useful, please consider giving the repository a star :star: and citing our paper.

@InProceedings{Wasim_2023_ICCV,
    author    = {Wasim, Syed Talal and Khattak, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad Shahbaz},
    title     = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2023},
}

Contact

If you have any questions, please create an issue on this repository or contact us at syed.wasim@mbzuai.ac.ae or uzair.khattak@mbzuai.ac.ae.

Acknowledgements

Our code is based on the FocalNets, XCLIP, and UniFormer repositories. We thank the authors for releasing their code. If you use our model, please consider citing these works as well.
