Apache License 2.0

Mug-STAN


Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring" and "Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding"

The original code is based on mmcv 1.4. Because all of its data-processing pipelines were built on private I/O, the original training code cannot be open-sourced. We have therefore reproduced the results with mmcv 2.0.

Pretrained Weights:

Getting Started

Installation

Clone our repository, then create a Python environment and activate it with the following commands:

git clone https://github.com/farewellthree/STAN.git
cd STAN
conda create --name stan python=3.10
conda activate stan
bash install.sh

Prepare Datasets

You can follow CLIP4clip to acquire the videos and annotations.

Once the dataset is ready, set the path in each config. Taking stan-b/32 on MSRVTT as an example, set the video path here at Line 25.

Since there may be multiple annotation versions for a dataset, our code may not be compatible with yours. In that case, simply modify the corresponding dataset class in video_text_dataset.py so that it outputs the paths of all videos along with their corresponding captions.
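A minimal sketch of what such a dataset class needs to produce is below. The annotation file name and the `video`/`caption` field names are assumptions for illustration; the real classes in video_text_dataset.py follow the mmcv/mmengine dataset interface, but the essential job is the same: turn your annotation file into video-path/caption pairs.

```python
import json
import os


class VideoTextAnnotations:
    """Sketch: map an annotation file to (video path, caption) pairs.

    Assumes a JSON list of records such as
    {"video": "video0.mp4", "caption": "a man is talking"}.
    Adapt the keys to your own annotation version.
    """

    def __init__(self, ann_file: str, video_root: str):
        with open(ann_file) as f:
            records = json.load(f)
        # Each item carries the absolute video path and its caption,
        # which is all the training pipeline needs from the dataset.
        self.data_list = [
            {
                "video_path": os.path.join(video_root, r["video"]),
                "caption": r["caption"],
            }
            for r in records
        ]

    def __len__(self):
        return len(self.data_list)

    def __getitem__(self, idx):
        return self.data_list[idx]
```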

Training

STAN

To train stan-b/32 on MSRVTT, run

torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/stan_msrvtt_b32_hf.py --launcher pytorch

The same command pattern applies to other datasets and model scales.
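For example, adapting the command to fewer GPUs only requires changing the standard torchrun flags and pointing at the corresponding config; the config name below is hypothetical, for illustration only:

```shell
# Hypothetical: 4 GPUs and a different config file.
# Only --nproc_per_node and the config path change;
# stan_msvd_b16_hf.py is an illustrative name, not a confirmed file.
torchrun --nproc_per_node=4 --master_port=20002 \
    tools/train.py configs/exp/stan/stan_msvd_b16_hf.py \
    --launcher pytorch
```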

Mug-STAN

To train mug-stan-b/32 on MSRVTT, run

torchrun --nproc_per_node=8 --master_port=20001 tools/train.py configs/exp/stan/mugstan_msrvt_b32_hf.py --launcher pytorch

The same command pattern applies to other datasets and model scales.

Post-Pretraining

To post-pretrain mug-stan-b/32 on WebVid-10M, run

torchrun --nproc_per_node=16 --master_port=20001 tools/train.py configs/exp/stan/mugstan_webvid10m_b32_pretrain.py --launcher pytorch
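Since 16 processes typically span more than one machine, the same job can be launched across two 8-GPU nodes with torchrun's standard multi-node flags. This is a hedged sketch using torchrun's static launch options; `MASTER_ADDR` and `NODE_RANK` are placeholders you must set per node (rank 0 on the first node, 1 on the second):

```shell
# Run this on each of the two nodes, with NODE_RANK=0 or 1
# and MASTER_ADDR set to the first node's address.
torchrun --nnodes=2 --node_rank=$NODE_RANK --nproc_per_node=8 \
    --master_addr=$MASTER_ADDR --master_port=20001 \
    tools/train.py configs/exp/stan/mugstan_webvid10m_b32_pretrain.py \
    --launcher pytorch
```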

Citation

If you find the code useful for your research, please consider citing our paper:

@inproceedings{liu2023revisiting,
  title={Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring},
  author={Liu, Ruyang and Huang, Jingjia and Li, Ge and Feng, Jiashi and Wu, Xinglong and Li, Thomas H},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2023}
}