
MOFO: MOtion FOcused Self-Supervision for Video Understanding [Arxiv]

MOFO Framework

Mona Ahmadian, Frank Guerin, Andrew Gilbert
University of Surrey, Guildford, UK

📰 News

[2023.8.28] Code for automatic motion detection, MOFO self-supervision, and MOFO fine-tuning is available now!

✨ Highlights

🔥 MOFO: MOtion FOcused Self-Supervision for Video Understanding

MOFO (MOtion FOcused) is a novel self-supervised learning (SSL) method that focuses representation learning on the motion areas of a video for action recognition, and it provides evidence that such a motion-focused technique can exploit motion information to enhance motion-aware self-supervised video action recognition. MOFO automatically detects motion areas in videos and uses them to guide the self-supervision task. We use a tube-masking strategy with a masked autoencoder that randomly masks out a high proportion of the input sequence (90%); a fixed percentage of the masked tubes (75%) is drawn from inside the motion area and the remainder from outside. We further incorporate motion information into the fine-tuning step to emphasise motion in the downstream task.
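To make the masking rule concrete, here is a minimal sketch (not the authors' code) of how motion-focused tube masking could be implemented. All names are illustrative: we assume a boolean `motion_map` with one entry per spatial patch position, marking patches inside the detected motion area, with the mask shared across time as in VideoMAE-style tube masking.

```python
# A minimal sketch of motion-focused tube masking (illustrative, not the repo's code).
import numpy as np

def mofo_tube_mask(motion_map, mask_ratio=0.9, inside_ratio=0.75, rng=None):
    """Build a spatial tube mask: True = masked patch (shared across all frames).

    motion_map   : (H_patches, W_patches) bool array, True inside the motion area.
    mask_ratio   : fraction of all tubes to mask (90% in the paper).
    inside_ratio : fraction of masked tubes drawn from the motion area (75%).
    """
    rng = rng or np.random.default_rng()
    flat = motion_map.reshape(-1)
    inside = np.flatnonzero(flat)        # tube indices inside the motion area
    outside = np.flatnonzero(~flat)      # tube indices outside the motion area

    n_mask = int(round(mask_ratio * flat.size))
    n_inside = min(int(round(inside_ratio * n_mask)), inside.size)
    n_outside = min(n_mask - n_inside, outside.size)

    mask = np.zeros(flat.size, dtype=bool)
    mask[rng.choice(inside, n_inside, replace=False)] = True
    mask[rng.choice(outside, n_outside, replace=False)] = True
    return mask.reshape(motion_map.shape)
```

For a 224x224 input with 16x16 patches, `motion_map` would be 14x14, and the resulting spatial mask is repeated over the temporal dimension so that each masked tube spans the full clip.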

⚡️ A Motion-aware Baseline in SSVP

MOFO can serve as a motion-aware baseline for future research in self-supervised video pre-training, and the public code can guide many research directions.

MOFO's contributions are as follows:

High performance

MOFO works well on video datasets of different scales: it achieves 75.5% top-1 accuracy on Something-Something V2 and 74.2%, 68.1%, and 54.5% top-1 on Epic-Kitchens verb, noun, and action respectively, using only a ViT-Base backbone and no extra data.

🚀 Main Results

MOFO*: pretrained with our MOFO SSL, fine-tuned without MOFO.

MOFO**: pretrained without MOFO SSL, fine-tuned with MOFO.

MOFO†: pretrained with MOFO SSL and fine-tuned with MOFO.

✨ Something-Something V2

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MOFO* | no | ViT-B | 224x224 | 16x2x3 | 72.7 | 94.2 |
| MOFO** | no | ViT-B | 224x224 | 16x2x3 | 74.7 | 95.0 |
| MOFO† | no | ViT-B | 224x224 | 16x2x3 | 75.5 | 95.3 |

✨ Epic-Kitchens

| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Verb Top-1 | Noun Top-1 | Action Top-1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| MOFO* | no | ViT-B | 224x224 | 16x2x3 | 73.0 | 67.1 | 54.1 |
| MOFO** | no | ViT-B | 224x224 | 16x2x3 | 74.0 | 68.0 | 54.5 |
| MOFO† | no | ViT-B | 224x224 | 16x2x3 | 74.2 | 68.1 | 54.5 |

Pretrained Weights

| Method | Data | Backbone | Resolution | Training Steps | Link |
| :---: | :---: | :---: | :---: | :---: | :---: |
| MOFO* | SSV2 | ViT-B | 224x224 | 250 | Link |
| MOFO* | EPIC KITCHENS | ViT-B | 224x224 | 250 | Link |
| MOFO* | EPIC KITCHENS | ViT-B | 224x224 | 800 | Link |
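Once downloaded, a checkpoint can be inspected before fine-tuning. The sketch below is illustrative: the filename is hypothetical, and we assume the checkpoint nests its weights under a `'model'` key, following the VideoMAE convention this project builds on.

```python
# A minimal sketch (hypothetical filename) for inspecting a downloaded checkpoint.
import torch

ckpt = torch.load('mofo_pretrain_ssv2_vitb_250.pth', map_location='cpu')
state_dict = ckpt.get('model', ckpt)   # fall back to a flat state dict if not nested
n_params = sum(v.numel() for v in state_dict.values())
print(f'{len(state_dict)} tensors, {n_params / 1e6:.1f}M parameters')
```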

Finetune Weights

| Method | Data | Backbone | Training Steps | Fine-tuning Steps | Link |
| :---: | :---: | :---: | :---: | :---: | :---: |
| MOFO† | EPIC KITCHENS (action) | ViT-B | 200 | 100 | Link |
| MOFO† | SSV2 | ViT-B | 200 | 100 | Link |

Setup

We used Python 3.8.13 and PyTorch 1.12.0 to train and test our models.

You can install the required dependencies with the following commands:

```bash
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
DS_BUILD_OPS=1 pip install deepspeed
pip install timm==0.4.12
conda install -c conda-forge tensorboardx
pip install decord
conda install -c conda-forge einops
pip install opencv-python
pip install scipy
pip install pandas
conda install -c conda-forge mpi4py
pip install -U albumentations
```
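After installation, a quick import check (ours, not part of the repo) can confirm the environment matches the versions above:

```python
# Sanity-check that the key packages are importable; other versions may work
# but are untested here.
import torch, timm, decord, einops, cv2, scipy, pandas, albumentations

print('torch:', torch.__version__, '| CUDA available:', torch.cuda.is_available())
print('timm :', timm.__version__, '(the repo pins 0.4.12)')
```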

Data Preparation

Please follow the instructions in DATASET.md for data preparation.

Pre-training

The pre-training instruction is in PRETRAIN_BB.md.

Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE_BB.md.

Contact

Mona Ahmadian: m.ahmadian@surrey.ac.uk

Acknowledgements

This project is built upon VideoMAE.

✏️ Citation

If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper.