MOFO: MOtion FOcused Self-Supervision for Video Understanding
Mona Ahmadian, [Frank Guerin](), [Andrew Gilbert]()
University of Surrey, Guildford, UK
[2023.8.28] Code for automatic motion detection, MOFO self-supervision, and MOFO finetuning is now available!
MOFO (MOtion FOcused) is a novel self-supervised learning (SSL) method that focuses representation learning on the motion area of a video for action recognition. It provides evidence that such a motion-focused technique can effectively exploit motion information to improve self-supervised video action recognition. MOFO automatically detects motion areas in videos and uses them to guide the self-supervision task. We use a tube masking strategy with a masked autoencoder that randomly masks out a high proportion (90%) of the input sequence; we force a fixed percentage (75%) of the tubes inside the motion area to be masked, and the remaining masked tubes are drawn from outside it. We further incorporate motion information into the finetuning step to emphasise motion in the downstream task.
MOFO can serve as a motion-aware baseline for future research in self-supervised video pre-training, and we hope the public code will support further research in this direction.
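The masking strategy described above can be sketched in a few lines of framework-free Python. This is an illustrative sketch, not the repository's implementation: the function names, the `(top, left, bottom, right)` box format in patch coordinates, and the reading of the two ratios (75% of the patches inside the motion box are masked, then outside patches are masked until 90% of all patches are hidden) are assumptions made for the example.

```python
import random


def motion_focused_mask(grid_h, grid_w, box, mask_ratio=0.9, inside_ratio=0.75, seed=None):
    """Return a grid_h x grid_w grid of booleans (True = masked patch).

    `box` is a hypothetical motion box (top, left, bottom, right) in patch
    coordinates, with bottom/right exclusive.
    """
    rng = random.Random(seed)
    top, left, bottom, right = box

    # Split patch positions into those inside and outside the motion box.
    inside, outside = [], []
    for r in range(grid_h):
        for c in range(grid_w):
            is_inside = top <= r < bottom and left <= c < right
            (inside if is_inside else outside).append((r, c))

    # Overall budget of masked patches, then the share taken from inside.
    n_total = round(mask_ratio * grid_h * grid_w)
    n_in = min(round(inside_ratio * len(inside)), n_total)
    n_out = min(n_total - n_in, len(outside))
    # If the outside region is too small, top up from inside the box.
    n_in = n_total - n_out

    masked = set(rng.sample(inside, n_in)) | set(rng.sample(outside, n_out))
    return [[(r, c) in masked for c in range(grid_w)] for r in range(grid_h)]


def expand_to_tubes(mask_2d, num_time_slots):
    """Repeat the same spatial mask over time, so each masked patch is a tube."""
    return [[row[:] for row in mask_2d] for _ in range(num_time_slots)]


# Example: a 14x14 patch grid (224x224 input with 16x16 patches, assumed)
# and 8 temporal slots (16 frames with a temporal patch size of 2, assumed).
mask = motion_focused_mask(14, 14, box=(4, 4, 10, 10), seed=0)
tubes = expand_to_tubes(mask, num_time_slots=8)
print(sum(cell for row in mask for cell in row), "of", 14 * 14, "patches masked")
```

Repeating one spatial mask across all time slots is what makes this a tube mask: a patch hidden in one frame is hidden in every frame, so the model cannot recover it by copying from a neighbouring frame.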
MOFO's contributions are as follows:
- Automatic motion area detection using motion maps derived from optical flow, but invariant to camera motion.
- A motion-aware SSL approach that focuses masking on the motion area in the video, using our proposed automatic motion detection algorithm.
- A motion-focused finetuning technique that further intensifies the focus on the motion area for the action recognition task.
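To make the idea of a camera-invariant motion area concrete, here is a toy sketch in plain Python. It is not the repository's detection algorithm: it assumes a precomputed optical-flow grid of `(dx, dy)` vectors, estimates camera motion as the median flow (an assumption chosen for simplicity), and returns the bounding box of cells whose residual motion exceeds a hypothetical `threshold`.

```python
from statistics import median


def motion_box(flow, threshold=1.0):
    """Return (top, left, bottom, right) of the moving region, or None.

    `flow` is a 2D grid (list of rows) of (dx, dy) flow vectors.
    bottom/right are exclusive.
    """
    # Assumption: the dominant (median) flow approximates camera motion.
    cam_dx = median(dx for row in flow for dx, _ in row)
    cam_dy = median(dy for row in flow for _, dy in row)

    # Keep cells whose flow differs noticeably from the camera motion.
    moving = [
        (r, c)
        for r, row in enumerate(flow)
        for c, (dx, dy) in enumerate(row)
        if ((dx - cam_dx) ** 2 + (dy - cam_dy) ** 2) ** 0.5 > threshold
    ]
    if not moving:
        return None

    rows = [r for r, _ in moving]
    cols = [c for _, c in moving]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


# Example: a camera pan of (2, 0) everywhere, plus an object moving
# independently in rows 2-3, columns 1-4.
flow = [[(2.0, 0.0) for _ in range(6)] for _ in range(6)]
for r in range(2, 4):
    for c in range(1, 5):
        flow[r][c] = (5.0, 1.0)
print(motion_box(flow))
```

Because the pan is subtracted before thresholding, a frame in which only the camera moves yields no motion box at all, which is the property the first contribution asks for.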
MOFO works well on video datasets of different scales: it achieves 75.5% top-1 accuracy on Something-Something V2 and 74.2%, 68.1%, and 54.5% on Epic-Kitchens verb, noun, and action respectively, using only ViT-Base backbones and no extra data.
MOFO* uses MOFO SSL pretraining with non-MOFO finetuning.
MOFO** uses non-MOFO SSL pretraining with MOFO finetuning.
MOFO† uses both MOFO SSL pretraining and MOFO finetuning.
Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
---|---|---|---|---|---|---|
MOFO* | no | ViT-B | 224x224 | 16x2x3 | 72.7 | 94.2 |
MOFO** | no | ViT-B | 224x224 | 16x2x3 | 74.7 | 95.0 |
MOFO† | no | ViT-B | 224x224 | 16x2x3 | 75.5 | 95.3 |
Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Verb Top-1 | Noun Top-1 | Action Top-1 |
---|---|---|---|---|---|---|---|
MOFO* | no | ViT-B | 224x224 | 16x2x3 | 73.0 | 67.1 | 54.1 |
MOFO** | no | ViT-B | 224x224 | 16x2x3 | 74.0 | 68.0 | 54.5 |
MOFO† | no | ViT-B | 224x224 | 16x2x3 | 74.2 | 68.1 | 54.5 |
Method | Data | Backbone | Resolution | Training Steps | Link |
---|---|---|---|---|---|
MOFO* | SSV2 | ViT-B | 224x224 | 250 | Link |
MOFO* | EPIC KITCHENS | ViT-B | 224x224 | 250 | Link |
MOFO* | EPIC KITCHENS | ViT-B | 224x224 | 800 | Link |
Method | Data | Backbone | Training Steps | Finetuning step | Link |
---|---|---|---|---|---|
MOFO† | EPIC KITCHENS (action) | ViT-B | 200 | 100 | Link |
MOFO† | SSV2 | ViT-B | 200 | 100 | Link |
We used Python 3.8.13 and PyTorch 1.12.0 to train and test our models.
You can install (or update) the dependencies required by MOFO with the following commands:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
DS_BUILD_OPS=1 pip install deepspeed
pip install timm==0.4.12
conda install -c conda-forge tensorboardx
pip install decord
conda install -c conda-forge einops
pip install opencv-python
pip install scipy
pip install pandas
conda install -c conda-forge mpi4py
pip install -U albumentations
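After running the commands above, a quick import check can confirm that the environment is complete. The helper below is a hypothetical convenience, not part of the repository; the module names are the usual import names of the packages listed above (for example, `opencv-python` is imported as `cv2`).

```python
import importlib.util

# Import names corresponding to the packages installed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "deepspeed", "timm", "tensorboardX", "decord",
    "einops", "cv2", "scipy", "pandas", "mpi4py", "albumentations",
]


def find_missing(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only tests whether each module can be located; it does not verify versions such as `timm==0.4.12`.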
Please follow the instructions in DATASET.md for data preparation.
Pre-training instructions are in PRETRAIN_BB.md.
Fine-tuning instructions are in FINETUNE_BB.md.
Mona Ahmadian: m.ahmadian@surrey.ac.uk
This project is built upon VideoMAE.
If you think this project is helpful, please feel free to leave a star⭐️ and cite our paper.