MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Overfitting in VideoMAE Model Fine-Tuning for Binary Classification on Home Camera Footage #129

Open tgcandido opened 3 weeks ago

tgcandido commented 3 weeks ago

Description: I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:

Dataset & Variations: I have two primary datasets:

Model & Configuration: The model classifies actions using 16 uniformly sampled frames per video. I've tried several base checkpoints, including small, base, and large, as well as variants fine-tuned on SSV2 and Kinetics. Hyperparameters tested: batch sizes of 2, 4, and 8; 4 to 16 epochs; learning rate of 5e-5.
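For reference, the uniform temporal sampling works along these lines (a minimal sketch; `uniform_frame_indices` is my own hypothetical helper, the HF notebook uses pytorchvideo's `UniformTemporalSubsample` for the same idea):

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across a clip.

    Hypothetical helper illustrating uniform temporal sampling: split the
    clip into `num_samples` equal segments and take each segment's midpoint.
    """
    if total_frames <= 0:
        raise ValueError("clip must contain at least one frame")
    step = total_frames / num_samples
    # Clamp to the last valid index; short clips yield repeated frames.
    return [min(int(step * i + step / 2), total_frames - 1)
            for i in range(num_samples)]
```

For example, a 160-frame clip yields indices 5, 15, ..., 155, one frame per 10-frame segment.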

I removed the RandomCrop transformation, since it often crops the person entirely out of the frame.

I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.

Problem: Despite these variations, the model overfits almost immediately. To rule out dataset-specific issues, I also trained on UCF101 and got results comparable to the Hugging Face VideoMAE Colab, so the training code itself seems fine.
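One thing I'm planning to try against the overfitting is freezing most of the backbone and training only the classification head first. A generic sketch (`freeze_all_but` is my own hypothetical helper; I'm assuming the HF `VideoMAEForVideoClassification` head is named `classifier`, which should be verified against `model.named_parameters()`):

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_keywords=("classifier",)) -> int:
    """Freeze every parameter whose name lacks one of `trainable_keywords`.

    Returns the count of parameters left trainable. With the default
    keyword, only parameters under a module named `classifier` (assumed
    to be the head) keep requires_grad=True.
    """
    trainable = 0
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
        if param.requires_grad:
            trainable += param.numel()
    return trainable
```

On a small dataset this shrinks the effective capacity being fit; the backbone can be unfrozen later with a lower learning rate.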

Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:

Thank you for any help or insights you can provide!