Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
Small Dataset: ~120 clips for quicker iteration.
Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds to help with recognition.
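For illustration, the masked variants follow this general idea (a minimal sketch, not my exact preprocessing; it assumes a person bounding box is already available from a separate detector):

```python
import cv2
import numpy as np

def mask_background(frame: np.ndarray, person_box: tuple, blur: bool = False) -> np.ndarray:
    """Black out (or blur) everything outside the person's bounding box.

    frame: HxWx3 BGR image; person_box: (x1, y1, x2, y2) from any person detector.
    """
    x1, y1, x2, y2 = person_box
    if blur:
        out = cv2.GaussianBlur(frame, (51, 51), 0)  # heavily blurred background copy
    else:
        out = np.zeros_like(frame)                  # fully blacked-out background copy
    out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]         # paste the person region back in
    return out
```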
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
I’ve tried several base checkpoints, including the small, base, and large variants, as well as models fine-tuned on SSV2 and Kinetics.
Hyperparameters tested (rough training configuration sketched after this list):
Batch sizes of 2, 4, and 8.
Epochs ranging from 4 to 16.
Learning rate set to 5e-5.
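Concretely, a run looks roughly like the notebook's `Trainer` setup with the values above plugged in (a sketch; `output_dir` and the dataset size are placeholders):

```python
from transformers import TrainingArguments

batch_size = 8             # also tried 2 and 4
num_epochs = 16            # also tried values down to 4
train_dataset_size = 3000  # ~3k clips in the full dataset

args = TrainingArguments(
    output_dir="videomae-finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    remove_unused_columns=False,
    # pytorchvideo's LabeledVideoDataset is iterable, so the notebook sets
    # max_steps instead of num_train_epochs:
    max_steps=(train_dataset_size // batch_size) * num_epochs,
)
```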
I removed the RandomCrop transformation since it can crop the person out of the frame entirely (see the transform sketch below).
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
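For reference, with RandomCrop dropped the transform pipeline looks roughly like this (a sketch following the notebook's pytorchvideo transforms; the mean/std values shown are placeholders and in practice come from the model's image processor):

```python
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    UniformTemporalSubsample,
)
from torchvision.transforms import Compose, Lambda, Resize

num_frames = 16                  # frames sampled uniformly from each clip
mean = [0.485, 0.456, 0.406]     # placeholder; use image_processor.image_mean
std = [0.229, 0.224, 0.225]      # placeholder; use image_processor.image_std

# RandomShortSideScale + RandomCrop are replaced with a plain Resize so the
# person is never cropped out of the frame.
train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(num_frames),  # 16 uniformly spaced frames
            Lambda(lambda x: x / 255.0),           # scale pixels to [0, 1]
            Normalize(mean, std),
            Resize((224, 224)),                    # fixed resize to model input size
        ]
    ),
)
```

With RandomCrop gone this is essentially the notebook's evaluation transform applied at training time as well, which also means there is very little augmentation left.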
Problem: Despite all of these variations, the model overfits immediately. To rule out dataset-specific issues, I also trained on the UCF101 dataset and got results comparable to the Hugging Face VideoMAE colab, so the training code itself seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
Additional hyperparameter adjustments.
Potential model architecture changes (if applicable).
Dataset augmentation techniques that might improve generalization.
Thank you for any help or insights you can provide!