Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
Small Dataset: ~120 clips for quicker iteration.
Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds to help with recognition.
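For illustration, the masked variants follow this general idea (a minimal sketch, not my exact preprocessing; it assumes a person bounding box is already available from a separate detector):

```python
import cv2
import numpy as np

def mask_background(frame: np.ndarray, person_box: tuple, blur: bool = False) -> np.ndarray:
    """Black out (or blur) everything outside the person's bounding box.

    frame: HxWx3 BGR image; person_box: (x1, y1, x2, y2) from any person detector.
    """
    x1, y1, x2, y2 = person_box
    if blur:
        out = cv2.GaussianBlur(frame, (51, 51), 0)  # heavily blurred background copy
    else:
        out = np.zeros_like(frame)                  # fully blacked-out background copy
    out[y1:y2, x1:x2] = frame[y1:y2, x1:x2]         # paste the person region back in
    return out
```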
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
I’ve tried several base checkpoints, including the small, base, and large variants, as well as models fine-tuned on SSV2 and Kinetics.
Hyperparameters tested (rough training configuration sketched after this list):
Batch sizes of 2, 4, and 8.
Epochs ranging from 4 to 16.
Learning rate set to 5e-5.
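Concretely, a run looks roughly like the notebook's `Trainer` setup with the values above plugged in (a sketch; `output_dir` and the dataset size are placeholders):

```python
from transformers import TrainingArguments

batch_size = 8             # also tried 2 and 4
num_epochs = 16            # also tried values down to 4
train_dataset_size = 3000  # ~3k clips in the full dataset

args = TrainingArguments(
    output_dir="videomae-finetuned",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    remove_unused_columns=False,
    # pytorchvideo's LabeledVideoDataset is iterable, so the notebook sets
    # max_steps instead of num_train_epochs:
    max_steps=(train_dataset_size // batch_size) * num_epochs,
)
```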
I removed the RandomCrop transformation since it can crop the person out of the frame entirely (see the transform sketch below).
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
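For reference, with RandomCrop dropped the transform pipeline looks roughly like this (a sketch following the notebook's pytorchvideo transforms; the mean/std values shown are placeholders and in practice come from the model's image processor):

```python
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    Normalize,
    UniformTemporalSubsample,
)
from torchvision.transforms import Compose, Lambda, Resize

num_frames = 16                  # frames sampled uniformly from each clip
mean = [0.485, 0.456, 0.406]     # placeholder; use image_processor.image_mean
std = [0.229, 0.224, 0.225]      # placeholder; use image_processor.image_std

# RandomShortSideScale + RandomCrop are replaced with a plain Resize so the
# person is never cropped out of the frame.
train_transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(num_frames),  # 16 uniformly spaced frames
            Lambda(lambda x: x / 255.0),           # scale pixels to [0, 1]
            Normalize(mean, std),
            Resize((224, 224)),                    # fixed resize to model input size
        ]
    ),
)
```

With RandomCrop gone this is essentially the notebook's evaluation transform applied at training time as well, which also means there is very little augmentation left.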
Problem: Despite all of these variations, the model overfits immediately. To rule out dataset-specific issues, I also trained on the UCF101 dataset and got results comparable to the Hugging Face VideoMAE colab, so the training code itself seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
Additional hyperparameter adjustments.
Potential model architecture changes (if applicable).
Dataset augmentation techniques that might improve generalization.
Thank you for any help or insights you can provide!