OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

Pre-train an action recognition VideoMAE model on UCF101 #37

Closed Mr-MeerMoazzam closed 6 months ago

Mr-MeerMoazzam commented 1 year ago

Hi! I'm interested in pre-training an action recognition model with VideoMAE V2 on the UCF101 dataset, but I'm having some difficulties and would appreciate your assistance.

Data preparation: for pre-training, the docs say that each line of the annotation file for video data should look like:

video_path 0 -1

What do the 0 and -1 mean? What are the next steps to pre-train the model on UCF101, and what good practices should be followed before starting pre-training?

congee524 commented 1 year ago

-1 means that the file at video_path is a video. 0 is the start_idx; it's meaningless when the file is a video, just a placeholder.

You just need to change video_path to the path of each UCF101 video.
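
For example, here is a minimal sketch of building that annotation file for UCF101. Only the `video_path 0 -1` line format comes from the data-preparation doc; the UCF101 directory layout, file extensions, output file name, and space-separated delimiter below are assumptions to double-check against the doc.

```python
# Minimal sketch: write one "video_path 0 -1" line per UCF101 clip.
# The root layout (<root>/<class>/<clip>.avi) and output name are assumptions;
# only the three-column line format is from the pre-train data doc.
import os

ucf101_root = "/data/UCF101/videos"   # hypothetical path to the extracted videos
output_csv = "ucf101_pretrain.csv"    # hypothetical annotation file name

with open(output_csv, "w") as f:
    for dirpath, _, filenames in os.walk(ucf101_root):
        for name in sorted(filenames):
            if name.endswith((".avi", ".mp4")):
                path = os.path.join(dirpath, name)
                # 0 = start_idx placeholder, -1 = "this entry is a whole video file"
                f.write(f"{path} 0 -1\n")
```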

Mr-MeerMoazzam commented 1 year ago

> -1 means that the file on the video_path is a video. 0 represents the label, but labels are useless in pre-train, so we just place 0 on the position as the placeholder.
>
> you just need to modify the video_path to the path of UCF101 videos.

Thanks for the quick response! I really appreciate it. I have some more questions about pre-training the model: what are the next steps after organizing the paths into the CSV?

Mr-MeerMoazzam commented 1 year ago

> -1 means that the file on the video_path is a video. 0 is the start_idx, it's meaningless when the file is video, just a placeholder.
>
> you just need to modify the video_path to the path of UCF101 videos.

You mean we don't need any labels to pre-train the model?

congee524 commented 1 year ago

Start training. You can see the pre-train doc or follow the instructions in VideoMAE; we use a similar architecture to VideoMAE.
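
As a rough, unverified sketch of what a single-node launch might look like: the entry-point name `run_mae_pretraining.py` and every flag and value below are assumptions to be checked against the pre-train doc, not the documented command.

```python
# Hypothetical single-node launch. Script name, flags, and values are
# placeholders; verify them against the repo's pre-train doc before use.
import subprocess

cmd = [
    "torchrun", "--nproc_per_node=8",       # one worker process per GPU
    "run_mae_pretraining.py",               # assumed pre-training entry point
    "--data_path", "ucf101_pretrain.csv",   # annotation file from the sketch above
    "--mask_ratio", "0.9",                  # assumed flag for the encoder mask ratio
    "--batch_size", "32",
    "--epochs", "800",
    "--output_dir", "work_dir/ucf101_pretrain",
]
subprocess.run(cmd, check=True)
```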

congee524 commented 1 year ago

> -1 means that the file on the video_path is a video. 0 is the start_idx, it's meaningless when the file is video, just a placeholder. you just need to modify the video_path to the path of UCF101 videos.
>
> You mean we don't need any labels to pre-train the model?

Sure. MAE is a self-supervised method.
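
In other words, the training target is the video itself: most space-time patches are masked and the model learns to reconstruct them, so no class labels ever enter the loss. A toy sketch of the idea follows; the shapes, mask ratio, and the tiny linear "encoder"/"decoder" are illustrative stand-ins, not the VideoMAE V2 architecture.

```python
# Toy illustration of masked-autoencoder pre-training: the loss is a
# reconstruction error on masked patches, so no class labels are needed.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, N, D = 2, 1568, 768            # batch, space-time patch tokens, token dim
mask_ratio = 0.9                  # VideoMAE-style high masking ratio
num_masked = int(N * mask_ratio)

patches = torch.randn(B, N, D)    # stand-in for tokenized video patches

ids = torch.rand(B, N).argsort(dim=1)        # random permutation per sample
mask = torch.zeros(B, N, dtype=torch.bool)
rows = torch.arange(B).unsqueeze(1)
mask[rows, ids[:, :num_masked]] = True       # True = masked, False = visible

encoder = nn.Linear(D, D)         # stand-in: real encoder sees only visible tokens
decoder = nn.Linear(D, D)         # stand-in: real decoder is a small transformer

visible = patches[~mask].reshape(B, N - num_masked, D)
latent = encoder(visible)

# Real models decode each masked position using mask tokens + positions;
# a pooled latent is broadcast here only to keep the sketch short.
pred = decoder(latent.mean(dim=1, keepdim=True)).expand(-1, num_masked, -1)
target = patches[mask].reshape(B, num_masked, D)

loss = F.mse_loss(pred, target)   # purely self-supervised reconstruction loss
loss.backward()
print(f"toy reconstruction loss: {loss.item():.4f}")
```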

Mr-MeerMoazzam commented 1 year ago

I have a machine with the following specifications [screenshot of system specs] and an RTX 3060. Is this system capable of pre-training on UCF101?

congee524 commented 1 year ago

I think not. You may need at least 8 GPUs.

Mr-MeerMoazzam commented 1 year ago

Can I fine-tune it on UCF101 with these system specs?

congee524 commented 1 year ago

You can try it using the distilled ViT-small model.

Mr-MeerMoazzam commented 1 year ago

Could you please suggest which of the finetune scripts I should follow to fine-tune it?

congee524 commented 1 year ago

Read the finetune doc and write a new script based on the existing ones.

Mr-MeerMoazzam commented 1 year ago

What is meant by slurm train and dist train?