A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0 · 981 stars · 64 forks
The checkpoint for frame32 video action recognition is the same as the one for frame16. #31
Thank you for your excellent work!
I want to use the fine-tuned video recognition model for downstream tasks, but I found that the frame32 checkpoint also has an input shape of [1, 16, 1536]. I suspect both uploaded checkpoints are actually for the frame16 model. The warning is as follows:
"size mismatch for backbone.image_adapter.temporal_embedding: copying a param with shape torch.Size([1, 16, 1536]) from checkpoint, the shape in current model is torch.Size([1, 32, 1536])."
Can you please upload the correct checkpoint for the frame32 setting? Thank you so much!
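For now, I am working around the mismatch by linearly interpolating the 16-frame temporal embedding in the checkpoint up to 32 frames before loading it. This is just a sketch, not an official fix; the helper name is my own, and only the parameter key and shapes are taken from the warning above:

```python
import torch
import torch.nn.functional as F

def resize_temporal_embedding(state_dict,
                              key="backbone.image_adapter.temporal_embedding",
                              target_frames=32):
    """Linearly interpolate a [1, T, C] temporal embedding to target_frames.

    Workaround sketch: stretches the frame16 embedding to frame32 so the
    checkpoint loads; results may differ from a properly trained frame32 model.
    """
    emb = state_dict[key]              # e.g. [1, 16, 1536] from the checkpoint
    emb = emb.permute(0, 2, 1)         # -> [1, C, T], the layout interpolate expects
    emb = F.interpolate(emb, size=target_frames, mode="linear", align_corners=False)
    state_dict[key] = emb.permute(0, 2, 1)  # back to [1, target_frames, C]
    return state_dict
```

This lets the checkpoint load without the size-mismatch error, but of course a correctly trained frame32 checkpoint would still be preferable.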