OFA-Sys / ONE-PEACE

A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0

The checkpoint for frame32 video action recognition is the same as that for frame16. #31

Closed ttgeng233 closed 1 year ago

ttgeng233 commented 1 year ago

Thank you for your excellent work! I want to use the fine-tuned model for video recognition on downstream tasks, but I found that the frame32 checkpoint also has a temporal embedding of shape [1, 16, 1536]. I suspect there might be a mistake and that both checkpoints are for the frame16 model. The warning is as follows: "size mismatch for backbone.image_adapter.temporal_embedding: copying a param with shape torch.Size([1, 16, 1536]) from checkpoint, the shape in current model is torch.Size([1, 32, 1536])."

Can you please upload the correct checkpoint for the frame32 setting? Thank you so much!

simonJJJ commented 1 year ago

Hi @ttgeng233, to clarify: there is only one model, trained with 16 frames. We do not train a separate frame32 model; we only evaluate the frame16 model under the frame32 setting.
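Evaluating a frame16 checkpoint under a frame32 setting typically requires resizing the learned temporal embedding before loading the state dict. The helper below is a minimal sketch of that common workaround, not the repository's actual evaluation code; the function name `resize_temporal_embedding` is hypothetical, but the `[1, 16, 1536]` shape matches the parameter named in the error message.

```python
import torch
import torch.nn.functional as F

def resize_temporal_embedding(temb: torch.Tensor, target_frames: int) -> torch.Tensor:
    """Linearly interpolate a [1, T, C] temporal embedding to [1, target_frames, C].

    Hypothetical helper; ONE-PEACE's own evaluation code may handle this differently.
    """
    # [1, T, C] -> [1, C, T] so F.interpolate resizes along the frame axis
    temb = temb.permute(0, 2, 1)
    temb = F.interpolate(temb, size=target_frames, mode="linear", align_corners=False)
    # Restore the original layout: [1, target_frames, C]
    return temb.permute(0, 2, 1)

# Example: adapt the frame16 checkpoint's embedding for a 32-frame model
temb16 = torch.randn(1, 16, 1536)
temb32 = resize_temporal_embedding(temb16, 32)
print(temb32.shape)  # torch.Size([1, 32, 1536])
```

After resizing, the adjusted tensor can be written back into the checkpoint's state dict (under `backbone.image_adapter.temporal_embedding`) before calling `load_state_dict`, which avoids the size-mismatch warning above.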

Feel free to drop other questions.