A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0 · 981 stars · 64 forks
The checkpoint for frame32 video action recognition is the same as the one for frame16. #31
Thank you for your excellent work!
I want to use the fine-tuned video recognition model for downstream tasks, but I found that the frame32 checkpoint also has an input shape of [1, 16, 1536]. I suspect both uploaded checkpoints are actually for the frame16 model. The warning is as follows:
"size mismatch for backbone.image_adapter.temporal_embedding: copying a param with shape torch.Size([1, 16, 1536]) from checkpoint, the shape in current model is torch.Size([1, 32, 1536])."
Can you please upload the correct checkpoint for the frame32 setting? Thank you so much!
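For now, I am working around the mismatch by linearly interpolating the 16-frame temporal embedding in the checkpoint up to 32 frames before loading it. This is just a sketch, not an official fix; the helper name is my own, and only the parameter key and shapes are taken from the warning above:

```python
import torch
import torch.nn.functional as F

def resize_temporal_embedding(state_dict,
                              key="backbone.image_adapter.temporal_embedding",
                              target_frames=32):
    """Linearly interpolate a [1, T, C] temporal embedding to target_frames.

    Workaround sketch: stretches the frame16 embedding to frame32 so the
    checkpoint loads; results may differ from a properly trained frame32 model.
    """
    emb = state_dict[key]              # e.g. [1, 16, 1536] from the checkpoint
    emb = emb.permute(0, 2, 1)         # -> [1, C, T], the layout interpolate expects
    emb = F.interpolate(emb, size=target_frames, mode="linear", align_corners=False)
    state_dict[key] = emb.permute(0, 2, 1)  # back to [1, target_frames, C]
    return state_dict
```

This lets the checkpoint load without the size-mismatch error, but of course a correctly trained frame32 checkpoint would still be preferable.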