OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Wrong results in Action Recognition task. #133

Open zhengrongz opened 1 month ago

zhengrongz commented 1 month ago

Hi! I tried Internvideo2-1B-clip on the action recognition task with the K400 dataset, without using the dataset class you designed. On the vision side, I sample 8 frames from each video, transform them with test_transform, and feed the processed clip into the vision encoder to get a 1x768 feature. On the text side, I use the k400_categories.txt and kinetics_prompt you provided; the text encoder outputs 400x16x768 features. Finally, I pass these two features to get_sim and rank the categories, but the results are very bad: the correct answer is never in the top-5 predictions, and the model seems to rank the categories randomly. I don't know what is going wrong. The weights I use are chinese_alpaca_lora_7b, InternVideo2-stage2_1b-224p-f4.pt, internvl_c_13b_224px.pth, and InternVideo2_CLIP_1B.pth.
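
For reference, the ranking step described above can be checked on its own with synthetic features. This is only a minimal sketch and not the repository's get_sim: it assumes the 400x16x768 text features are L2-normalized, averaged over the 16 prompts per class, and compared to the 1x768 video feature by cosine similarity. The names `l2_normalize` and `rank_classes` are hypothetical, and the features are random numbers rather than model outputs.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def rank_classes(video_feat, text_feats, top_k=5):
    """Rank classes by cosine similarity to one video feature.

    video_feat: (1, D) array, one pooled clip feature.
    text_feats: (C, P, D) array, P prompt embeddings per class.
    Returns (indices of the top_k classes, all C scores).
    """
    v = l2_normalize(np.asarray(video_feat, dtype=np.float64))  # (1, D)
    t = l2_normalize(np.asarray(text_feats, dtype=np.float64))  # (C, P, D)
    # Assumption: one embedding per class, the mean over its prompts.
    class_emb = l2_normalize(t.mean(axis=1))                    # (C, D)
    scores = (v @ class_emb.T)[0]                               # (C,)
    top = np.argsort(-scores)[:top_k]
    return top, scores


# Synthetic check: build a "video" feature close to one known class.
rng = np.random.default_rng(0)
C, P, D = 400, 16, 768
text_feats = rng.normal(size=(C, P, D))
true_class = 123
video_feat = text_feats[true_class].mean(axis=0, keepdims=True)
video_feat = video_feat + 0.1 * rng.normal(size=(1, D))

top5, scores = rank_classes(video_feat, text_feats)
print("true class in top-5:", true_class in top5)
```

If a check like this ranks the known class first but real features still look random, the ranking code is probably not the problem, and the features themselves (preprocessing, weights, or a mismatch between the vision and text projections) would be the next thing to inspect.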

zhengrongz commented 1 month ago

I also don't use flashattn, deepspeed, fused_rmsnorm, or fused_mlp, but I don't think that affects the inference results.

Andy1621 commented 1 month ago

Hi! Can you try to reproduce the results on a small dataset such as UCF101? That way you can check whether you have loaded the weights correctly.
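
A quicker, complementary check is to compare parameter names between the model and the checkpoint. In PyTorch, `load_state_dict(..., strict=False)` returns the missing and unexpected keys; the sketch below does the same comparison in plain Python so it can be run on any two lists of names. The helper `compare_keys` and the key names in the example are hypothetical, and stripping a `module.` prefix is an assumption about how the checkpoint was saved.

```python
def compare_keys(model_keys, ckpt_keys, strip_prefixes=("module.",)):
    """Return (missing, unexpected) parameter names.

    missing: names the model expects but the checkpoint lacks.
    unexpected: names in the checkpoint that the model does not have.
    """
    def strip(name):
        # Drop a wrapper prefix if present (assumed, e.g. "module.").
        for prefix in strip_prefixes:
            if name.startswith(prefix):
                return name[len(prefix):]
        return name

    model_set = set(model_keys)
    ckpt_set = {strip(k) for k in ckpt_keys}
    missing = sorted(model_set - ckpt_set)
    unexpected = sorted(ckpt_set - model_set)
    return missing, unexpected


# Toy example with made-up parameter names.
model_keys = ["vision_encoder.proj.weight", "text_encoder.proj.weight"]
ckpt_keys = ["module.vision_encoder.proj.weight", "module.extra.bias"]
missing, unexpected = compare_keys(model_keys, ckpt_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

A long list of missing keys would suggest that part of the model is still randomly initialized, which could produce rankings that look random.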