Open zhengrongz opened 1 month ago
I also don't use flashattn, deepspeed, fused_rmsnorm, or fused_mlp, but I don't think that affects the inference results.
Hi! Can you try to reproduce the results on a small dataset such as UCF101? That way you can check whether you have loaded the weights correctly.
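One quick way to check weight loading, before running any dataset, is to compare the parameter names the model expects with the names stored in the checkpoint. The sketch below is only an illustration, not code from this repository: `diff_state_dict_keys` is a hypothetical helper, and the `module.` prefix handling assumes the checkpoint may have been saved from a wrapped (e.g. data-parallel) model.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys, strip_prefix="module."):
    """Compare parameter names expected by a model with those in a checkpoint.

    Hypothetical helper for illustration. Both arguments are iterables of
    strings, e.g. model.state_dict().keys() and checkpoint.keys().
    """
    def clean(name):
        # Drop a wrapper prefix such as "module." if present (assumption).
        if strip_prefix and name.startswith(strip_prefix):
            return name[len(strip_prefix):]
        return name

    expected = {clean(k) for k in model_keys}
    provided = {clean(k) for k in checkpoint_keys}
    return {
        "missing": sorted(expected - provided),     # model params left at init values
        "unexpected": sorted(provided - expected),  # checkpoint entries that were ignored
        "matched": len(expected & provided),
    }
```

If `missing` is long, those layers are still randomly initialized, which would be enough to produce near-random rankings. In PyTorch, `model.load_state_dict(state, strict=False)` returns the same information as `missing_keys` and `unexpected_keys`, so printing that return value is an equivalent check.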
Hi! I have tried InternVideo2-1B-CLIP on the action recognition task with the K400 dataset, using the model without the dataset class you designed. On the vision side, I sample 8 frames from the video, transform them with test_transform, and feed the processed clip into the vision encoder to get a 1x768 feature. On the text side, I use the k400_categories.txt and kinetics_prompt you provided; after the text encoder this gives 400x16x768 features. Finally, I pass these two features to get_sim and rank the categories, but the result is very bad: the correct answer is never in the top 5, and the model seems to rank the categories randomly. I don't know what is wrong. The models I use are chinese_alpaca_lora_7b, InternVideo2-stage2_1b-224p-f4.pt, internvl_c_13b_224px.pth, and InternVideo2_CLIP_1B.pth.
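For reference, the ranking step described above can be sketched as a generic CLIP-style zero-shot classifier. This is a minimal sketch under stated assumptions, not the repository's `get_sim`: the function names are hypothetical, it assumes the 16 prompt embeddings per class are averaged into one class embedding, and it assumes cosine similarity is the score. It is written in plain Python so it runs without dependencies.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def class_embeddings(text_feats):
    """Average normalized prompt embeddings per class (assumed ensembling).

    text_feats: [num_classes][num_prompts][dim] -> [num_classes][dim]
    """
    out = []
    for prompts in text_feats:
        normed = [l2_normalize(p) for p in prompts]
        mean = [sum(col) / len(normed) for col in zip(*normed)]
        out.append(l2_normalize(mean))
    return out

def top_k_classes(video_feat, text_feats, k=5):
    """Rank classes by cosine similarity to a single video feature."""
    v = l2_normalize(video_feat)
    scores = [sum(a * b for a, b in zip(v, c)) for c in class_embeddings(text_feats)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores
```

With shapes like the ones above, `video_feat` would be the 768-dimensional vector and `text_feats` the 400x16x768 nested list. Comparing a hand-written ranking like this against the output of get_sim can help separate a scoring problem (unnormalized features, prompts not averaged, class order not matching k400_categories.txt) from a weight-loading or preprocessing problem.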