OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0
1.31k stars, 84 forks

Some doubts about the absolute value of ViCLIP similarity #175

Open LiuHuijie6410 opened 1 week ago

LiuHuijie6410 commented 1 week ago

Thanks for such beautiful work! In the past, the similarity between a video and a text was usually computed by calculating the similarity between each frame and the text with an image-text CLIP model and then taking the average over frames. If the text and video are aligned, the value calculated in this way is usually above 0.9. However, the value calculated using ViCLIP is only about 0.3. Could you explain the reason?
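For readers unfamiliar with the frame-averaging baseline described above, here is a minimal sketch of it. The embeddings are toy vectors made up for illustration, not real CLIP outputs, and the function names `cosine` and `frame_averaged_similarity` are hypothetical rather than taken from any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def frame_averaged_similarity(frame_embeddings, text_embedding):
    """Score each frame against the text, then average over frames."""
    sims = [cosine(f, text_embedding) for f in frame_embeddings]
    return sum(sims) / len(sims)

# Toy stand-ins for per-frame image embeddings and one text embedding
# (assumed values, only to make the sketch runnable).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [1.0, 1.0, 0.0]

score = frame_averaged_similarity(frames, text)
print(score)
```

With a real model, the frame and text vectors would come from the image and text encoders; the averaging step itself stays the same.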

zmy1116 commented 1 day ago

Yeah, I want to ask this too. I'm not sure how we should threshold out good matches, given that the cosine similarities almost always lie between 0.2 and 0.4. I can see that softmax(100 * score) is used to get the relative closeness among a set of candidates, but this doesn't help to exclude unmatched candidates.
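To illustrate the point about softmax(100 * score), here is a small sketch using made-up cosine values; the helper names `softmax` and `relative_scores` are hypothetical, and only the scale of 100 comes from the comment above. Softmax depends only on the differences between scores, so shifting every candidate down by the same amount leaves the output essentially unchanged, and a lone candidate always receives probability 1.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def relative_scores(cosine_sims, scale=100.0):
    """softmax(scale * score) over a set of candidate similarities."""
    return softmax([scale * s for s in cosine_sims])

# Assumed toy values: one plausible match among three candidates.
good = relative_scores([0.35, 0.25, 0.22])

# The same candidates, all shifted down by 0.2 (no plausible match at all).
bad = relative_scores([0.15, 0.05, 0.02])

# A single, poorly matching candidate.
single = relative_scores([0.05])

print(good)
print(bad)
print(single)
```

In both three-candidate cases the top candidate gets nearly all of the probability mass, so the softmax output carries no information about whether any candidate is an acceptable match. An accept/reject rule would have to act on the raw cosine values instead.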