Some doubts about the absolute value of ViCLIP similarity

OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding

Apache License 2.0

1.44k stars, 88 forks

Some doubts about the absolute value of ViCLIP similarity #175

Open LiuHuijie6410 opened 2 months ago

LiuHuijie6410 commented 2 months ago

Thanks for such beautiful work! In the past, the similarity between a video and a text was usually computed by calculating the similarity between each frame and the text with an image-text CLIP model and then taking the average over frames. If the text and video are aligned, the value calculated in this way is usually above 0.9. However, the value calculated using ViCLIP is only about 0.3. Could you explain the reason?
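For reference, the frame-averaging baseline described in this comment can be sketched roughly as follows. This is a minimal illustration in plain Python, not code from InternVideo or any CLIP library: the helper names `cosine` and `frame_averaged_similarity` are hypothetical, and the 3-dimensional embeddings are made-up toy values standing in for real per-frame and text embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def frame_averaged_similarity(frame_embeddings, text_embedding):
    """Mean of the per-frame cosine similarities to the text embedding."""
    sims = [cosine(frame, text_embedding) for frame in frame_embeddings]
    return sum(sims) / len(sims)

# Toy embeddings, invented for illustration only (not real model outputs).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text = [1.0, 1.0, 0.0]

score = frame_averaged_similarity(frames, text)
print(f"frame-averaged cosine similarity: {score:.4f}")
```

A video-level model such as ViCLIP would instead produce a single embedding for the whole clip, so the comparison reduces to one `cosine` call between that embedding and the text embedding.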

zmy1116 commented 2 months ago

Yeah, I want to ask this too. I'm not sure how we should threshold out good matches, given that the cosine similarities almost always lie between 0.2 and 0.4. I can see that softmax(100 * score) is used to get the relative closeness among a set of candidates, but this doesn't help to exclude unmatched candidates.
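The limitation raised here can be illustrated with a small sketch. It assumes the scoring is exactly `softmax(100 * score)` over a candidate set, as stated in the comment; the function names and the example scores are hypothetical and not taken from the ViCLIP code. Because softmax depends only on the differences between scores, a set of uniformly poor candidates can yield the same distribution as a set containing a genuine match.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relative_closeness(cosine_scores, scale=100.0):
    """softmax(scale * score); the scale of 100 is taken from the comment."""
    return softmax([scale * s for s in cosine_scores])

# Hypothetical cosine similarities: one set in the typical 0.2-0.4 range,
# and one set shifted down by 0.2 so that every candidate is a weak match.
plausible = relative_closeness([0.35, 0.25, 0.22])
all_weak = relative_closeness([0.15, 0.05, 0.02])

print(plausible)
print(all_weak)  # same distribution up to floating-point rounding
```

Both calls rank the first candidate as the overwhelming winner, so the softmax output alone cannot tell whether any candidate actually matches. Rejecting unmatched candidates would need a criterion on the raw cosine values, for example a threshold calibrated on known matched and unmatched pairs.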