Open LiuHuijie6410 opened 2 months ago
Yeah, I want to ask this too. I'm not sure how we should threshold out good matches, given that the cosine similarities almost always lie between 0.2 and 0.4. I can see that softmax(100 * score) is used to get the relative closeness among a set of candidates, but this doesn't help to exclude unmatched candidates.
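To make the softmax point concrete, here is a minimal sketch in plain Python. The similarity values are made up for illustration and are not ViCLIP outputs, and the `softmax` helper is my own, not the repo's code. It shows that softmax(100 * score) depends only on the differences between candidates, so lowering every score by the same amount leaves the probabilities unchanged.

```python
import math

def softmax(scores, scale=100.0):
    """Softmax over scaled scores; subtracting the max keeps exp() stable."""
    scaled = [scale * s for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for three candidate captions (made-up values).
scores = [0.20, 0.25, 0.31]
# The same scores lowered by 0.10, i.e. every candidate is a worse match.
shifted = [s - 0.10 for s in scores]

probs = softmax(scores)
probs_shifted = softmax(shifted)

print(probs)          # the best candidate takes almost all of the probability mass
print(probs_shifted)  # the same distribution, up to floating-point error
```

So a high softmax probability only says "best among these candidates". To reject unmatched candidates you would presumably have to threshold the raw cosine similarity itself, with a cutoff calibrated on known matched and unmatched pairs.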
Thanks for such beautiful work! In the past, the similarity between a video and a text was usually computed by calculating the similarity between each frame and the text with an image-text CLIP model and then taking the average. If the text and video are aligned, the value calculated this way is usually above 0.9. However, the value calculated using ViCLIP is only about 0.3. Could you explain the reason?
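For reference, the frame-averaging baseline described above can be sketched like this. It is only an illustration of the averaging step: the vectors are toy 2-D stand-ins rather than real CLIP embeddings, and `cosine` and `mean_frame_similarity` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_frame_similarity(frame_embs, text_emb):
    """Average the per-frame cosine similarity to a single text embedding."""
    sims = [cosine(f, text_emb) for f in frame_embs]
    return sum(sims) / len(sims)

# Toy stand-ins for per-frame image embeddings and one text embedding.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]

score = mean_frame_similarity(frames, text)
print(score)
```

Note that this returns a raw averaged cosine similarity, with no softmax or other rescaling applied afterwards.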