OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Questions about how to calculate metrics #64

Closed aTunass closed 9 months ago

aTunass commented 10 months ago

Hello, I'm new to this field and a bit confused about how the metrics are calculated on the MSRVTT set, where each video has 20 corresponding descriptive captions. How is the correlation matrix between captions and videos computed, given that the test set has only 2990 videos but 2990 x 20 = 59800 captions? I have read your code but still don't understand the core point here. I hope you can explain this to me.

Andy1621 commented 9 months ago

Hi! For testing, each video has only one caption, so the metrics can be computed directly. When a video has multiple captions, there are two options: (1) concatenate the captions into a single paragraph, as in DiDeMo; (2) treat all of the captions as ground truth, as in MSVD.
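
To make the two cases concrete, here is a minimal sketch of Recall@K computed from a similarity matrix. It is not taken from the InternVideo code base: the function names `recall_at_k` and `build_gt` and the `caption_to_video` list are hypothetical, and the similarity scores are assumed to come from whatever text and video encoders you use. With one caption per video, the matrix is square and the ground truth for query `i` is simply `{i}`. With multiple captions as ground truth, each caption is its own text query, and a video query counts as a hit if any of its captions is retrieved.

```python
import numpy as np


def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent (illustrative helper, not the repository's API).

    sim: array of shape (num_queries, num_candidates); higher is more similar.
    gt:  list of non-empty sets; gt[i] holds the correct candidate indices
         for query i.
    """
    # Candidate indices per query, sorted from most to least similar.
    order = np.argsort(-sim, axis=1)
    ranks = []
    for i, correct in enumerate(gt):
        # 0-based rank of the best-placed correct candidate for this query.
        hits = np.isin(order[i], list(correct))
        ranks.append(int(np.argmax(hits)))
    ranks = np.array(ranks)
    return {f"R@{k}": float(np.mean(ranks < k) * 100) for k in ks}


def build_gt(caption_to_video, num_videos):
    """Build ground-truth sets for the multi-caption case.

    caption_to_video[c] is the index of the video that caption c describes
    (an assumed input format).
    """
    # Text-to-video: every caption has exactly one correct video.
    t2v = [{v} for v in caption_to_video]
    # Video-to-text: every caption of a video is a correct answer.
    v2t = [set() for _ in range(num_videos)]
    for c, v in enumerate(caption_to_video):
        v2t[v].add(c)
    return t2v, v2t
```

For text-to-video retrieval, pass a `(num_captions, num_videos)` matrix together with the `t2v` sets. For video-to-text retrieval, pass its transpose together with the `v2t` sets. In the one-caption-per-video case, `gt = [{i} for i in range(num_videos)]` works in both directions.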