Hello, I'm new to this field and a bit confused about how to calculate the metrics on the MSRVTT set, where each video has 20 corresponding descriptive captions. How do we compute the correlation matrix between captions and videos, given that the test set has only 2990 videos while the number of captions is 2990 x 20 = 59800? I have read your code but still don't understand the core point here. I hope you can explain this to me.
Hi! For testing, each video has only one caption, so the metrics can be calculated directly. If there are multiple captions per video, there are two options:
(1) Concatenate the captions into a single paragraph, as in DiDeMo;
(2) Treat the multiple captions as ground truth (gt), as in MSVD.
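To make option (2) more concrete, here is a minimal sketch of how retrieval ranks could be computed when several captions share one video. It is not taken from the repository: the function names, the `cap2vid` array (ground-truth video index for each caption), and the toy similarity values are all assumptions for illustration. The similarity matrix is assumed to have shape (num_captions, num_videos), so with 20 captions per video it is simply a tall rectangular matrix rather than a square one.

```python
import numpy as np

def text_to_video_ranks(sim, cap2vid):
    """Rank of the ground-truth video for each caption (0 = retrieved first)."""
    gt_scores = sim[np.arange(sim.shape[0]), cap2vid]
    # Count the videos that score strictly higher than the ground-truth video.
    return (sim > gt_scores[:, None]).sum(axis=1)

def video_to_text_ranks(sim, cap2vid):
    """Rank of the best-scoring ground-truth caption for each video."""
    num_videos = sim.shape[1]
    ranks = np.empty(num_videos, dtype=np.int64)
    for v in range(num_videos):
        scores = sim[:, v]
        gt_mask = cap2vid == v
        best_gt = scores[gt_mask].max()
        # Only captions of OTHER videos can push the ground truth down.
        ranks[v] = (scores[~gt_mask] > best_gt).sum()
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose ground truth lands in the top k."""
    return float((ranks < k).mean())

# Toy data (hypothetical): 2 videos, 2 captions each.
cap2vid = np.array([0, 0, 1, 1])
sim = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.4, 0.6],
])

t2v = text_to_video_ranks(sim, cap2vid)
v2t = video_to_text_ranks(sim, cap2vid)
t2v_r1 = recall_at_k(t2v, 1)
v2t_r1 = recall_at_k(v2t, 1)
```

In this sketch, every caption is its own text-to-video query with exactly one correct video, while a video-to-text query counts as a hit if any of its own captions ranks high enough. Other conventions exist (for example, averaging over the ground-truth captions), so treat this as one reasonable choice rather than the definitive protocol.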