TXH-mercury / VALOR

Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
https://arxiv.org/abs/2304.08345
MIT License

Questions about how to calculate metrics #22

Closed: aTunass closed this issue 5 months ago

aTunass commented 10 months ago

Hello, I'm new to this field and a bit confused about how the metrics are calculated on the MSRVTT set, where each video has 20 corresponding descriptive captions. How do you build the similarity matrix between captions and videos, given that the test set has only 2990 videos but 2990 x 20 = 59800 captions? I have read your code, but I still don't understand the core idea here. I hope you can explain this to me.

TXH-mercury commented 5 months ago

If you mean the text-to-video retrieval task, the 1kA test set is used (only 1000 videos and 1000 captions are tested).
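
As a rough illustration of what that implies (a minimal sketch, not the repository's actual evaluation code): with one caption per video the similarity matrix is square, and caption i is assumed to match video i. Recall@K can then be read off the rank of each diagonal entry. The function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K from a square similarity matrix.

    sim[i][j] is the similarity between caption i and video j.
    Assumption: the ground-truth video for caption i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        ranks.append(rank)
    n = len(ranks)
    # Percentage of captions whose correct video lands in the top k.
    return {f"R@{k}": 100.0 * sum(r < k for r in ranks) / n for k in ks}


# Toy 3x3 example with made-up scores (not produced by VALOR).
toy_sim = [
    [0.9, 0.1, 0.2],  # caption 0: correct video ranked first
    [0.8, 0.3, 0.1],  # caption 1: correct video ranked second
    [0.2, 0.4, 0.7],  # caption 2: correct video ranked first
]
toy_metrics = recall_at_k(toy_sim, ks=(1, 2))
print(toy_metrics)
```

For the 1000-caption, 1000-video split described above, `sim` would be a 1000 x 1000 matrix, so the 20-captions-per-video question does not arise in this setting.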