Closed Seolen closed 1 year ago
Yes, in training data, there are multiple corresponding captions for videos. When training, we do not process the problem and just fine-tune the models with VTC (video-text contrastive) and VTM (video-text matching) loss.
In testing data, there is only one caption for a video.
Thanks for your impressive work, I have a question to evaluate video-text retrieval: In datasets such as MSVD and MSRVTT, each video is attached with multiple captions. How do you process this problem for retrieval?