OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License
295 stars 16 forks source link

Retrieval dataset process approach #9

Closed Seolen closed 1 year ago

Seolen commented 1 year ago

Thanks for your impressive work, I have a question to evaluate video-text retrieval: In datasets such as MSVD and MSRVTT, each video is attached with multiple captions. How do you process this problem for retrieval?

Andy1621 commented 1 year ago

Yes, in training data, there are multiple corresponding captions for videos. When training, we do not process the problem and just fine-tune the models with VTC (video-text contrastive) and VTM (video-text matching) loss.

In testing data, there is only one caption for a video.