Hi, thanks for your great work!
In the InternVid paper, it is stated that a video-text matching loss is also added for ViCLIP during text-video retrieval fine-tuning.
What confuses me is how exactly this loss is computed. Isn't the video-text matching loss usually implemented by adding a binary classification head on top of cross-modal (fused) features? How should this be done for ViCLIP, which (as far as I can tell) only has separate video and text encoders without a fusion module?
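For reference, my understanding of the usual video-text matching setup is something like the sketch below (PyTorch, ALBEF/BLIP-style; all names are illustrative and I'm not claiming this is what ViCLIP actually does):

```python
# Minimal sketch of a typical video-text matching (VTM) head: a binary
# classifier on top of fused cross-modal features. Illustrative only,
# not taken from the ViCLIP codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VTMHead(nn.Module):
    """Binary (matched / not matched) classifier over fused features."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, fused_cls: torch.Tensor) -> torch.Tensor:
        # fused_cls: [B, hidden_dim], e.g. the [CLS] output of a
        # cross-attention encoder that fuses video and text features.
        return self.classifier(fused_cls)


if __name__ == "__main__":
    B, D = 4, 512
    head = VTMHead(D)
    fused_pos = torch.randn(B, D)  # fused features of matched pairs
    fused_neg = torch.randn(B, D)  # fused features of (hard) negative pairs
    logits = head(torch.cat([fused_pos, fused_neg], dim=0))
    labels = torch.cat([torch.ones(B, dtype=torch.long),
                        torch.zeros(B, dtype=torch.long)])
    vtm_loss = F.cross_entropy(logits, labels)
    print(vtm_loss.item())
```

Since ViCLIP has no such cross-modal encoder, I'm unsure where this head (or an equivalent loss) would be attached.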
Looking forward to your reply~
@shepnerd