Hi, thanks for your great work!
In the InternVid paper, it is stated that a video-text matching loss is also added for ViCLIP during text-video retrieval fine-tuning.
What confuses me is how exactly this loss is computed. Isn't the video-text matching loss usually implemented by adding a binary classification head on top of cross-modal (fused) features? How should this be done for ViCLIP, which (as far as I can tell) only has separate video and text encoders without a fusion module?
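For reference, my understanding of the usual video-text matching setup is something like the sketch below (PyTorch, ALBEF/BLIP-style; all names are illustrative and I'm not claiming this is what ViCLIP actually does):

```python
# Minimal sketch of a typical video-text matching (VTM) head: a binary
# classifier on top of fused cross-modal features. Illustrative only,
# not taken from the ViCLIP codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VTMHead(nn.Module):
    """Binary (matched / not matched) classifier over fused features."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, fused_cls: torch.Tensor) -> torch.Tensor:
        # fused_cls: [B, hidden_dim], e.g. the [CLS] output of a
        # cross-attention encoder that fuses video and text features.
        return self.classifier(fused_cls)


if __name__ == "__main__":
    B, D = 4, 512
    head = VTMHead(D)
    fused_pos = torch.randn(B, D)  # fused features of matched pairs
    fused_neg = torch.randn(B, D)  # fused features of (hard) negative pairs
    logits = head(torch.cat([fused_pos, fused_neg], dim=0))
    labels = torch.cat([torch.ones(B, dtype=torch.long),
                        torch.zeros(B, dtype=torch.long)])
    vtm_loss = F.cross_entropy(logits, labels)
    print(vtm_loss.item())
```

Since ViCLIP has no such cross-modal encoder, I'm unsure where this head (or an equivalent loss) would be attached.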
Looking forward to your reply~
@shepnerd