We tested VideoCLIP on the zero-shot video-text retrieval task on the COIN dataset, but the accuracy is much lower than the VideoQA result reported in the paper (26% << 74%), even though VideoQA can also be formulated as a video-text retrieval task.
We follow the inference demo: for every video clip in COIN, we search for the most similar label in the task-level candidate label pool. The resulting accuracy is about 26%, far below the 74% reported on MSR-VTT. Given that both MSR-VTT and COIN involve a domain shift from the HowTo100M pre-training data, we expected VideoCLIP to perform better on COIN. Is there any possible reason for the inferior performance on COIN, or anything else in the code worth noticing? Thanks a lot!
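For reference, here is a minimal sketch of our evaluation loop. The embeddings are assumed to come from the model's video and text encoders (the function below only does the similarity search itself; `zero_shot_label_retrieval` is our own helper, not part of the VideoCLIP codebase):

```python
import numpy as np

def zero_shot_label_retrieval(video_emb, label_embs):
    """Return the index of the candidate label most similar to one clip.

    video_emb:  (d,)   embedding of a single video clip
    label_embs: (n, d) embeddings of the n candidate labels
    Similarity is cosine similarity, following the inference demo.
    """
    v = video_emb / np.linalg.norm(video_emb)
    t = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

# Toy check with hand-made 2-D embeddings (real ones come from the model):
labels = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
clip = np.array([0.1, 0.9])
print(zero_shot_label_retrieval(clip, labels))  # -> 1
```

Accuracy is then just the fraction of clips whose retrieved label index matches the ground-truth label.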