facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Inferior performance of VideoClip on Video-text retrieval task using COIN dataset. #4410

Open DuL1nk opened 2 years ago

DuL1nk commented 2 years ago

We tested VideoClip on a video-text retrieval task using the COIN dataset, but the accuracy is much lower than the VideoQA result reported in the paper (26% << 74%), even though VideoQA can itself be formulated as a video-text retrieval task.

We followed the inference demo and, for every video clip in the COIN dataset, searched for the most similar label in the task-level candidate label pool. The accuracy is about 26% (<< 74% reported on MSR-VTT). Comparing the domain shift from HowTo100M to MSR-VTT with the domain shift from HowTo100M to COIN, we expected VideoClip to perform better on COIN. Is there any reason that might explain the inferior performance on COIN, or anything else in the code worth checking? Thanks a lot!
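For reference, the evaluation described above can be sketched roughly as follows. This is a minimal illustration and not the actual script: `top1_accuracy`, the toy embeddings, and the use of cosine similarity are all assumptions made for the example, and the scoring used by the inference demo may differ (for instance, an unnormalized dot product).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def top1_accuracy(clip_embeddings, label_embeddings, true_label_ids):
    """Hypothetical helper: fraction of clips whose most similar label is correct.

    Each clip is compared against every candidate label embedding, and the
    label with the highest cosine similarity is taken as the prediction.
    """
    labels = [l2_normalize(t) for t in label_embeddings]
    correct = 0
    for clip, true_id in zip(clip_embeddings, true_label_ids):
        clip = l2_normalize(clip)
        scores = [dot(clip, t) for t in labels]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == true_id)
    return correct / len(true_label_ids)


# Toy data standing in for real model outputs (assumed, not from COIN).
clips = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.8]]
labels = [[1.0, 0.0], [0.0, 1.0]]
truth = [0, 1, 1]

accuracy = top1_accuracy(clips, labels, truth)
print(f"top-1 accuracy: {accuracy:.3f}")
```

With a setup like this, the size and wording of the candidate label pool directly affect the score, so those may be worth comparing against the paper's protocol.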

qingy1337 commented 2 months ago

Hi, could you please share your package versions, your pip version, and any other relevant details about your setup? I can't get the example to run on my machine.