PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Can't reproduce results on MSRVTT and MSVD dataset #191

Closed · 1999Lyd closed 2 months ago

1999Lyd commented 2 months ago

Hi, I followed the instructions in the TRAIN_AND_VALIDATE.md file, downloaded the dataset, and ran the evaluation script. However, I only achieved 46% accuracy on MSRVTT and 60% accuracy on MSVD, which are both lower than the results reported in the paper. The checkpoint I used is Video-LLaVA-7B. I’m wondering if there are any additional steps or updated scripts that are not reflected in the README that I should be aware of. Thank you for your help in advance.

wren93 commented 3 weeks ago

Hi, did you figure this out? I got the same result on MSVD

1999Lyd commented 3 weeks ago

> Hi, did you figure this out? I got the same result on MSVD

I found the cause: the `do_sample` argument should be `False` when the temperature is 0, but it is set to `True` in the evaluation scripts. As a workaround, I hardcoded the temperature to a value above 0 inside the transformers code. That said, my results are still slightly lower than those reported in the paper (about 1-2 points), and my result on the TGIF dataset is much lower than reported (only 43%).
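For reference, here is a minimal sketch of the cleaner fix (disabling sampling instead of patching transformers), assuming a standard Hugging Face `generate` call like the one in the eval scripts. `model`, `input_ids`, and `video_tensor` are placeholders for whatever the script already builds:

```python
# Sketch only: `model`, `input_ids`, and `video_tensor` stand in for the
# objects the eval script already constructs. The point is that greedy
# decoding (do_sample=False) must be used when temperature is 0, since
# recent transformers versions reject temperature=0 when do_sample=True.
temperature = 0.0

gen_kwargs = dict(max_new_tokens=1024, use_cache=True)
if temperature > 0:
    gen_kwargs.update(do_sample=True, temperature=temperature)
else:
    gen_kwargs.update(do_sample=False)  # greedy decoding; temperature unused

output_ids = model.generate(input_ids, images=video_tensor, **gen_kwargs)
```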

Jingchensun commented 2 weeks ago

Same here. I set `do_sample` to `False` with the temperature at 0, and only get 36.2 accuracy on the MSVD dataset. Could the authors provide some instructions?