OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Can't reproduce zero-shot results on MSR-VTT #32

Closed Kirillova-Anastasia closed 1 year ago

Kirillova-Anastasia commented 1 year ago

Hi, thanks for your interesting work! I'm running your ./zeroshot_scripts/eval_msrvtt.sh script without changing any parameters, but I get slightly different results from yours:

INFO:logger:DSL Text-to-Video:
INFO:logger:    >>>  R@1: 38.0 - R@5: 65.0 - R@10: 73.5  - Median R: 2.0 - Mean R: 27.0
INFO:logger:DSL Video-to-Text:
INFO:logger:    >>>  V2T$R@1: 41.0 - V2T$R@5: 65.2 - V2T$R@10: 74.4  - V2T$Median R: 2.0 - V2T$Mean R: 20.6
INFO:logger:------------------------------------------------------------
INFO:logger:Text-to-Video:
INFO:logger:    >>>  R@1: 35.4 - R@5: 58.3 - R@10: 68.4  - Median R: 3.0 - Mean R: 32.3
INFO:logger:Video-to-Text:
INFO:logger:    >>>  V2T$R@1: 31.7 - V2T$R@5: 54.5 - V2T$R@10: 65.1 - V2T$Median R: 4.0 - V2T$Mean R: 34.6
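For reference, the R@K, Median R, and Mean R numbers in logs like the above are typically computed from a text-video similarity matrix whose diagonal holds the ground-truth pairs. A minimal sketch (the function name is mine, not from this repo):

```python
import numpy as np

def retrieval_metrics(sim):
    """Compute R@1/5/10, Median R, Mean R from a (num_texts, num_videos)
    similarity matrix where sim[i, i] is the ground-truth pair."""
    # Sort candidates for each query, best match first.
    order = np.argsort(-sim, axis=1)
    # 0-based rank of the correct video for each text query.
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return {
        "R@1": 100.0 * np.mean(ranks < 1),
        "R@5": 100.0 * np.mean(ranks < 5),
        "R@10": 100.0 * np.mean(ranks < 10),
        "MedianR": float(np.median(ranks) + 1),
        "MeanR": float(np.mean(ranks) + 1),
    }
```

Video-to-text metrics are the same computation on the transposed matrix.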

I have several questions:

  1. Were the results in the paper and the README file obtained with dual softmax loss?
  2. Are the parameters in the zero-shot script the same as the ones you used in your zero-shot experiments?
  3. Do you think minor dependency differences could affect the results this much? If so, could you publish a file with your environment's dependencies?
Jazzcharles commented 1 year ago
  1. The reported results are obtained with dual softmax loss added. Please refer to the results in the README file for the released checkpoint.
  2. Yes, they should be the same.
  3. We encourage you to check the zero-shot performance on other datasets, e.g. MSVD, to see whether the performance matches.

The results seem a bit abnormal, as T2V R@1 is always higher than V2T R@1 on MSR-VTT in our experiments.
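For context, the dual-softmax re-ranking mentioned in answer 1 is usually applied at inference time along these lines (a sketch with a hypothetical temperature value, not this repo's exact implementation):

```python
import numpy as np

def dual_softmax_rerank(sim, temp=100.0):
    """Re-rank a (num_texts, num_videos) similarity matrix with a
    DSL-style inference-time prior. `temp` is an illustrative value."""
    # Softmax over the text axis estimates how discriminative each
    # video is; multiplying it back suppresses "hub" videos that score
    # highly against many texts.
    logits = sim * temp
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    prior = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return prior * sim
```

Note that this re-ranking uses the whole test-set similarity matrix at once, which is why the "DSL" numbers in the log differ from the plain ones.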