PKU-YuanGroup / Video-LLaVA

【EMNLP 2024πŸ”₯】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Evaluation on MSVD differs from the reported results (0.49, 2.9 vs 0.703, 3.9) #96

Open · orrzohar opened this issue 7 months ago

orrzohar commented 7 months ago

When I evaluate the model you released, I get the following:

completed_files: 0
incomplete_files: 13157
Error processing file 'v_klFyrnrUSck_87_100_6': invalid syntax (<unknown>, line 1)
Error processing file 'v_klteYv1Uv9A_27_33_14': invalid syntax (<unknown>, line 1)
Error processing file 'v_k06Ge9ANKM8_5_16_14': invalid syntax (<unknown>, line 1)
Error processing file 'v_t4vP-cXXWkY_14_20_1': invalid syntax (<unknown>, line 1)
Error processing file 'v_tBj4Ny19vfQ_54_59_0': invalid syntax (<unknown>, line 1)
Error processing file 'v_wLUH7qA_6sA_90_115_14': invalid syntax (<unknown>, line 1)
Error processing file 'v_pFSoWsocv0g_8_17_2': invalid syntax (<unknown>, line 1)
Error processing file 'v_tn1d5DmdMqY_15_28_4': invalid syntax (<unknown>, line 1)
Error processing file 'v_n_Z0-giaspE_270_278_1': invalid syntax (<unknown>, line 1)
Error processing file 'v_uAaWVeaYLdQ_1_12_1': invalid syntax (<unknown>, line 1)
Error processing file 'v_pptYu3YQnxY_160_170_20': invalid syntax (<unknown>, line 1)
Error processing file 'v_o4pL7FObqds_243_263_27': invalid syntax (<unknown>, line 1)
Error processing file 'v_o4pL7FObqds_243_263_30': invalid syntax (<unknown>, line 1)
Error processing file 'v_uxEhH6MPH28_69_85_10': invalid syntax (<unknown>, line 1)
Error processing file 'v_qPXynwa_2iM_15_25_17': invalid syntax (<unknown>, line 1)
Error processing file 'v_kk3TIio1-Uw_5_14_21': invalid syntax (<unknown>, line 1)
Error processing file 'v_qeKX-N1nKiM_0_5_3': invalid syntax (<unknown>, line 1)
completed_files: 13140
incomplete_files: 17
completed_files: 13157
incomplete_files: 0
All evaluation completed!
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 13157/13157 [00:00<00:00, 1106322.20it/s]
Yes count: 6462
No count: 6695
Accuracy: 0.4911453978870563
Average score: 2.9223227179448203
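
The "invalid syntax (<unknown>, line 1)" lines above are consistent with the evaluation script parsing each GPT reply as a Python literal and retrying any file whose reply could not be parsed; a minimal sketch of that pattern, assuming the script uses ast.literal_eval (the helper names below are hypothetical, not the script's actual functions):

```python
import ast

def parse_gpt_reply(reply_text: str) -> dict:
    # GPT is asked to answer with a dict-like string such as
    # "{'pred': 'yes', 'score': 4}"; ast.literal_eval raises
    # "invalid syntax (<unknown>, line 1)" when the reply is malformed.
    return ast.literal_eval(reply_text)

def annotate_file(video_id: str, reply_text: str, results: dict) -> None:
    # Hypothetical per-file helper mirroring the retry behaviour in the log:
    # a failed parse is printed, the file stays "incomplete", and the outer
    # loop re-queries GPT until incomplete_files reaches 0.
    try:
        results[video_id] = parse_gpt_reply(reply_text)
    except (SyntaxError, ValueError) as err:
        print(f"Error processing file '{video_id}': {err}")
```

Because failed files are retried on later passes, the completed/incomplete counters climb until everything is scored, which matches the counts in the log above.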

All I did was:

  1. Downloaded your model by running the inference script you provided.
  2. Ran the MSVD evaluation: CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh
  3. Evaluated the QA results: bash scripts/v1_5/eval/eval_qa_msvd.sh

LinB203 commented 7 months ago

We were confused about this as well. However, I repeated the experiment twice and got an accuracy above 67 both times. Nonetheless, some people report being unable to reproduce the MSVD performance, e.g., https://github.com/PKU-YuanGroup/Video-LLaVA/issues/36#issue-2031834153, while others reproduce the same results we did, e.g., https://github.com/PKU-YuanGroup/Video-LLaVA/issues/37#issue-2032217679 and https://github.com/PKU-YuanGroup/Video-LLaVA/issues/36#issuecomment-1926301528. I suspect the inconsistent results are caused by GPT version migration.

I have also observed similar problems in other work, e.g., https://github.com/mbzuai-oryx/Video-ChatGPT/issues/28

Perhaps we should find a more stable, non-GPT evaluation method.
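
As a rough illustration of what a non-GPT alternative could look like, here is a minimal sketch of a deterministic matching-based scorer over the prediction file; this is not the repository's method, just an example of a judge-free metric, and the per-line JSON format and field names (answer, pred) are assumptions:

```python
import json

def judge_free_accuracy(pred_path: str) -> float:
    """Deterministic accuracy: count a prediction correct if the ground-truth
    answer appears (case-insensitively) in the model output. No GPT judge is
    involved, so results do not drift with API version changes."""
    correct, total = 0, 0
    with open(pred_path) as f:
        for line in f:                     # one JSON record per line (assumed format)
            item = json.loads(line)
            answer = item["answer"].strip().lower()
            pred = item["pred"].strip().lower()
            correct += int(answer in pred)
            total += 1
    return correct / max(total, 1)
```

String matching like this is cruder than a GPT judge (it misses paraphrases), but it is stable and free to re-run.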

orrzohar commented 7 months ago

An easy variation would be to use Vicuna as the evaluator; since its weights are open-source, the results would be more comparable...

At the very least, it would make sense to set the temperature to 0, so that the generated text has less randomness. I am not sure what to do about the version migration; it seems problematic if, every time ChatGPT migrates to a new version, all the numbers have to be re-evaluated for reproducibility.
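
A minimal sketch of both mitigations, assuming the eval script builds its request with openai.ChatCompletion.create (as Video-ChatGPT-style evaluation scripts do): pin a dated model snapshot and set temperature=0 so the judge is as deterministic as the API allows. The snapshot name below is an example, not necessarily the one used for the paper's numbers.

```python
import openai

def query_judge(messages):
    # Pinning a dated snapshot (example name) avoids silent behaviour changes
    # when the "gpt-3.5-turbo" alias migrates to a newer model; temperature=0
    # removes most sampling randomness from the judge's scoring.
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # example pinned snapshot, not necessarily the paper's
        messages=messages,
        temperature=0,
    )
    return completion["choices"][0]["message"]["content"]
```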

orrzohar commented 7 months ago

By the way, when I evaluate TGIF I get:

Yes count: 9249
No count: 16502
Accuracy: 0.3591705176498
Average score: 2.5519785639392647

LinB203 commented 7 months ago

What's your GPT version? We use gpt-3.5-turbo.

orrzohar commented 7 months ago

I didn't change your eval files; the default you use is GPT-3.5 (gpt-3.5-turbo):

https://github.com/PKU-YuanGroup/Video-LLaVA/blob/e93f4927eaa926ed8450b481fde95c994ed23d2d/videollava/eval/video/eval_video_qa.py#L39

orrzohar commented 7 months ago

The reason I think you may have uploaded the wrong model to transformers is that I get the following (top row is the model you released; bottom row is a model I pretrained myself with similar data and instruction tuning hyperparameters):

[image: table comparing benchmark results for the released checkpoint (top row) and the self-pretrained model (bottom row)]
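
One way to rule out a checkpoint mix-up is to check exactly which revision of the Hub model is being loaded; a small sketch using huggingface_hub (the repo id below is the commonly used Video-LLaVA checkpoint and may differ from the one you intend):

```python
from huggingface_hub import HfApi

api = HfApi()
# Example repo id; replace with the checkpoint actually referenced in the eval scripts.
info = api.model_info("LanguageBind/Video-LLaVA-7B")
print(info.sha)            # commit hash of the checkpoint currently on the Hub
for f in info.siblings:    # file list, useful to compare against a local download
    print(f.rfilename)
```
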
LinB203 commented 7 months ago

The reason I think you may have uploaded the wrong model to transformers is that I get the following (top row is the model you released; bottom row is a model I pretrained myself with similar data and instruction tuning hyperparameters): [image]

I will check it.