Reproduction of results on MSVD and MSRVTT

ShaneeyS commented 4 months ago

A related issue posted in https://github.com/bytedance/Flash-VStream/issues/2.

After training the model by myself following scripts in this official repo, the evaluation results on MSVD and MSRVTT are largely lower than the ones reported in paper. ~10 points lower

The evaluation script is the same as https://github.com/IVGSZ/Flash-VStream/blob/main/flash_vstream/eval_video/eval_activitynet_qa.py, without any change. The GPT-3.5 turbo version used is 2023-07-01-preview. But I try different GPT-3.5 versions, for example 2024-*, but the results don't seem to change a lot.

As for the official checkpoint, I used the chekpoint in https://huggingface.co/IVGSZ/Flash-VStream-7b. But the error occurs as stated in https://github.com/bytedance/Flash-VStream/issues/2.

Actually, I may not need your original evaluation log file of MSVD and MSRVTT. I just want to know if there are any special training approaches or settings for these two datasets? For example, pretraining / SFT datasets, hyper-parameters, training settings, evaluation settings, ... Since this paper mainly claim their strong abilities on long videos, but the results in Table3 indicates that, the performance in ActivityNet-QA, long video benchmark, has relatively small improvements, but on the 10-seconds-short-videos MSVD-QA and MSRVTT-QA, the performance improves with 10 points. Is this a disagreement with the conclusions of the paper?

Thanks for your reply!

ShaneeyS commented 4 months ago

Or, in other words, could you please tell me where are the performance improvements on MSVD-QA and MSRVTT-QA mainly from? Given that short videos may not actually need a memory...

zhang9302002 commented 4 months ago

Thank you very much!

For the 1st question: As far as we know, MSVD-QA dataset contains about 13k QA pairs. Your experiment result only contains 2.9k yes samples and 1.1k no samples. There seems to be some problems in your dataset preparation. Please make sure that you are using the correct dataset to evaluate.

For the 2nd question: Our proposed method can enhance the video understanding ability generally (not only for long videos). We suppose the improvement on short videos comes from the design of STAR memory, which capture semantic features from different levels. It is certainly consistent with the conclusions of the paper.

ShaneeyS commented 4 months ago

Thank you for the reply!

I have check the MSVD-QA evaluation dataset and re-evaluate the model performance. Here is the results:

100%|█████| 11/11 [00:13<00:00, 1.20s/it] 100%|█████| 572/572 [10:45<00:00, 1.13s/it] 100%|█████| 572/572 [10:48<00:00, 1.13s/it] 100%|█████| 572/572 [10:48<00:00, 1.13s/it] 100%|█████| 572/572 [10:48<00:00, 1.13s/it] 100%|█████| 572/572 [10:49<00:00, 1.14s/it] 100%|█████| 572/572 [10:49<00:00, 1.14s/it] 100%|█████| 572/572 [10:50<00:00, 1.14s/it] 100%|█████| 572/572 [10:51<00:00, 1.14s/it] 100%|█████| 572/572 [10:51<00:00, 1.14s/it] 100%|█████| 572/572 [10:51<00:00, 1.14s/it] 100%|█████| 572/572 [10:52<00:00, 1.14s/it] 100%|█████| 572/572 [10:52<00:00, 1.14s/it] 100%|█████| 572/572 [10:52<00:00, 1.14s/it] 100%|█████| 572/572 [10:53<00:00, 1.14s/it] 100%|█████| 572/572 [10:55<00:00, 1.15s/it] 100%|█████| 572/572 [11:00<00:00, 1.15s/it] completed_files: 13157 incomplete_files: 0 All evaluation completed! Yes count: 9410 No count: 3747 Accuracy: 0.715209 Average score: 3.839325

Total Score Yes/No distribution: yes: 0: 0 1: 0 2: 0 3: 7 4: 2337 5: 7066 no: 0: 914 1: 18 2: 2666 3: 133 4: 14 5: 2

Answer Type Score distribution: Type, Accuracy, Avg_score total, 0.715209, 3.839325

acc, score, total 0.715209, 3.839325, 0.715209

I have also provided my evaluation log files, it would be appreciated if you could help me find whether there are something wrong. results.json

Thanks a lot!

Einstone-rose commented 3 months ago

I have the same question. As for MSVD-QA dataset, the test results vary in 0.70~0.715 when I use the same data (llava_558k_with_webvid.json for pretraining, and llava_v1_5_mix665k_with_video_chatgpt_modify_fmt.json for instruction tuning) mentioned in LLAMA-VID, however I get the lower results than the reported results in paper, around 10 points. I think author should provide the details about training configuration, and dataset used for training.

jsunny0612 commented 2 months ago

I have the same question. As for MSVD-QA dataset, the test results vary in 0.70~0.715 when I use the same data (llava_558k_with_webvid.json for pretraining, and llava_v1_5_mix665k_with_video_chatgpt_modify_fmt.json for instruction tuning) mentioned in LLAMA-VID, however I get the lower results than the reported results in paper, around 10 points. I think author should provide the details about training configuration, and dataset used for training.

I have a question for you. You mentioned using llava_558k_with_webvid.json for pretraining and llava_v1_5_mix665k_with_video_chatgpt_modify_fmt.json for instruction tuning, which seems to correspond to the --data_path in bash scripts/train_and_eval.sh for pretrain and finetune.

Could you please let me know how you specified the image_folder and video_folder during this process? Additionally, could you also let me know where you obtained the MSVD 13k dataset?

Thank you for your response regarding the above.

zhang9302002 commented 1 month ago

Thanks for your attention.

We think this is due to the discrepancy between GPT-3.5-turbo api versions. We used gpt-3.5-turbo-0301 for evaluation, but it is deprecated. We noticed that GPT-3.5-based evaluation result will fluctuate over time and between different api versions, also reported at https://github.com/mbzuai-oryx/Video-ChatGPT/issues/28#issuecomment-1651426975.

For fair comparison, we suggest use the same api version at the same time to test different models, or use a MCQ benchmark instead of open-ended benchmark for Video Question Answer task.

xinyuehuo commented 1 month ago

So, would you like to provide some results on a MCQ benchmark like videomme so that we can align the reproduced model?

IVGSZ / Flash-VStream

Reproduction of results on MSVD and MSRVTT #3