PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Questions about the reproduction of TGIF-QA #37

Closed: Luckydog-lhy closed this issue 11 months ago

Luckydog-lhy commented 11 months ago

We used the same data as provided in the repo and the officially provided training weights, and also evaluated with GPT-3.5, but we only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. BTW, on the other three QA tasks we were able to obtain metrics similar to those in the paper. What could be the cause of this?

LinB203 commented 11 months ago

The following issues are relevant; perhaps you can refer to them: issue #36, issue #34, and issue #28 from Video-ChatGPT.

SCZwangxiao commented 9 months ago

Have you succeeded? I'm hitting the same problem. I've checked the other issues, but the problem remains: the results are very close on the other datasets, but significantly lower on TGIF-QA (48.62/3.134).

SCZwangxiao commented 9 months ago

The problem remains after checking the other issues: results on the other datasets are very close to the paper, but TGIF-QA is significantly lower (59.8/3.4).

@LinB203 Could you provide a TGIF-QA results JSON to help us locate the problem?

Besides, we've found that different GPT-3.5 versions yield significantly different results: gpt-3.5-turbo-1106 behaves like a human evaluator, but gpt-3.5-turbo-0613 and gpt-3.5-turbo-0301 mark many wrong answers as correct, resulting in roughly 8% higher accuracy.
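For anyone re-running the GPT-based evaluation, here is a minimal sketch of pinning the judge model so the numbers stay comparable across runs (the client usage, prompt wording, and function name below are illustrative assumptions, not the repo's actual eval script):

```python
# Minimal sketch (not the repo's eval script): pin the GPT judge version so that
# accuracy numbers stay comparable across runs. Prompt wording and function name
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-3.5-turbo-1106"  # the stricter judge observed in this thread


def judge(question: str, answer: str, prediction: str) -> str:
    """Ask the pinned GPT judge whether the prediction matches the ground truth."""
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,  # keep the judging as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You evaluate video QA predictions. Reply 'yes' if the "
                        "predicted answer is semantically correct, otherwise 'no'."},
            {"role": "user",
             "content": f"Question: {question}\n"
                        f"Correct answer: {answer}\n"
                        f"Predicted answer: {prediction}"},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```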

LinB203 commented 9 months ago

Sure, hope this is helpful: https://drive.google.com/file/d/1ArNLI3jF3u6QS1XujiP3O_Q0kEmHjvm_/view?usp=sharing

SCZwangxiao commented 9 months ago

I found a serious issue here.

You seem to evaluate on only a subset of the TGIF-QA test set (specifically, FrameQA). The TGIF-QA data you provided on the Baidu drive has 25751 annotations (the same as the official TGIF-QA repo), but the result file in your Google Drive has only 13691 samples, and 13691 is exactly the number of FrameQA test samples.

I did not notice any explanation in your paper or code. Is there anything that I have missed?
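A quick way to check which split a result file covers is to compare its length against the full annotation file. A minimal sketch (the file names below are placeholders for the Baidu-drive annotations and the Google Drive result JSON):

```python
# Minimal sketch: compare the number of QA pairs in the full TGIF-QA annotations
# with the number of entries in the released result file. File names are
# placeholders; point them at your local copies.
import json

with open("tgif_qa_test_annotations.json") as f:   # full test annotations
    annotations = json.load(f)
with open("video_llava_tgif_results.json") as f:   # result JSON from the Drive link
    results = json.load(f)

print(f"annotations: {len(annotations)}")  # 25751 for the full TGIF-QA test set
print(f"results:     {len(results)}")      # 13691 == size of the FrameQA test split
```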

LinB203 commented 9 months ago

We follow Video-ChatGPT, which uses Test_frameqa_question: https://github.com/mbzuai-oryx/Video-ChatGPT/issues/65#issuecomment-1871411286
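For reference, a minimal sketch of building the eval set from the FrameQA split via Test_frameqa_question, the same subset Video-ChatGPT evaluates on (the tab delimiter and column names are assumptions about the official TGIF-QA release; adjust them to your copy):

```python
# Minimal sketch: build the eval set from the FrameQA test split only, as
# Video-ChatGPT does. The tab delimiter and column names are assumptions about
# the official Test_frameqa_question.csv; adjust if your copy differs.
import pandas as pd

frameqa = pd.read_csv("Test_frameqa_question.csv", sep="\t")
print(len(frameqa))  # should be 13691, matching the released result file

eval_set = frameqa[["gif_name", "question", "answer"]].to_dict("records")
```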

SCZwangxiao commented 9 months ago

Thank you for your kindness!

Leo-Yuyang commented 6 months ago

> We used the same data as provided in the repo and the officially provided training weights ... but only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task.

Hello, could you please share your reproduced accuracy and score on MSRVTT with us? That dataset is quite large, so reproducing it is expensive. Would love to know the result!