Closed: Luckydog-lhy closed this issue 11 months ago.
The following issues may be relevant; perhaps you can refer to them: issue 36, issue 34, and issue 28 from Video-ChatGPT.
We used the same data as in the repo and the officially provided training weights, and evaluated with GPT-3.5, but only achieved an accuracy/score of 47.9/3.1 (vs. 70.0) on the TGIF-QA task. BTW, on the other three QA tasks we were able to obtain metrics similar to those in the paper. What could be the cause of this?
Have you succeeded? I met the same problem. I've checked the other issues but the problem remains. The results are very close on the other datasets, but significantly lower on the TGIF-QA dataset (48.62/3.134).
@LinB203 Could you provide a TGIF-QA results JSON to help us locate the problem?
Besides, we've also found that different gpt-3.5 versions yield significantly different results: gpt-3.5-turbo-1106 performs like a human evaluator, but gpt-3.5-turbo-0613 and gpt-3.5-turbo-0301 regard many wrong answers as correct ones (resulting in ~8% higher accuracy).
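(Not from the repo's evaluation scripts; just a minimal sketch for anyone who wants to compare judge versions directly. The prompt is a simplified stand-in for the actual evaluation prompt, and the file/variable names are placeholders.)

```python
# Minimal sketch: pin the judge model explicitly so evaluation runs are comparable.
# The prompt below is a simplified stand-in for the repo's actual evaluation prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_MODEL = "gpt-3.5-turbo-1106"  # swap in "gpt-3.5-turbo-0613" / "-0301" to reproduce the gap

def judge(question: str, correct_answer: str, predicted_answer: str) -> str:
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": "You judge whether a predicted answer matches the correct answer. Reply with 'yes' or 'no' and a score from 0 to 5."},
            {"role": "user", "content": f"Question: {question}\nCorrect answer: {correct_answer}\nPredicted answer: {predicted_answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge("what is the animal doing", "running", "the dog is sleeping"))
```

Pinning a dated snapshot such as gpt-3.5-turbo-1106, rather than the floating gpt-3.5-turbo alias, keeps the judge fixed across runs so the ~8% gap can be attributed to the model version rather than silent upgrades.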
Sure, hope this is helpful. https://drive.google.com/file/d/1ArNLI3jF3u6QS1XujiP3O_Q0kEmHjvm_/view?usp=sharing
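(A small sketch for re-checking the shared file. It assumes a Video-ChatGPT-style results layout where each sample carries a yes/no "pred" and a numeric "score"; the file name and key names are assumptions, so adjust them to the actual JSON.)

```python
# Sketch: recompute accuracy / average score from a results JSON.
# Assumes Video-ChatGPT-style entries holding a "pred" (yes/no) and a numeric "score";
# adjust the key names / nesting to whatever the downloaded file actually contains.
import json

def iter_entries(obj):
    """Yield leaf dicts whether the file is a list of dicts or a dict of lists/dicts."""
    if isinstance(obj, dict) and "pred" in obj:
        yield obj
    elif isinstance(obj, dict):
        for v in obj.values():
            yield from iter_entries(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from iter_entries(v)

with open("tgif_qa_results.json") as f:  # path to the file from the Google Drive link
    data = json.load(f)

entries = list(iter_entries(data))
correct = sum(1 for e in entries if str(e["pred"]).lower().startswith("yes"))
avg_score = sum(float(e["score"]) for e in entries) / len(entries)

print(f"samples: {len(entries)}")
print(f"accuracy: {correct / len(entries):.4f}")
print(f"average score: {avg_score:.3f}")
```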
I found a serious issue here.
You seem to evaluate on only a subset of the TGIF-QA test set (specifically, FrameQA). The TGIF-QA data you provided in the Baidu drive has 25751 annotations (the same as the TGIF official repo), but the result file in your Google Drive has only 13691 samples, and 13691 is the exact number of FrameQA test samples.
I did not notice any explanation of this in your paper or code. Is there anything I have missed?
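(One quick way to check the counts. The file names, the tab-separated layout, and the expected totals of 25751/13691 come from the discussion above and the official TGIF-QA repo; treat the paths as placeholders for your local copies.)

```python
# Sketch: compare the official FrameQA test split size with the number of evaluated samples.
# File paths and the tab-separated layout are assumptions based on the official TGIF-QA repo;
# adjust to wherever the local annotation and result files live.
import json
import pandas as pd

# Official FrameQA test split (expected: 13691 rows; the full TGIF-QA test set has 25751).
frameqa_test = pd.read_csv("Test_frameqa_question.csv", sep="\t")
print("FrameQA test questions:", len(frameqa_test))

# Result file shared above (expected: 13691 samples if only FrameQA was evaluated).
with open("tgif_qa_results.json") as f:
    results = json.load(f)
print("evaluated samples:", len(results))
```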
We follow Video-ChatGPT, which uses Test_frameqa_question.
https://github.com/mbzuai-oryx/Video-ChatGPT/issues/65#issuecomment-1871411286
Thank you for your kindness!
Hello, could you please share your reproduced accuracy and score on MSRVTT with us? The dataset is so large that reproducing it is quite costly. Would love to know the result!