First, thank you for the amazing work.
I am using the checkpoint `LanguageBind/Video-LLaVA-7B` and have set `do_sample=False` and `temperature=0.0` in `run_inference_video_qa.py`. For inference on the MSVD-QA dataset, I used 4 × A6000 GPUs, and the process took about one hour. However, when I evaluated the prediction results with GPT-3.5 (default setting), I only achieved an accuracy of 36.27% and a score of 2.87, which is significantly lower than the results reported in the paper. Similarly, on the TGIF-QA dataset, I obtained an accuracy of only 19.6% and a score of 2.4. For other evaluation tasks, such as VQA (e.g., VQAv2, GQA), my results perfectly matched those reported in the paper.

Could the authors provide feedback on the evaluation for the video-QA task? Is there an alternative evaluation method to using GPT?
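For example, would a rough string-match check along the lines of the sketch below be a reasonable proxy? This is only a minimal sketch of what I have in mind: the prediction-file path and the `answer`/`pred` field names are my assumptions and may not match what `run_inference_video_qa.py` actually writes.

```python
# Rough GPT-free sanity check over a prediction file: count a prediction as
# correct if the ground-truth answer string appears in the model output.
# NOTE: the file path and the "answer"/"pred" keys are assumptions, not the
# repo's actual output format.
import json

def string_match_accuracy(pred_file: str) -> float:
    with open(pred_file, "r") as f:
        # Assumes one JSON object per line (JSONL).
        records = [json.loads(line) for line in f if line.strip()]

    correct = 0
    for r in records:
        answer = str(r["answer"]).strip().lower()
        pred = str(r["pred"]).strip().lower()
        if answer and answer in pred:
            correct += 1
    return correct / max(len(records), 1)

if __name__ == "__main__":
    # Hypothetical output file from the MSVD-QA inference run.
    print(f"string-match accuracy: {string_match_accuracy('msvd_qa_preds.jsonl'):.2%}")
```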