LLaVA-VL / LLaVA-NeXT


Different Reported Results on NExT-QA and EgoSchema #110

jongwoopark7978 commented 1 month ago

Hi Team,

I saw that LLaVA-NeXT-Video-32B-Qwen obtains 77.31% accuracy on NExT-QA and 63% on EgoSchema here: https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-32B-Qwen.

On the other hand, LLaVA-NeXT-Video-DPO (34B) achieves 27.30% accuracy on the NExT-QA dataset.

Why does the accuracy differ so much? Did LLaVA-NeXT-Video-32B-Qwen use a separate LLM to answer the questions, while LLaVA-NeXT-Video-DPO (34B) answered them within the VLM itself?

Thank you in advance for your answer.

ZhangYuanhan-AI commented 1 month ago

Hi, thanks for your interest!

The NExT-QA result reported for LLaVA-NeXT-Video-DPO (34B) is under the open-ended (OE) setting, while the result for LLaVA-NeXT-Video-32B-Qwen is under the multiple-choice (MC) setting.
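
To make the gap concrete, here is a minimal sketch of how the two settings are typically prompted; the helper names and exact prompt wording are assumptions for illustration, not the repository's actual evaluation code:

```python
# Minimal sketch of the two NExT-QA evaluation settings.
# Hypothetical helper names and prompt wording, not the repo's eval code.

def build_mc_prompt(question: str, options: list[str]) -> str:
    # Multiple-choice (MC): the candidate answers are shown, and the model
    # only has to pick a letter, so scoring is exact match on the letter.
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"Question: {question}\n{lettered}\n"
        "Answer with the option's letter from the given choices directly."
    )

def build_oe_prompt(question: str) -> str:
    # Open-ended (OE): no candidates are shown; the free-form answer is then
    # matched or judged against the ground truth, a much stricter criterion.
    return f"Question: {question}\nAnswer the question using a short phrase."
```

Because MC only asks the model to pick among given options while OE requires generating the answer from scratch, the same benchmark can yield very different numbers under the two settings.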

jongwoopark7978 commented 1 month ago

I see. Thanks for the clarification. Are both of them zero-shot settings?

Also, is the EgoSchema accuracy for the subset (500 videos) or the full set (5,031 videos)?

ZhangYuanhan-AI commented 1 month ago

> I see. Thanks for the clarification. Are both of them zero-shot settings?
>
> Also, is the EgoSchema accuracy for the subset (500 videos) or the full set (5,031 videos)?

For NExT-QA, it is not zero-shot, as we include around 1,000 QA pairs in training. But for EgoSchema, it is zero-shot, and we use the full set (5,031 videos).
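
If it helps, here is a quick sketch of scoring both EgoSchema splits; the file names and JSON fields below are assumptions for illustration, not files shipped with this repo:

```python
# Hypothetical sketch of scoring EgoSchema on the 500-video subset vs. the
# 5,031-video full set; file names and JSON fields are assumptions.
import json

def accuracy(preds: dict, labels: dict) -> float:
    # Fraction of questions whose predicted choice matches the label.
    hits = sum(preds.get(qid) == ans for qid, ans in labels.items())
    return hits / len(labels)

preds = json.load(open("egoschema_predictions.json"))           # {question_id: choice}
full_labels = json.load(open("egoschema_fullset_labels.json"))  # 5,031 entries
subset_ids = set(json.load(open("egoschema_subset_ids.json")))  # 500 ids

subset_labels = {qid: a for qid, a in full_labels.items() if qid in subset_ids}

print(f"full set: {accuracy(preds, full_labels):.2%}")   # the setting reported above
print(f"subset  : {accuracy(preds, subset_labels):.2%}")
```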