Open jongwoopark7978 opened 1 month ago
Hi, thanks for your interest!
The NextQA result reported for LLaVA-NeXT-Video-DPO (34B) is from its OE (open-ended) setting, while the LLaVA-NeXT-Video-32B-Qwen result is from the MC (multiple-choice) setting.
I see. Thanks for the clarification. Are both of them zero-shot settings?
Also, is the EgoSchema accuracy for the subset (500 videos) or the full set (5,031 videos)?
For NextQA, it is not zero-shot, as we include around 1,000 QA pairs in the training data. But for EgoSchema, it is zero-shot, and we use the full set.
Hi Team,
I saw that LLaVA-NeXT-Video-32B-Qwen obtains 77.31% and 63% accuracy on NeXT-QA and EgoSchema here: https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-32B-Qwen.
On the other hand, LLaVA-NeXT-Video-DPO (34B) achieves 27.30% accuracy on the NeXT-QA dataset.
Why does the accuracy differ so much? Did LLaVA-NeXT-Video-32B-Qwen use a separate LLM to solve the question, while LLaVA-NeXT-Video-DPO (34B) answered the question within the VLM itself?
Thank you for your answer in advance.