EvolvingLMMs-Lab / LongVA

Long Context Transfer from Language to Vision
Apache License 2.0

The Potential Reason for LLaVA-NeXT-Qwen2's Strong Performance #6

Open waxnkw opened 4 months ago

waxnkw commented 4 months ago

Great work! I notice that LLaVA-NeXT-Qwen2 (an image model) achieves a surprisingly strong 49.5 on Video-MME. In contrast, LLaVA-NeXT-Video (Llama3) only achieves a 30+ Video-MME score (according to the reproduction in https://arxiv.org/pdf/2406.07476). LLaVA-NeXT-Video (Llama3) also follows the standard LLaVA recipe, and even uses more video data. I am curious what the key factor behind LLaVA-NeXT-Qwen2's strong performance compared with LLaVA-NeXT-Video (Llama3) is. Does the improvement come mainly from the Qwen2 LLM?

jzhang38 commented 4 months ago

> In contrast, LLaVA-NeXT-Video (Llama3) only achieves a 30+ Video-MME score

LLaVA-NeXT-Video-7B (https://huggingface.co/lmms-lab/LLaVA-NeXT-Video-7B) is based on Vicuna, not Llama3. In our test, it scores 40+ on Video-MME; see Table 2 of our blog: https://lmms-lab.github.io/posts/lmms-eval-0.2/
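For reference, below is a minimal sketch of how such a Video-MME run could be launched through lmms-eval. The model type (`llava_vid`), the checkpoint id, the task name (`videomme`), and the number of processes are assumptions based on common lmms-eval usage, not the exact command behind the blog numbers, and may need to be adjusted for your lmms-eval version.

```python
import subprocess

# Sketch only: launch a Video-MME evaluation of LLaVA-NeXT-Video-7B via lmms-eval.
# "llava_vid", the pretrained checkpoint id, and the "videomme" task name are
# assumptions and may differ in your installed lmms-eval version.
cmd = [
    "accelerate", "launch", "--num_processes=8", "-m", "lmms_eval",
    "--model", "llava_vid",
    "--model_args", "pretrained=lmms-lab/LLaVA-NeXT-Video-7B",
    "--tasks", "videomme",
    "--batch_size", "1",
    "--log_samples",
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)  # raises CalledProcessError if the run fails
```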

waxnkw commented 4 months ago

Thanks so much for the response, and sorry for my mistake. I see that the result is 41.98 in Table 2. Great result!

BTW, are there any insights into the improvement from 41.98 (LLaVA-NeXT-Video) to 49.5 (LLaVA-NeXT-Qwen2)?

jzhang38 commented 4 months ago

LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/

Hi, I believe this blog post is a good read on how a stronger base LM enables stronger multimodal capabilities. I believe Qwen2 is just significantly better than Vicuna-1.5 (Llama2).