PKU-YuanGroup / LLaVA-CoT

LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
Apache License 2.0

Compare with Qwen2-VL #2

Open SushantGautam opened 6 days ago

XuGW-Kevin commented 6 days ago

Great issue! Sorry for overlooking Qwen2-VL. Qwen2-VL is strong; we have no reason not to compare against it. We're going to release a model trained on Qwen2-VL in 2-3 weeks and update our paper by then.

However, if you really want to know the performance of Qwen2-VL-7B on our reasoning benchmark now, I can give you the answer: 65.85. Our LLaVA-o1 (Llama-3.2-11B-Vision) scores 65.8, so yes, it is slightly worse than Qwen2-VL-7B. However, we must point out that Qwen2-VL-7B was trained on a huge amount of data (at least millions of samples, I believe), while we used only 100k. The most important point is that our model shows significant improvements over Llama-3.2-Vision-Instruct, because one can always switch to a stronger base model (like Qwen2-VL) and scale up the dataset.

To prove this, we will publish a model trained on Qwen2-VL soon. Stay tuned!
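
For anyone who wants to sanity-check Qwen2-VL-7B on their own reasoning questions in the meantime, below is a minimal sketch of querying it through Hugging Face transformers, following the public Qwen2-VL quickstart. This is not our evaluation harness; the image path and question are placeholders, and scoring against the full benchmark is omitted.

```python
# Minimal sketch: one image/question round trip with Qwen2-VL-7B-Instruct.
# Requires: pip install transformers qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # vision helper shipped with Qwen2-VL

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},  # placeholder image path
        {"type": "text", "text": "Answer the question step by step: ..."},  # placeholder question
    ],
}]

# Build model inputs from the chat template plus the extracted image tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then trim the prompt tokens so only the answer is decoded.
output_ids = model.generate(**inputs, max_new_tokens=512)
answer_ids = output_ids[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```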

Elenore1997 commented 6 days ago

Great job! I would like to ask how the LLaVA-o1 model and the o1-style model based on Qwen2-VL perform on Chinese image-language reasoning and understanding tasks? Thanks in advance!

zhangfaen commented 7 hours ago

+1 for a comparison with Qwen2-VL

XuGW-Kevin commented 7 hours ago

> +1 for a comparison with Qwen2-VL

Thanks for your interest, @zhangfaen! We're making progress on this and will release the comparison within a week.