SushantGautam opened 6 days ago
Great issue! Sorry for overlooking Qwen2-VL. Qwen2-VL is strong, so there is no reason not to compare against it. We plan to release a model trained on Qwen2-VL in 2-3 weeks and will update our paper by then.
However, if you want the performance of Qwen2-VL-7B on our reasoning benchmark now, here it is: 65.85. Our LLaVA-o1 (Llama-3.2-11B-Vision) scores 65.8, so yes, it is slightly worse than Qwen2-VL-7B. However, we must point out that Qwen2-VL-7B was trained on a huge amount of data (at least millions of samples, I believe), while we used only 100k. The most important point is that our model shows significant improvements over Llama-3.2-Vision-Instruct, because one can always switch to a stronger base model (like Qwen2-VL) and scale up the dataset.
To prove this, we will publish a model trained on Qwen2-VL soon. Stay tuned!
Great job! I would like to ask how the LLaVA-o1 model and the o1-style model based on Qwen2-VL perform on Chinese language-image reasoning / understanding tasks? Thanks in advance!
+1 for comparison with Qwen2-VL
Thanks for your interest! @zhangfaen We're making progress on this. We'll release the comparison within a week.