PKU-YuanGroup / LLaVA-CoT

LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
Apache License 2.0

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

This project was previously named LLaVA-o1. We have renamed it to LLaVA-CoT.

If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
[![Demo](https://img.shields.io/badge/⚡-Hugging%20Face%20Demo-yellow.svg)](https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot) [![hf](https://img.shields.io/badge/🤗-Hugging%20Face-blue.svg)](https://huggingface.co/Xkev/Llama-3.2V-11B-cot) [![arXiv](https://img.shields.io/badge/Arxiv-2411.10440-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2411.10440) [![X (formerly Twitter) URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Fx.com%2FKevin_GuoweiXu%2Fstatus%2F1858338565463421244)](https://x.com/Kevin_GuoweiXu/status/1858338565463421244) [![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/PKU-YuanGroup/LLaVA-CoT/blob/main/LICENSE) [![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FPKU-YuanGroup%2FLLaVA-CoT&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FPKU-YuanGroup%2FLLaVA-CoT&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)
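This project aims to build open-source slow-thinking large models and was jointly initiated by the Peking University-Tuzhan (兔展) AIGC Joint Lab.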

📣 News

🔥 Highlights

LLaVA-CoT is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.

🚀 Demos

LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.

Reasoning Problems

Question
Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6
Llama-3.2-11B-Vision-Instruct: To determine the correct answer, we need to follow the instructions provided.
The first instruction is to subtract all tiny shiny balls.
Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action.
The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects.
Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction.
LLaVA-CoT:
Summary stage: What's the problem? What should I do?
<SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B </CONCLUSION>
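
Because each stage is wrapped in its own tag pair, the staged output shown above can be split apart with a few lines of Python. The following is only a minimal sketch, not part of the released code: the helper name `split_stages` and the shortened `example` string are hypothetical, and it assumes every tag is well-formed and appears at most once.

```python
import re

# Stage tags as they appear in the demo output above.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def split_stages(output: str) -> dict[str, str]:
    """Return the text found inside each <STAGE>...</STAGE> pair.

    Hypothetical helper: stages that are missing from `output`
    are simply left out of the returned dictionary.
    """
    stages = {}
    for name in STAGES:
        # DOTALL lets a stage body span several lines.
        match = re.search(rf"<{name}>(.*?)</{name}>", output, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


# Shortened, illustrative version of the response above (not real model output).
example = (
    "<SUMMARY> I will count the objects and subtract. </SUMMARY>\n"
    "<CAPTION> The image shows 3D geometric shapes. </CAPTION>\n"
    "<REASONING> 10 objects minus 2 leaves 8. </REASONING>\n"
    "<CONCLUSION> B </CONCLUSION>"
)

stages = split_stages(example)
print(stages["CONCLUSION"])  # prints "B"
```

Keeping only `stages["CONCLUSION"]` is one way to compare the final answer against the multiple-choice options while ignoring the intermediate stages.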

Science Problems

Question
An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force \( P = 2 \, \text{kN} \), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis.
Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa
Llama-3.2-11B-Vision-Instruct: To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis:
\( G = \frac{F}{A \gamma} \)
where \( G \) is the shear modulus, \( F \) is the applied force, \( A \) is the area of the block, and \( \gamma \) is the shear strain.
Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain \( \gamma \). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain \( \gamma \) is:
\[ \gamma = \frac{\Delta x}{L} = \frac{2 \, \text{mm}}{400 \, \text{mm}} = 0.005 \]
Next, we calculate the area \( A \) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area \( A \) is:
\[ A = 100 \, \text{mm} \times 200 \, \text{mm} = 20000 \, \text{mm}^2 \]
Now, we can substitute these values into the formula for shear modulus:
\[ G = \frac{F}{A \gamma} = \frac{2000 \, \text{N}}{20000 \, \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 \, \text{MPa} \]
Therefore, the correct answer is C: 20 MPa.
LLaVA-CoT:
Summary stage: What's the problem? What should I do?
<SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY>
Caption stage: What can I know from the image?
<CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force \( P = 2 \, \text{kN} \) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION>
Reasoning stage: How to solve the problem step-by-step?
<REASONING> The shear modulus \( G \) of the polymer can be calculated using the formula:
\[ G = \frac{F}{A \cdot \frac{\Delta x}{L}} \]
where:
- \( F \) is the applied force (2 kN),
- \( A \) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²),
- \( \Delta x \) is the horizontal displacement (2 mm),
- \( L \) is the height of the block (200 mm).
First, convert the force to Newtons: \( F = 2 \, \text{kN} = 2000 \, \text{N} \).
Next, calculate the shear modulus:
\[ G = \frac{2000}{40,000 \cdot \frac{2}{200}} \]
\[ G = \frac{2000}{40,000 \cdot 0.01} \]
\[ G = \frac{2000}{400} \]
\[ G = 5 \, \text{MPa} \]
</REASONING>
Conclusion stage: What is the final answer?
<CONCLUSION> B: 5 MPa </CONCLUSION>
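
The two responses above differ only in the numbers substituted into the same formula, G = F / (A · Δx / L), so the arithmetic is easy to check. The sketch below is an illustration rather than project code: the function name `shear_modulus_mpa` is hypothetical, and it relies on the fact that 1 N/mm² equals 1 MPa.

```python
def shear_modulus_mpa(force_n: float, area_mm2: float,
                      displacement_mm: float, height_mm: float) -> float:
    """Shear modulus G = F / (A * dx / L), in MPa (1 N/mm^2 == 1 MPa).

    Hypothetical helper for checking the worked example above.
    """
    shear_strain = displacement_mm / height_mm  # small-angle approximation
    return force_n / (area_mm2 * shear_strain)


# Values used by LLaVA-CoT: sheared area = width x length of the top surface,
# strain measured over the block height.
g_cot = shear_modulus_mpa(2000.0, 100.0 * 400.0, 2.0, 200.0)

# Values used by Llama-3.2-11B-Vision-Instruct: area = width x height,
# strain measured over the block length.
g_baseline = shear_modulus_mpa(2000.0, 100.0 * 200.0, 2.0, 400.0)

print(f"LLaVA-CoT values: G = {g_cot:.1f} MPa")      # G = 5.0 MPa
print(f"Baseline values:  G = {g_baseline:.1f} MPa")  # G = 20.0 MPa
```

The baseline uses the side face as the sheared area and the block length as the reference dimension for the strain, which is why it arrives at 20 MPa instead of 5 MPa.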

🛠️ Installation

Stay tuned! Our code, dataset, and pretrained weights are coming soon.

📝 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@misc{xu2024llavao1letvisionlanguage,
      title={LLaVA-o1: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Li Hao and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}

🙏 Acknowledgement