This project was previously named LLaVA-o1. We have renamed it to LLaVA-CoT.
LLaVA-CoT is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!
Our 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.
LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.
Question | Subtract all tiny shiny balls. Subtract all purple objects. How many objects are left? Options: A. 4, B. 8, C. 2, D. 6 |
---|---|
Llama-3.2-11B-Vision-Instruct | To determine the correct answer, we need to follow the instructions provided. The first instruction is to subtract all tiny shiny balls. Since there are three tiny shiny balls in the image, we would be left with no shiny balls after performing this action. The second instruction is to subtract all purple objects. However, upon closer inspection, it becomes apparent that there are no shiny purple objects present in the image. Therefore, we cannot subtract any purple objects. Given that we have already removed all the tiny shiny balls and there were no purple objects to begin with, the correct answer is C. 2, as initially, there were 2 purple objects, but they are not to be subtracted according to the instruction. |
LLaVA-CoT | Summary stage: What's the problem? What should I do? <SUMMARY> I will examine the image to identify and count all the objects, focusing on the tiny shiny balls and purple objects. Then, I will subtract these objects from the total count and determine the number of remaining objects. </SUMMARY> Caption stage: What can I know from the image? <CAPTION> The image shows various 3D geometric shapes on a flat surface, including spheres, cubes, and cylinders in different colors. Notably, there are small shiny green spheres and a purple cylinder. </CAPTION> Reasoning stage: How to solve the problem step-by-step? <REASONING> First, I will count all the objects in the image: there are 10 objects in total. Next, I will identify and subtract the tiny shiny balls. There is one small shiny green sphere. Then, I will identify and subtract the purple objects. There is one purple cylinder. By subtracting the two objects (the shiny ball and the purple cylinder) from the total, I find that 8 objects remain. </REASONING> Conclusion stage: What is the final answer? <CONCLUSION> B </CONCLUSION> |
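
The LLaVA-CoT output above wraps each stage in a tag pair (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>`). As a rough illustration of how such output could be split into its stages, here is a minimal sketch. The function name `parse_stages` and the shortened example string are assumptions made for this illustration and are not part of the released code.

```python
import re

# Stage tags as they appear in the example output above.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "CONCLUSION")


def parse_stages(text: str) -> dict:
    """Map each stage name to the text found between its tag pair.

    Hypothetical helper for illustration only. Stages whose tags are
    missing from the text are simply left out of the result.
    """
    stages = {}
    for name in STAGES:
        match = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
        if match:
            stages[name] = match.group(1).strip()
    return stages


# Shortened stand-in for a model response (not real model output).
example = (
    "<SUMMARY> I will count the objects. </SUMMARY> "
    "<CAPTION> The image shows 3D shapes. </CAPTION> "
    "<REASONING> 10 objects minus 2 leaves 8. </REASONING> "
    "<CONCLUSION> B </CONCLUSION>"
)

parsed = parse_stages(example)
print(parsed["CONCLUSION"])  # prints "B"
```

Reading only the `CONCLUSION` entry is one way to compare a response against a multiple-choice answer key while keeping the other stages available for inspection.
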
Question | An acetal polymer block is fixed to the rigid plates at its top and bottom surfaces. If the top plate displaces 2 mm horizontally when it is subjected to a horizontal force \( P = 2 \, \text{kN} \), determine the shear modulus of the polymer. The width of the block is 100 mm. Assume that the polymer is linearly elastic and use small angle analysis. Options: A: 15 MPa, B: 5 MPa, C: 20 MPa, D: 2 MPa |
---|---|
Llama-3.2-11B-Vision-Instruct | To determine the shear modulus of the polymer, we can use the formula for shear modulus in a small angle analysis: \( G = \frac{F}{A \gamma} \) where \( G \) is the shear modulus, \( F \) is the applied force, \( A \) is the area of the block, and \( \gamma \) is the shear strain. Given that the top plate of the block displaces horizontally by 2 mm when subjected to a horizontal force of 2 kN, we can calculate the shear strain \( \gamma \). Since the displacement is horizontal and the original length of the block is 400 mm, the shear strain \( \gamma \) is: \[ \gamma = \frac{\Delta x}{L} = \frac{2 \, \text{mm}}{400 \, \text{mm}} = 0.005 \] Next, we calculate the area \( A \) of the block. The width of the block is given as 100 mm, and assuming the height of the block is also 200 mm (as given in the image analysis), the area \( A \) is: \[ A = 100 \, \text{mm} \times 200 \, \text{mm} = 20000 \, \text{mm}^2 \] Now, we can substitute these values into the formula for shear modulus: \[ G = \frac{F}{A \gamma} = \frac{2000 \, \text{N}}{20000 \, \text{mm}^2 \times 0.005} = \frac{2000}{100} = 20 \, \text{MPa} \] Therefore, the correct answer is C: 20 MPa. |
LLaVA-CoT | Summary stage: What's the problem? What should I do? <SUMMARY> To solve the problem, I will determine the shear modulus of the acetal polymer block using the relationship between the applied force, displacement, and dimensions of the block. I will use the formula for shear modulus and calculate the result. </SUMMARY> Caption stage: What can I know from the image? <CAPTION> The image shows a rectangular block of acetal polymer fixed between two rigid plates. The top plate moves horizontally by 2 mm when a force \( P = 2 \, \text{kN} \) is applied. The block's dimensions are given as 400 mm in length, 200 mm in height, and 100 mm in width. </CAPTION> Reasoning stage: How to solve the problem step-by-step? <REASONING> The shear modulus \( G \) of the polymer can be calculated using the formula: \[ G = \frac{F}{A \cdot \frac{\Delta x}{L}} \] where: - \( F \) is the applied force (2 kN), - \( A \) is the area of the block's top surface (100 mm x 400 mm = 40,000 mm²), - \( \Delta x \) is the horizontal displacement (2 mm), - \( L \) is the height of the block (200 mm). First, convert the force to Newtons: \( F = 2 \, \text{kN} = 2000 \, \text{N} \). Next, calculate the shear modulus: \[ G = \frac{2000}{40,000 \cdot \frac{2}{200}} \] \[ G = \frac{2000}{40,000 \cdot 0.01} \] \[ G = \frac{2000}{400} \] \[ G = 5 \, \text{MPa} \] </REASONING> Conclusion stage: What is the final answer? <CONCLUSION> B: 5 MPa </CONCLUSION> |
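
The arithmetic in the two responses above can be checked with a few lines of Python. This is only a sketch: the function name `shear_modulus_mpa` is hypothetical, and the block dimensions (400 mm long, 200 mm high, 100 mm wide) are taken from the LLaVA-CoT caption above rather than verified against the image. Under the small angle assumption the shear strain is the displacement divided by the block height, and the sheared area is the top surface (width times length).

```python
def shear_modulus_mpa(force_n, area_mm2, displacement_mm, height_mm):
    """Shear modulus G = tau / gamma under small angle analysis.

    tau   = force / sheared area      (N/mm^2, which equals MPa)
    gamma = displacement / height     (dimensionless)
    """
    shear_stress = force_n / area_mm2
    shear_strain = displacement_mm / height_mm
    return shear_stress / shear_strain


# Values as stated in the LLaVA-CoT response (assumed, not re-measured).
force_n = 2000.0          # 2 kN
width_mm = 100.0
length_mm = 400.0
height_mm = 200.0
displacement_mm = 2.0

# LLaVA-CoT: area is the top surface, strain is measured over the height.
g_cot = shear_modulus_mpa(force_n, width_mm * length_mm, displacement_mm, height_mm)

# Baseline response: length and height are swapped in both area and strain.
g_baseline = shear_modulus_mpa(force_n, width_mm * height_mm, displacement_mm, length_mm)

print(f"LLaVA-CoT setup: {g_cot:.1f} MPa")
print(f"Baseline setup:  {g_baseline:.1f} MPa")
```

With these inputs the first setup gives 5 MPa (option B) and the swapped setup gives 20 MPa (option C), which matches the two answers in the table.
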
Stay tuned! Our code, dataset, and pretrained weights are coming soon.
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@misc{xu2024llavao1letvisionlanguage,
title={LLaVA-o1: Let Vision Language Models Reason Step-by-Step},
author={Guowei Xu and Peng Jin and Li Hao and Yibing Song and Lichao Sun and Li Yuan},
year={2024},
eprint={2411.10440},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.10440},
}