
Meteor: Mamba-based traversal of rationale for Large Language and Vision Models [ArXiv]

📰 News


Official PyTorch implementation of the technical part of Mamba-based traversal of rationale (Meteor), which improves numerous vision-language performances with an efficient model size. This code is developed from scratch, so I have been trying to improve its readability and simplicity compared with LLaVA, whose code is structured relatively complexly.

The contributions of Meteor can be summarized as follows.

💡 Highlights

Open-source LLVMs with Standard Model Size

| LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|---|
| Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
| LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
| MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
| Meteor-7B | 88.3 | 88.7 | 2229 | 82.9 | 53.4 | 75.0 | 57.3 | 87.1 |

Open-source LLVMs with Large Model Sizes

| LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
|---|---|---|---|---|---|---|---|
| InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
| InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
| MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
| MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
| MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
| LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
| LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
| LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
| LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
| Meteor-7B | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 57.3 | 87.1 |

Closed-source LLVMs

| LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
| Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
| GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
| Meteor-7B | 88.3 | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 75.0 | 52.8 |

😎 How to run the demo?

Run the following commands in order.

bash install
pip install -r requirements.txt

Then run the demo (enjoy Meteor).

python demo.py

(Optional) If you want to build a 📻 Gradio demo yourself, run the following file or adapt it to your own style; a minimal skeleton is sketched after the command.

python app.py
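
For reference, here is a minimal, self-contained Gradio skeleton. The `answer` stub below is a hypothetical placeholder for Meteor's actual generation call; see `demo.py` and `app.py` for the real entry points.

```python
# Minimal Gradio skeleton (sketch only, not the actual app.py).
# Replace the body of answer() with Meteor's real generation call.
import gradio as gr
from PIL import Image

def answer(image: Image.Image, question: str) -> str:
    # Placeholder: plug in Meteor's rationale-traversal generation here.
    return f"[Meteor would answer here] Q: {question}"

demo = gr.Interface(
    fn=answer,
    inputs=[gr.Image(type="pil", label="Image"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="Meteor Demo",
)

if __name__ == "__main__":
    demo.launch()
```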

(Optional) If you want to browse the curated question-rationale-answer triples, step through (debug) the following file; a rough sketch of inspecting a triple is shown after the command.

python check_dataset.py
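
As a rough idea of what browsing a triple looks like, here is a hedged sketch: the file name and the field keys (`question`, `rationale`, `answer`) are assumptions, and the actual loading logic and schema live in `check_dataset.py`.

```python
# Sketch only: the curated file name and field keys below are assumptions;
# see check_dataset.py for the actual loading logic and schema.
import json

with open("curated_meteor_dataset.json", "r") as f:  # hypothetical path
    samples = json.load(f)

sample = samples[0]
print("Question :", sample.get("question"))
print("Rationale:", sample.get("rationale"))
print("Answer   :", sample.get("answer"))
```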

(Optional) If you want to conduct the vision-language evaluation, run the following script.

bash run

📋 Gathered & Curated Dataset Description

Gathered Total: 2130830, 2.1M

------------------------------
* Real-World Image: 755k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
    - Math with Vision: 180k
    - Math with Text only: 566k
------------------------------

- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)

Curated Total: 1059382, 1.1M

--------------------------------------------
* Real-World Image: 338K
* Document & Chart & Diagram & Sign & Symbol: 379K
* Math: 342K
    - Math with Vision: 165K
    - Math with Text only: 177K
--------------------------------------------

- ShareGPT4V-Caption (72507, 73K)
- ShareGPT4V-Instruction (266072, 266K)
- MiniGemini-Instruction (26885, 27K)
- DocDownstream (298748, 299K)
- DocReason (53065, 53K)
- GLLaVA (162378, 162K)
- MathVision (2992, 3K)
- MathInstruct (81496, 81K)
- MathPlus (95239, 95K)
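
As a quick sanity check, the per-dataset counts listed above sum exactly to the stated gathered and curated totals:

```python
# Sanity check: per-dataset counts from the lists above sum to the stated totals.
gathered = {
    "ShareGPT4V-Caption": 91021, "ShareGPT4V-Instruction": 664703,
    "MiniGemini-Instruction": 27670, "DocDownstream": 574268,
    "DocReason": 25877, "GLLaVA-Align": 60252, "GLLaVA-QA": 117205,
    "MathVision": 3040, "MathInstruct": 262040, "MathPlus": 304754,
}
curated = {
    "ShareGPT4V-Caption": 72507, "ShareGPT4V-Instruction": 266072,
    "MiniGemini-Instruction": 26885, "DocDownstream": 298748,
    "DocReason": 53065, "GLLaVA": 162378, "MathVision": 2992,
    "MathInstruct": 81496, "MathPlus": 95239,
}
assert sum(gathered.values()) == 2130830   # Gathered Total: 2.1M
assert sum(curated.values()) == 1059382    # Curated Total: 1.1M
print("Totals check out.")
```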

🚀 Download Training Datasets

We collect the following eight datasets. For MiniGemini, we selectively use data samples only from DocVQA, ChartQA, DVQA, and AI2D, so there is no need to download all of the MiniGemini data samples.

Gathered Dataset Layout

Meteor_Dataset_Path
├── llava                                                       # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                                                        # ShareGPT4V
│   └── train2017
├── sam                                                         # ShareGPT4V
│   └── images
├── gqa                                                         # ShareGPT4V
│   └── images
├── ocr_vqa                                                     # ShareGPT4V
│   └── images
├── textvqa                                                     # ShareGPT4V
│   └── train_images
├── vg                                                          # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa                                               # ShareGPT4V
│   └── images
├── web-celebrity                                               # ShareGPT4V
│   └── images
├── web-landmark                                                # ShareGPT4V
│   └── images
├── wikiart                                                     # ShareGPT4V
│   └── images
├── docvqa                                                      # MiniGemini
│   └── images
├── chartqa                                                     # MiniGemini
│   └── train
│       └── images
├── dvqa                                                        # MiniGemini
│   └── images
├── ai2d                                                        # MiniGemini
│   └── images
├── imgs                                                        # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k                                                       # GLLaVA
│   └── train
├── geoqa_plus                                                  # GLLaVA
├── images                                                      # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
├── train.jsonl                                                 # DocDownstream
├── detailed_explanation.jsonl                                  # DocReason
├── minigemini_instruction.json                                 # MiniGemini-Instruction
├── gllava_align.parquet                                        # GLLaVA-Align
├── gllava_qa.parquet                                           # GLLaVA-QA
├── mathvision.parquet                                          # MathVision
├── MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
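
Once everything is downloaded, a quick check like the following can confirm that the layout matches the tree above. This is only a convenience sketch, not part of the official codebase; `Meteor_Dataset_Path` is a placeholder for your actual dataset root.

```python
# Sanity check that the gathered-dataset layout above is in place.
# "Meteor_Dataset_Path" is a placeholder; point it at your own dataset root.
import os

DATASET_ROOT = "Meteor_Dataset_Path"
expected = [
    "llava/llava_pretrain/images", "coco/train2017", "sam/images", "gqa/images",
    "ocr_vqa/images", "textvqa/train_images", "vg/VG_100K", "vg/VG_100K_2",
    "share_textvqa/images", "web-celebrity/images", "web-landmark/images",
    "wikiart/images", "docvqa/images", "chartqa/train/images", "dvqa/images",
    "ai2d/images", "imgs", "geo3k/train", "geoqa_plus", "images",
    "sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
    "train.jsonl", "detailed_explanation.jsonl", "minigemini_instruction.json",
    "gllava_align.parquet", "gllava_qa.parquet", "mathvision.parquet",
    "MathInstruct.json", "mathplus.parquet",
]
missing = [p for p in expected if not os.path.exists(os.path.join(DATASET_ROOT, p))]
print("All paths found." if not missing else f"Missing: {missing}")
```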

📂 Evaluation Benchmarks

This is the list of evaluation datasets. Once you have downloaded all of them, place them in a folder following the directory layout below.

Evaluation Dataset Directory Layout

Evaluation_Dataset_Path
├── LLVisionQA-QBench               # Q-Bench
├── ScienceQA                       # SQA-IMG
├── ai2d                            # AI2D
├── chartqa                         # ChartQA
├── SEED-Bench                      # SEED-IMG
├── POPE                            # POPE
├── HallusionBench                  # HallusionBench
├── MME_Benchmark_release_version   # MME
├── MathVista                       # MathVista
├── MMBench                         # MMB
├── mm-vet                          # MM-Vet
├── llava-bench-in-the-wild         # LLaVA Bench in the Wild
├── MMStar                          # MMStar
└── MathVerse                       # MathVerse