Official PyTorch implementation of Mamba-based traversal of rationale (Meteor), which improves performance on numerous vision-language benchmarks at an efficient model size. The code was developed from scratch, so I have tried to keep it more readable and simpler than LLaVA, whose code is relatively complex in structure.
The contributions of Meteor can be summarized as follows.
Open-source LLVMs with Standard Model Size
LLVMs | SQA-IMG | POPE | MME | MMB | MathVista | SEED-IMG | MM-Vet | LLaVA-W |
---|---|---|---|---|---|---|---|---|
Yi-VL-6B | 71.7 | 82.5 | 1915 | 64.2 | 29.7 | 67.5 | 32.1 | 51.9 |
LLaVA-NeXT-7B | 70.1 | 86.5 | 1851 | 69.6 | 34.6 | 70.2 | 43.9 | 72.3 |
MM1-7B | 72.6 | 86.6 | 1858 | 72.3 | 35.9 | 70.9 | 42.1 | - |
Meteor-7B | 88.3 | 88.7 | 2229 | 82.9 | 53.4 | 75.0 | 57.3 | 87.1 |
Open-source LLVMs with Large Model Sizes
LLVMs | AI2D | ChartQA | MME | MMB | MathVista | MM-Vet | LLaVA-W |
---|---|---|---|---|---|---|---|
InternVL1.5-40B | 79.0 | 68.0 | 2175 | 82.2 | 47.7 | 48.9 | - |
InternVL1.5-26B | 80.7 | 83.8 | 2188 | 82.2 | 53.5 | 62.8 | - |
MM1-30B | - | - | 2069 | 75.1 | 39.4 | 48.7 | - |
MiniGemini-34B | - | - | 2105 | 79.6 | 38.9 | 53.0 | - |
MiniGemini-HD-34B | - | - | 2141 | 80.6 | 43.3 | 59.3 | - |
LLaVA-NeXT-8B | 71.6 | 69.5 | 1972 | 72.1 | 37.5 | - | 80.1 |
LLaVA-NeXT-34B | 74.9 | 68.7 | 2030 | 79.3 | 46.0 | 57.4 | 88.8 |
LLaVA-NeXT-72B | 77.4 | 77.0 | 2159 | 80.5 | 46.6 | - | 89.2 |
LLaVA-NeXT-110B | 80.4 | 80.4 | 2201 | 80.5 | 49.0 | - | 90.4 |
Meteor-7B | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 57.3 | 87.1 |
Closed-source LLVMs
LLVMs | SQA-IMG | AI2D | ChartQA | MME | MMB | MathVista | SEED-IMG | MMStar |
---|---|---|---|---|---|---|---|---|
Qwen-VL-Plus | 71.6 | 75.9 | 78.1 | 2183 | 67.0 | 43.3 | 72.7 | 39.7 |
Gemini-Pro | 80.1 | 73.9 | 74.1 | 1933 | 73.6 | 45.2 | 70.7 | 41.6 |
GPT-4V | 84.6 | 78.2 | 78.5 | 1927 | 77.0 | 49.9 | 69.1 | 46.1 |
Meteor-7B | 88.3 | 77.9 | 74.9 | 2229 | 82.9 | 53.4 | 75.0 | 52.8 |
Run the following commands in order.
bash install
pip install -r requirements.txt
Then run the demo (enjoy Meteor).
python demo.py
(Optional) If you want to build your own Gradio demo, run the following file or adapt it to your needs.
python app.py
(Optional) If you want to inspect the curated question-rationale-answer triples, step through the following file in a debugger.
python check_dataset.py
(Optional) If you want to run the vision-language evaluation, run the following file.
bash run
Gathered Total: 2130830, 2.1M
------------------------------
* Real-World Image: 755k
* Document & Chart & Diagram & Sign & Symbol: 627k
* Math: 747k
- Math with Vision: 180k
- Math with Text only: 566k
------------------------------
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [Without few samples of OCR-VQA] (664703, 664k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)
Curated Total: 1059382, 1.1M
--------------------------------------------
Real-World Image: 338K
Document & Chart & Diagram & Sign & Symbol: 379K
Math: 342K
Math with Vision: 165K
Math with Text only: 177K
--------------------------------------------
- ShareGPT4V-Caption (72507, 73K)
- ShareGPT4V-Instruction (266072, 266K)
- MiniGemini-Instruction (26885, 27K)
- DocDownstream (298748, 299K)
- DocReason (53065, 53K)
- GLLaVA (162378, 162K)
- MathVision (2992, 3K)
- MathInstruct (81496, 81K)
- MathPlus (95239, 95K)
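As a quick sanity check on the statistics above, the per-source counts can be summed per category and compared with the reported totals. This is a minimal sketch: the category keys (`real_world`, `document`, `math_vision`, `math_text`) and the helper `summarize` are names chosen here for illustration and are not part of the repository.

```python
from collections import defaultdict

# (category, source, sample count), copied from the statistics above.
# The category keys are illustrative labels, not identifiers from the codebase.
GATHERED = [
    ("real_world", "ShareGPT4V-Caption", 91021),
    ("real_world", "ShareGPT4V-Instruction", 664703),
    ("document", "MiniGemini-Instruction", 27670),
    ("document", "DocDownstream", 574268),
    ("document", "DocReason", 25877),
    ("math_vision", "GLLaVA-Align", 60252),
    ("math_vision", "GLLaVA-QA", 117205),
    ("math_vision", "MathVision", 3040),
    ("math_text", "MathInstruct", 262040),
    ("math_text", "MathPlus", 304754),
]

CURATED = [
    ("real_world", "ShareGPT4V-Caption", 72507),
    ("real_world", "ShareGPT4V-Instruction", 266072),
    ("document", "MiniGemini-Instruction", 26885),
    ("document", "DocDownstream", 298748),
    ("document", "DocReason", 53065),
    ("math_vision", "GLLaVA", 162378),
    ("math_vision", "MathVision", 2992),
    ("math_text", "MathInstruct", 81496),
    ("math_text", "MathPlus", 95239),
]


def summarize(rows):
    """Return (per-category sums, grand total) for a list of count rows."""
    per_category = defaultdict(int)
    for category, _source, count in rows:
        per_category[category] += count
    return dict(per_category), sum(per_category.values())


gathered_by_cat, gathered_total = summarize(GATHERED)
curated_by_cat, curated_total = summarize(CURATED)

print(gathered_total)  # 2130830
print(curated_total)   # 1059382
print(round(curated_total / gathered_total, 3))  # fraction kept after curation
```

The sums reproduce both reported totals, and roughly half of the gathered samples remain after curation.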
We collect the following eight datasets. For MiniGemini, we use only the data samples for DocVQA, ChartQA, DVQA, and AI2D, so you do not need to download the full MiniGemini data.
Gathered Dataset Layout
Meteor_Dataset_Path
├── llava                                                       # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco                                                        # ShareGPT4V
│   └── train2017
├── sam                                                         # ShareGPT4V
│   └── images
├── gqa                                                         # ShareGPT4V
│   └── images
├── ocr_vqa                                                     # ShareGPT4V
│   └── images
├── textvqa                                                     # ShareGPT4V
│   └── train_images
├── vg                                                          # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa                                               # ShareGPT4V
│   └── images
├── web-celebrity                                               # ShareGPT4V
│   └── images
├── web-landmark                                                # ShareGPT4V
│   └── images
├── wikiart                                                     # ShareGPT4V
│   └── images
├── docvqa                                                      # MiniGemini
│   └── images
├── chartqa                                                     # MiniGemini
│   └── train
│       └── images
├── dvqa                                                        # MiniGemini
│   └── images
├── ai2d                                                        # MiniGemini
│   └── images
├── imgs                                                        # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k                                                       # GLLaVA
│   └── train
├── geoqa_plus                                                  # GLLaVA
├── images                                                      # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json                # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json  # ShareGPT4V-Instruction
├── train.jsonl                                                 # DocDownstream
├── detailed_explanation.jsonl                                  # DocReason
├── minigemini_instruction.json                                 # MiniGemini-Instruction
├── gllava_align.parquet                                        # GLLaVA-Align
├── gllava_qa.parquet                                           # GLLaVA-QA
├── mathvision.parquet                                          # MathVision
├── MathInstruct.json                                           # MathInstruct
└── mathplus.parquet                                            # MathPlus
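Before training, it can help to confirm that the download matches this layout. The sketch below checks only the top-level directories and annotation files named in the layout. The function `missing_entries` is a hypothetical helper written for this README and is not part of the repository.

```python
from pathlib import Path

# Top-level image folders named in the layout above.
EXPECTED_DIRS = [
    "llava", "coco", "sam", "gqa", "ocr_vqa", "textvqa", "vg",
    "share_textvqa", "web-celebrity", "web-landmark", "wikiart",
    "docvqa", "chartqa", "dvqa", "ai2d", "imgs",
    "geo3k", "geoqa_plus", "images",
]

# Annotation files named in the layout above.
EXPECTED_FILES = [
    "sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
    "train.jsonl",
    "detailed_explanation.jsonl",
    "minigemini_instruction.json",
    "gllava_align.parquet",
    "gllava_qa.parquet",
    "mathvision.parquet",
    "MathInstruct.json",
    "mathplus.parquet",
]


def missing_entries(root):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_entries("Meteor_Dataset_Path")` with your own dataset root returns an empty list when every top-level entry is in place. Nested folders such as `vg/VG_100K` are not checked here.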
These are the evaluation datasets. Once you have downloaded them, place them in a folder that follows the directory layout below.
Evaluation Dataset Directory Layout
Evaluation_Dataset_Path
├── LLVisionQA-QBench               # Q-Bench
├── ScienceQA                       # SQA-IMG
├── ai2d                            # AI2D
├── chartqa                         # ChartQA
├── SEED-Bench                      # SEED-IMG
├── POPE                            # POPE
├── HallusionBench                  # HallusionBench
├── MME_Benchmark_release_version   # MME
├── MathVista                       # MathVista
├── MMBench                         # MMB
├── mm-vet                          # MM-Vet
├── llava-bench-in-the-wild         # LLaVA Bench in the Wild
├── MMStar                          # MMStar
└── MathVerse                       # MathVerse
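If only some benchmarks have been downloaded, the folder names above can be mapped back to benchmark names to see which evaluations are ready to run. This is a minimal sketch: `available_benchmarks` is a hypothetical helper, not a function from the repository, and it only tests whether each folder exists.

```python
from pathlib import Path

# Benchmark name -> folder name, taken from the layout above.
BENCHMARK_DIRS = {
    "Q-Bench": "LLVisionQA-QBench",
    "SQA-IMG": "ScienceQA",
    "AI2D": "ai2d",
    "ChartQA": "chartqa",
    "SEED-IMG": "SEED-Bench",
    "POPE": "POPE",
    "HallusionBench": "HallusionBench",
    "MME": "MME_Benchmark_release_version",
    "MathVista": "MathVista",
    "MMB": "MMBench",
    "MM-Vet": "mm-vet",
    "LLaVA Bench in the Wild": "llava-bench-in-the-wild",
    "MMStar": "MMStar",
    "MathVerse": "MathVerse",
}


def available_benchmarks(root):
    """Return the sorted names of benchmarks whose folder exists under root."""
    root = Path(root)
    return sorted(
        name for name, folder in BENCHMARK_DIRS.items()
        if (root / folder).is_dir()
    )
```

For example, `available_benchmarks("Evaluation_Dataset_Path")` lists the benchmarks whose folders are present under your own evaluation root.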