Official PyTorch implementation of the technical part of Phantom of Latent, which improves performance on numerous vision-language benchmarks with an efficient model size. The code is developed from scratch, with the model architecture and all configurations inspired by InternVL. Compared with LLaVA, whose codebase is relatively complex in structure, this code aims for better readability and simplicity.
import torch
from config import *
from PIL import Image
from utils.utils import *
from model.load_model import load_model
from torchvision.transforms.functional import pil_to_tensor

# Model selection
size = '7b'  # [Select One] '0.5b' (more recent transformers) | '1.8b' | '3.8b' (transformers==4.37.2) | '7b'

# User prompt
prompt_type = "with_image"  # [Select One] "text_only" | "with_image"
img_path = 'figures/demo.png'
question = "Describe the image in detail"

# Loading model
model, tokenizer = load_model(size=size)

# Prompt type -> input prompt
if prompt_type == 'with_image':
    # Load image as a uint8 tensor of shape (C, H, W)
    image = pil_to_tensor(Image.open(img_path).convert("RGB"))
    inputs = [{'image': image, 'question': question}]
elif prompt_type == 'text_only':
    inputs = [{'question': question}]

# Move model parameters from CPU to GPU
for param in model.parameters():
    if not param.is_cuda:
        param.data = param.cuda()

# Generate
with torch.inference_mode():
    # Preprocess the inputs and run greedy decoding
    _inputs = model.eval_process(inputs=inputs,
                                 data='demo',
                                 tokenizer=tokenizer,
                                 device='cuda:0')
    generate_ids = model.generate(**_inputs, do_sample=False, max_new_tokens=256)
answer = tokenizer.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(answer)
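The comment on size above ties the smaller checkpoints to specific transformers releases ('1.8b' and '3.8b' to transformers==4.37.2, '0.5b' to a more recent release). A minimal sanity-check sketch, assuming the packaging library is installed; the check itself is not part of this repo and only mirrors the constraints stated in that comment:

from packaging import version
import transformers

# Hypothetical guard (not part of this repo): warn if the installed transformers
# release does not match the one expected for the selected model size.
installed = version.parse(transformers.__version__)
if size in ('1.8b', '3.8b') and installed != version.parse('4.37.2'):
    print(f"Warning: size={size} expects transformers==4.37.2, found {transformers.__version__}")
elif size == '0.5b' and installed <= version.parse('4.37.2'):
    print(f"Warning: size={size} expects a more recent transformers release, found {transformers.__version__}")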
Dataset Description (Total: 2852771, 2.8M)
------------------------------
* Real-World Image: 1218630, 1.2M
* Real-World Text: 143000, 143K
* Document & Chart & Diagram & Sign & Symbol: 743850, 744k
* Math: 747291, 747k
- Math with Vision: 180497, 180k
- Math with Text only: 566794, 566k
------------------------------
- ShareGPT4O-Images (57289, 57k)
- ShareGPT4V-Caption [without SAM] (91021, 91k)
- ShareGPT4V-Instruction [without a few samples of OCR-VQA] (664703, 664k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (405617, 405k)
- ALLAVA4V-Text (143000, 143k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (27670, 27k)
- SMR [ArXivQA, TextbookQA] (116035, 116k)
- DocDownstream (574268, 574k)
- DocReason (25877, 25k)
- GLLaVA-Align (60252, 60k)
- GLLaVA-QA (117205, 117k)
- MathVision (3040, 3k)
- MathInstruct [TextOnlyDataset] (262040, 262k)
- MathPlus [TextOnlyDataset] (304754, 304k)
Dataset Description (Total: 2040186, 2.0M)
--------------------------------------------
* Real-World Image: 871160, 871k
* Real-World Text: 102389, 102k
* Document & Chart & Diagram & Sign & Symbol: 529709, 529k
* Math: 536928, 536k
- Math with Vision: 129694, 129k
- Math with Text only: 407234, 407k
--------------------------------------------
- ShareGPT4O-Images (40106, 40k)
- ShareGPT4V-Caption [without SAM] (64925, 64k)
- ShareGPT4V-Instruction [without a few samples of OCR-VQA] (475669, 475k)
- ALLAVA4V-VFLAN based on MiniGemini-Pretrain/Instruct (290460, 290k)
- ALLAVA4V-Text (102389, 102k)
- MiniGemini-Instruction [DocVQA, ChartQA, DVQA, AI2D] (19363, 19k)
- SMR [ArXivQA, TextbookQA] (82843, 82k)
- DocDownstream (409140, 409k)
- DocReason (18363, 18k)
- GLLaVA (127484, 127k)
- MathVision (2210, 2k)
- MathInstruct [TextOnlyDataset] (188288, 188k)
- MathPlus [TextOnlyDataset] (218946, 218k)
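As a quick consistency check, the per-category counts in each breakdown sum exactly to the stated totals; a minimal sketch, with the numbers copied from the two lists above:

# Sanity check: category counts sum to the stated totals of the two breakdowns above.
categories_2_8m = [1218630, 143000, 743850, 747291]  # Real-World Image, Real-World Text, Document/Chart/..., Math
categories_2_0m = [871160, 102389, 529709, 536928]
assert sum(categories_2_8m) == 2852771
assert sum(categories_2_0m) == 2040186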
We collect the following eight datasets. For MiniGemini, we selectively use only the data samples for DocVQA, ChartQA, DVQA, and AI2D, so there is no need to download all of the MiniGemini data samples.
Gathered Dataset Layout
Phantom_Dataset_Path
├── llava # ShareGPT4V
│   └── llava_pretrain
│       └── images
├── coco # ShareGPT4V
│   └── train2017
├── sam # ShareGPT4V
│   └── images
├── gqa # ShareGPT4V
│   └── images
├── ocr_vqa # ShareGPT4V
│   └── images
├── textvqa # ShareGPT4V
│   └── train_images
├── vg # ShareGPT4V
│   ├── VG_100K
│   └── VG_100K_2
├── share_textvqa # ShareGPT4V
│   └── images
├── web-celebrity # ShareGPT4V
│   └── images
├── web-landmark # ShareGPT4V
│   └── images
├── wikiart # ShareGPT4V
│   └── images
├── docvqa # MiniGemini
│   └── images
├── chartqa # MiniGemini
│   └── train
│       └── images
├── dvqa # MiniGemini
│   └── images
├── ai2d # MiniGemini
│   └── images
├── ALLaVA-4V # MiniGemini (ALLAVA-VFLAN)
│   └── allava_vflan
│       └── images
├── arxivqa # SMR (ArXivQA)
│   └── images
├── TextbookQA # SMR (TextbookQA)
│   ├── train
│   └── val
├── imgs # DocDownstream & DocReason
│   ├── ChartQA
│   ├── DUE_Benchmark
│   │   ├── DeepForm
│   │   ├── DocVQA
│   │   ├── InfographicsVQA
│   │   ├── KleisterCharity
│   │   ├── TabFact
│   │   └── WikiTableQuestions
│   ├── TextCaps
│   ├── TextVQA
│   └── VisualMRC
├── geo3k # GLLaVA
│   └── train
├── geoqa_plus # GLLaVA
├── images # MathVision
│
├── sharegpt4v_instruct_gpt4-vision_cap100k.json # ShareGPT4V-Caption
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json # ShareGPT4V-Instruction
├── Evol-Instruct-GPT4-Turbo-143K.json # ALLAVA4V-Text
├── SMR.json # SMR
├── train.jsonl # DocDownstream
├── detailed_explanation.jsonl # DocReason
├── minigemini_pretrain.json # MiniGemini-Pretrain
├── minigemini_instruction.json # MiniGemini-Instruction
├── gllava_align.parquet # GLLaVA-Align
├── gllava_qa.parquet # GLLaVA-QA
├── mathvision.parquet # MathVision
├── MathInstruct.json # MathInstruct
└── mathplus.parquet # MathPlus
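Before training, it can be useful to confirm that the top-level annotation files from the layout above are in place. A minimal sketch, assuming the dataset root path below is adjusted to your local setup; the script is not part of this repo:

from pathlib import Path

# Hypothetical helper: check the annotation files listed in the layout above.
dataset_root = Path('/path/to/Phantom_Dataset_Path')  # adjust to your local path
annotation_files = [
    'sharegpt4v_instruct_gpt4-vision_cap100k.json',
    'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json',
    'Evol-Instruct-GPT4-Turbo-143K.json',
    'SMR.json',
    'train.jsonl',
    'detailed_explanation.jsonl',
    'minigemini_pretrain.json',
    'minigemini_instruction.json',
    'gllava_align.parquet',
    'gllava_qa.parquet',
    'mathvision.parquet',
    'MathInstruct.json',
    'mathplus.parquet',
]
missing = [name for name in annotation_files if not (dataset_root / name).exists()]
print('All annotation files found.' if not missing else f'Missing: {missing}')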
Below is the list of evaluation datasets. Once you have downloaded all of them, they should be placed in the folder according to the following directory layout.
Evaluation Dataset Directory Layout
Evaluation_Dataset_Path
├── ScienceQA # SQA-IMG
├── ai2d # AI2D
├── chartqa # ChartQA
├── SEED-Bench # SEED-IMG
├── SEED-Bench-2-plus # SEED-Bench-2-Plus
├── POPE # POPE
├── HallusionBench # HallusionBench
├── MME_Benchmark_release_version # MME
├── MathVista # MathVista
├── MMBench # MMB
├── mm-vet # MM-Vet
├── mm-vet-v2 # MM-Vet-v2
├── llava-bench-in-the-wild # LLaVA Bench in the Wild
├── LLaVA-Bench-Wilder # LLaVA Wilder
├── BLINK # BLINK
├── CV-Bench # CV-Bench
├── VisualWebBench # VisualWebBench
├── MMStar # MMStar
└── MathVerse # MathVerse
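A minimal sketch for confirming that every benchmark folder from the layout above exists, assuming the evaluation root path below is adjusted to your local setup; the script is not part of this repo:

from pathlib import Path

# Hypothetical helper: check the evaluation benchmark folders listed in the layout above.
eval_root = Path('/path/to/Evaluation_Dataset_Path')  # adjust to your local path
benchmark_dirs = [
    'ScienceQA', 'ai2d', 'chartqa', 'SEED-Bench', 'SEED-Bench-2-plus', 'POPE',
    'HallusionBench', 'MME_Benchmark_release_version', 'MathVista', 'MMBench',
    'mm-vet', 'mm-vet-v2', 'llava-bench-in-the-wild', 'LLaVA-Bench-Wilder',
    'BLINK', 'CV-Bench', 'VisualWebBench', 'MMStar', 'MathVerse',
]
missing = [d for d in benchmark_dirs if not (eval_root / d).is_dir()]
print('All evaluation folders found.' if not missing else f'Missing: {missing}')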