Scale up visual instruction tuning to millions of instances with GPT-4.
📄 arXiv | 🤗 Data | ✨ Models
We Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning examples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs and 106K detailed image descriptions, generated by prompting GPT-4 with the abundant manual annotations of images.
| Dataset | Images | Object BBoxes | Region Descriptions | Image Captions | Instruction Questions | Response Answers | GPT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiniGPT-4 | 3.5K | - | - | - | 4 | 3.5K | GPT-3.5 |
| LLaVAR* | 16K | - | - | - | 16K | 16K | GPT-4 |
| LLaVA | 81.5K | 600K | - | 404.7K | 150K | 150K | GPT-4 |
| SVIT | 108.1K | 3.8M | 5.4M | 257.6K | 4.2M | 4.2M | GPT-4 |
*LLaVAR collects 422K noisy instruction-following examples using OCR results and 16K high-quality examples using GPT-4.
| Checkpoint | Data | Schedule | MME Perception | MME Cognition | MMBench | MMBench-Chinese | SEED-Bench-1 | MMMU | VQA-v2 | GQA | VizWiz | ScienceQA-IMG | TextVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVIT-v1.5-LoRA | SVIT-mix-665K | lora-1e | 1560.3 | 364.3 | 68.3 | 63.2 | 61.8 | 34.1 | 80.1 | 63.4 | 56.7 | 69.9 | 61.1 |
| SVIT-v1.5-Full | SVIT-mix-665K | full_ft-1e | 1565.8 | 323.2 | 69.1 | 63.1 | 61.9 | 33.3 | 80.3 | 64.1 | 56.4 | 70.0 | 60.8 |
The above models are trained with the LLaVA-v1.5 architecture. Please follow LLaVA to set up the code and evaluate the models.

For training, follow the visual instruction tuning stage of LLaVA-v1.5 and simply replace LLaVA-v1.5-mix-665K with our SVIT-mix-665K, keeping everything else unchanged. A quick sanity check of the data file is sketched below.
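The sketch below is a minimal sanity check, not part of the training pipeline. It assumes SVIT-mix-665K is a single JSON list whose samples carry `id`, `image` and `conversations` fields (the same layout as LLaVA-v1.5's mix file); the local file name `svit_mix_665k.json` is a placeholder.

```python
# Minimal sanity check that SVIT-mix-665K looks like a drop-in replacement for
# LLaVA-v1.5-mix-665K. Field names are assumptions matching LLaVA's data format.
import json

def inspect_mix(path="svit_mix_665k.json", n=2):  # hypothetical local file name
    with open(path) as f:
        data = json.load(f)
    print(f"{len(data)} training samples")
    for sample in data[:n]:
        print(sample.get("id"), sample.get("image"))
        # Conversation turns are expected to alternate between "human" and "gpt"
        for turn in sample.get("conversations", []):
            print(f'  [{turn.get("from")}] {str(turn.get("value"))[:80]}')

if __name__ == "__main__":
    inspect_mix()
```

If the samples print as alternating human/gpt turns, the file can be passed to LLaVA-v1.5's finetuning script wherever its original mix-665K JSON was used.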
We build SVIT on the Visual Genome dataset, which comprises 108,077 images with dense annotations for each image, including region descriptions, objects, attributes, relationships, etc. Since Visual Genome is partially sourced from MS-COCO, we also collect captions for these images from MS-COCO. Leveraging these annotations, we gather thorough and detailed descriptions of the images, including: (1) the 257,633 captions from MS-COCO; (2) the 3,802,374 object names and their corresponding bounding boxes from Visual Genome; (3) the 5,406,592 region descriptions from Visual Genome.
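As an illustration of how such annotations can be pulled together per image, here is a rough sketch; the file and field names (`objects.json`, `region_descriptions.json`, `captions_train2017.json`, `names`, `phrase`, etc.) follow the public Visual Genome and MS-COCO releases and are assumptions, not code from this repository.

```python
# Rough sketch: collect per-image COCO captions, object names with bounding boxes,
# and region descriptions. Note that Visual Genome and MS-COCO use different
# image-id spaces; a real pipeline would map them via the coco_id field in
# Visual Genome's image_data.json before merging.
import json
from collections import defaultdict

def load_annotations(vg_objects="objects.json",
                     vg_regions="region_descriptions.json",
                     coco_captions="captions_train2017.json"):
    per_image = defaultdict(lambda: {"captions": [], "objects": [], "regions": []})

    with open(coco_captions) as f:
        for ann in json.load(f)["annotations"]:
            per_image[ann["image_id"]]["captions"].append(ann["caption"])

    with open(vg_objects) as f:
        for entry in json.load(f):
            for obj in entry["objects"]:
                per_image[entry["image_id"]]["objects"].append(
                    (obj["names"][0], (obj["x"], obj["y"], obj["w"], obj["h"])))

    with open(vg_regions) as f:
        for entry in json.load(f):
            for region in entry["regions"]:
                per_image[entry["image_id"]]["regions"].append(region["phrase"])

    return per_image
```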
Inspired by LLaVA, we design four tasks and prompt the language-only GPT-4 chatbot to generate questions and answers accordingly. The prompts are summarized in this folder.

For rich diversity, we randomly sample an instruction for the detail description task, e.g., "Can you describe the image in detail?" The complete list of alternative instructions can be found in this file.
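The sketch below shows the shape of this generation step under stated assumptions: the task prompt and instruction list are paraphrased placeholders (the real ones live in the prompts folder and the instruction file above), and the API call uses the `openai>=1.0` Python SDK rather than whatever client the original pipeline used.

```python
# Hedged sketch of QA generation with the language-only GPT-4.
# Prompts and instructions below are illustrative placeholders.
import random
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

DETAIL_INSTRUCTIONS = [  # placeholder subset; see the repo's full instruction list
    "Can you describe the image in detail?",
    "Describe the image thoroughly.",
]

def generate_qa(task_prompt, image_annotations, model="gpt-4"):
    """Ask GPT-4 to write QA pairs from an image's textual annotations."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": task_prompt},
            {"role": "user", "content": image_annotations},
        ],
    )
    return response.choices[0].message.content

# For the detail description task, the instruction itself is sampled at random
# to diversify the questions in the final dataset.
instruction = random.choice(DETAIL_INSTRUCTIONS)
```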
We employ the open-source multimodal large language model LLaVA, which consists of a vision encoder, a large language model and a vision-language connector. We illustrate the model in Figure 1.
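For readers unfamiliar with this layout, here is a minimal, non-official sketch of how the three components fit together; the module choices, dimensions and two-layer MLP connector are illustrative assumptions in the spirit of LLaVA-v1.5, not this repository's actual implementation.

```python
# Minimal sketch of a LLaVA-style model: vision encoder -> MLP connector -> LLM.
# All names and dimensions are placeholders; the language model is assumed to
# accept precomputed input embeddings (HF-style inputs_embeds).
import torch
import torch.nn as nn

class LlavaStyleModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a frozen CLIP ViT
        self.language_model = language_model   # e.g. a Vicuna/LLaMA causal LM
        # Two-layer MLP vision-language connector
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images, input_embeds):
        # images -> patch features of shape (B, N_patches, vision_dim)
        patch_feats = self.vision_encoder(images)
        # Project visual features into the LLM embedding space
        visual_tokens = self.projector(patch_feats)        # (B, N_patches, llm_dim)
        # Prepend visual tokens to the text token embeddings and run the LLM
        inputs = torch.cat([visual_tokens, input_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```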
If you find this repository helpful, please cite the paper below.
```bibtex
@article{zhao2023svit,
  title   = {SVIT: Scaling up Visual Instruction Tuning},
  author  = {Zhao, Bo and Wu, Boya and He, Muyang and Huang, Tiejun},
  journal = {arXiv preprint arXiv:2307.04087},
  year    = {2023}
}
```