InternLM-XComposer2
Thanks to the community for the HuggingFace Demo | OpenXLab Demo of InternLM-XComposer2.
👋 join us on Discord and WeChat
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
ShareGPT4V: Improving Large Multi-modal Models with Better Captions
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models
InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) based on InternLM2-7B that excels in free-form text-image composition and comprehension. It boasts several amazing capabilities and applications:
Free-form Interleaved Text-Image Composition: InternLM-XComposer2 can effortlessly generate coherent and contextual articles with interleaved images following diverse inputs like outlines, detailed text requirements and reference images, enabling highly customizable content creation.
Accurate Vision-language Problem-solving: InternLM-XComposer2 accurately handles diverse and challenging vision-language Q&A tasks based on free-form instructions, excelling in recognition, perception, detailed captioning, visual reasoning, and more.
Awesome performance: InternLM-XComposer2, based on InternLM2-7B, not only significantly outperforms existing open-source multimodal models on 13 benchmarks but also matches or even surpasses GPT-4V and Gemini Pro on 6 benchmarks.
InternLM-XComposer2-4KHD further extends this understanding to images at up to 4K resolution.
We release the InternLM-XComposer2 series in four versions:
InternLM-XComposer2-4KHD-7B 🤗: The high-resolution multi-task trained VLLM model, with InternLM2-7B as the initialization of the LLM, for high-resolution understanding, VL benchmarks, and AI assistant use.
InternLM-XComposer2-VL-7B 🤗: The multi-task trained VLLM model, with InternLM2-7B as the initialization of the LLM, for VL benchmarks and AI assistant use. It ranks as the most powerful vision-language model based on 7B-parameter-level LLMs, leading across 13 benchmarks.
InternLM-XComposer2-VL-1.8B 🤗: A lightweight version of InternLM-XComposer2-VL based on InternLM2-1.8B.
InternLM-XComposer2-7B 🤗: The further instruction-tuned VLLM for interleaved text-image composition with free-form inputs.
Please refer to the Technical Report and the 4KHD Technical Report for more details.
https://github.com/InternLM/InternLM-XComposer/assets/22662425/fdb89a38-c650-45f2-b5b7-51182e89a5cc
Please refer to Chinese Demo for the demo of the Chinese version.
2024.04.22
🎉🎉🎉 The finetuning code of InternLM-XComposer2-4KHD-7B is publicly available.
2024.04.09
🎉🎉🎉 InternLM-XComposer2-4KHD-7B and its evaluation code are publicly available.
2024.04.09
🎉🎉🎉 InternLM-XComposer2-VL-1.8B is publicly available.
2024.02.22
🎉🎉🎉 We release DualFocus, a framework for integrating macro and micro perspectives within MLLMs to enhance vision-language task performance.
2024.02.06
🎉🎉🎉 InternLM-XComposer2-7B-4bit and InternLM-XComposer2-VL-7B-4bit are publicly available on Hugging Face and ModelScope.
2024.02.02
🎉🎉🎉 The finetuning code of InternLM-XComposer2-VL-7B is publicly available.
2024.01.26
🎉🎉🎉 The evaluation code of InternLM-XComposer2-VL-7B is publicly available.
2024.01.26
🎉🎉🎉 InternLM-XComposer2-7B and InternLM-XComposer2-VL-7B are publicly available on Hugging Face and ModelScope.
2024.01.26
🎉🎉🎉 We release a technical report with more details of the InternLM-XComposer2 series.
2023.11.22
🎉🎉🎉 We release ShareGPT4V, a large-scale, highly descriptive image-text dataset generated by GPT4-Vision, and a superior large multimodal model, ShareGPT4V-7B.
2023.10.30
🎉🎉🎉 InternLM-XComposer-VL ranked first in both Q-Bench and Tiny LVLM.
2023.10.19
🎉🎉🎉 Support for inference on multiple GPUs. Two 4090 GPUs are sufficient for deploying our demo.
2023.10.12
🎉🎉🎉 The 4-bit demo is supported; model files are available on Hugging Face and ModelScope.
2023.10.8
🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on ModelScope.
2023.9.27
🎉🎉🎉 The evaluation code of InternLM-XComposer-VL-7B is publicly available.
2023.9.27
🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on Hugging Face.
2023.9.27
🎉🎉🎉 We release a technical report with more details of our model series.
Model | Usage | Transformers(HF) | ModelScope | Release Date
---|---|---|---|---
InternLM-XComposer2-4KHD | 4K Resolution Understanding, Benchmark, VL-Chat | 🤗 internlm-xcomposer2-4khd-7b | internlm-xcomposer2-4khd-7b | 2024-04-09
InternLM-XComposer2-VL-1.8B | Benchmark, VL-Chat | 🤗 internlm-xcomposer2-vl-1_8b | internlm-xcomposer2-vl-1_8b | 2024-04-09
InternLM-XComposer2 | Text-Image Composition | 🤗 internlm-xcomposer2-7b | internlm-xcomposer2-7b | 2024-01-26
InternLM-XComposer2-VL | Benchmark, VL-Chat | 🤗 internlm-xcomposer2-vl-7b | internlm-xcomposer2-vl-7b | 2024-01-26
InternLM-XComposer2-4bit | Text-Image Composition | 🤗 internlm-xcomposer2-7b-4bit | internlm-xcomposer2-7b-4bit | 2024-02-06
InternLM-XComposer2-VL-4bit | Benchmark, VL-Chat | 🤗 internlm-xcomposer2-vl-7b-4bit | internlm-xcomposer2-vl-7b-4bit | 2024-02-06
InternLM-XComposer | Text-Image Composition, VL-Chat | 🤗 internlm-xcomposer-7b | internlm-xcomposer-7b | 2023-09-26
InternLM-XComposer-4bit | Text-Image Composition, VL-Chat | 🤗 internlm-xcomposer-7b-4bit | internlm-xcomposer-7b-4bit | 2023-09-26
InternLM-XComposer-VL | Benchmark | 🤗 internlm-xcomposer-vl-7b | internlm-xcomposer-vl-7b | 2023-09-26
We evaluate InternLM-XComposer2-VL on 16 multimodal benchmarks: MMStar, DocVQA, Infographics VQA, TextVQA, ChartQA, OCRBench, MathVista, MMMU, AI2D, MME, MMBench, MMBench-CN, SEED-Bench, QBench, HallusionBench, MM-Vet.
See Evaluation Details here.
Method | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench | MMStar | MathVista | AI2D | MMMU | MME | MMB | MMBCN | SEEDI | QBenchT | MM-Vet | HallB
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Open-source Previous SOTA | DocOwl 1.5 | DocOwl 1.5 | DocOwl 1.5 | CogAgent | CogAgent | LLaVA-N | LLaVA-N | LLaVA-N | Int-VL | WeMM | LLaVA-N | LLaVA-N | LLaVA-N | Int-XC | CogVLM | Monkey
(model size) | 8B | 8B | 8B | 18B | 18B | 35B | 35B | 35B | 40B | 6B | 35B | 35B | 35B | 8B | 17B | 10B
(score) | 82.2 | 70.2 | 44.5 | 76.1 | 59.0 | 52.1 | 39.0 | 78.9 | 51.6 | 2,050.2 | 81.1 | 79.0 | 75.7 | 64.4 | 54.5 | 39.3
GPT-4V | 88.4 | 78.5 | 75.1 | 78.0 | 51.6 | 57.1 | 47.8 | 75.5 | 56.8 | 1,926.5 | 77.0 | 74.4 | 69.1 | 74.1 | 56.8 | 46.5
Gemini-Pro | 88.1 | 74.1 | 75.2 | 74.6 | 68.0 | 42.6 | 45.8 | 70.2 | 47.9 | 1,933.3 | 73.6 | 74.3 | 70.7 | 70.6 | 59.2 | 45.2
InternLM-XComposer2-VL | 57.7 | 72.6 | 34.4 | 70.1 | 53.2 | 55.4 | 57.6 | 81.2 | 41.4 | 2,220.4 | 80.7 | 79.4 | 74.9 | 72.5 | 46.7 | 41.0
InternLM-XComposer2-4KHD | 90.0 | 81.0 | 68.6 | 77.2 | 67.5 | 54.1 | 57.8 | 80.9 | 39.9 | 2,204.9 | 80.2 | 77.7 | 74.7 | 71.8 | 54.9 | 40.9
Method | LLM | MMStar | MathVista | AI2D | MMEP | MMEC | MMB | MMBCN | SEEDI | QBenchT | MM-Vet
---|---|---|---|---|---|---|---|---|---|---|---
InstructBLIP | Vicuna-7B | --- | 25.3 | 40.6 | --- | --- | 36.0 | 23.7 | 53.4 | 55.9 | 26.2
Qwen-VL-Chat | Qwen-7B | 37.5 | 33.8 | 63.0 | 1,487.5 | 360.7 | 60.6 | 56.7 | 58.2 | 61.7 | 47.3
LLaVA-1.5 | Vicuna-13B | 13.9 | 26.1 | 61.1 | 1,531.3 | 295.4 | 67.7 | 63.6 | 68.2 | 61.4 | 35.4
ShareGPT4V | Vicuna-7B | 11.9 | 25.8 | 58.0 | 1,567.4 | 376.4 | 68.8 | 62.2 | 69.7 | --- | 37.6
CogVLM-17B | Vicuna-7B | 14.9 | 34.7 | 63.3 | --- | --- | 65.8 | 55.9 | 68.8 | --- | 54.5
LLaVA-XTuner | InternLM2-20B | --- | 24.6 | 65.4 | --- | --- | 75.1 | 73.7 | 70.2 | --- | 37.2
Monkey | Qwen-7B | 38.3 | 34.8 | 62.5 | 1,522.4 | 401.4 | 72.4 | 67.5 | 68.9 | --- | 33.0
LLaVA-Next | Vicuna-13B | 38.3 | 32.4 | 72.2 | 1,445.0 | 296.0 | 70.0 | 68.5 | 71.4 | --- | 44.9
InternLM-XC | InternLM-7B | --- | 29.5 | 56.9 | 1,528.4 | 391.1 | 74.4 | 72.4 | 66.1 | 64.4 | 35.2
InternLM-XComposer2-VL | InternLM2-7B | 55.4 | 57.6 | 81.2 | 1,712.0 | 530.7 | 80.7 | 79.4 | 74.9 | 72.5 | 46.7
InternLM-XComposer2-4KHD | InternLM2-7B | 54.1 | 57.8 | 80.9 | 1,655.9 | 548.9 | 80.2 | 77.7 | 74.7 | 71.8 | 54.9
Method | LLM | MMStar | MathVista | MMMU | MMEP | MMEC | CCBench | MMB | SEEDI | MM-Vet | HallB | ChartQA | OCRBench | TextVQA | DocVQA | InfoVQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MobileVLM | MobileLLaMA 2.7B | --- | --- | --- | 1,288.9 | --- | --- | 59.6 | --- | --- | --- | --- | --- | --- | --- | --- |
LLaVA-Phi | Phi2-2.7B | --- | --- | --- | 1,335.1 | --- | --- | 59.8 | --- | --- | --- | --- | --- | --- | --- | --- |
MoE-LLaVA | 4x Phi-2 2.7B | --- | --- | --- | 1,431.3 | --- | --- | 68.0 | --- | --- | --- | --- | --- | --- | --- | --- |
TinyLLaVA | Phi2-2.7B | 36.0 | --- | --- | 1,464.9 | --- | --- | 66.9 | --- | 32.0 | --- | --- | --- | --- | --- | --- |
InternLM-XComposer2-VL | InternLM2-1.8B | 46.3 | 48.2 | 30.1 | 1,465.9 | 420.0 | 41.4 | 72.5 | 70.4 | 30.1 | 34.4 | 57.8 | 46.0 | 65.9 | 48.3 | 24.1
Before running the code, make sure you have set up the environment and installed the required packages; please refer to the installation instructions.
We provide a simple example to show how to use InternLM-XComposer with 🤗 Transformers.
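A minimal sketch of such a chat call, adapted from the Hugging Face model card of internlm/internlm-xcomposer2-vl-7b (the image path is illustrative):
import torch
from transformers import AutoModel, AutoTokenizer
torch.set_grad_enabled(False)
# load model and tokenizer; trust_remote_code is required because the chat API ships with the checkpoint
model = AutoModel.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2-vl-7b', trust_remote_code=True)
# the <ImageHere> placeholder marks where the image is injected into the prompt
query = '<ImageHere>Please describe this image in detail.'
image = './examples/image1.webp'  # illustrative path
with torch.cuda.amp.autocast():
    response, _ = model.chat(tokenizer, query=query, image=image, history=[], do_sample=False)
print(response)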
If you have multiple GPUs, but the memory size of each GPU is not enough to accommodate the entire model, you can split the model across multiple GPUs. First, install accelerate with pip install accelerate. Then, execute the following script for chat:
# chat with 2 GPUs
python examples/example_chat.py --num_gpus 2
If inference optimization is required for the InternLM-XComposer2 models, we recommend using LMDeploy.
In the following subsections, we will introduce the usage of LMDeploy with the internlm-xcomposer2-4khd-7b model as an example.
First of all, please install the pypi package with pip install lmdeploy. By default, it depends on CUDA 12.x; for a CUDA 11.x environment, please refer to the installation guide.
from lmdeploy import pipeline
from lmdeploy.vl import load_image
# build an inference pipeline from the Hugging Face model id
pipe = pipeline('internlm/internlm-xcomposer2-4khd-7b')
# load the bundled example image and run a single-turn visual query
image = load_image('examples/4khd_example.webp')
response = pipe(('describe this image', image))
print(response)
For more on using the VLM pipeline, including multi-image inference and multi-turn chat, please refer to this guide.
LMDeploy supports one-click packaging of the InternLM-XComposer2 model into an OpenAI service, providing seamless integration with the OpenAI API.
The service can be launched by one command as below:
lmdeploy serve api_server internlm/internlm-xcomposer2-4khd-7b
The arguments of api_server can be viewed with the command lmdeploy serve api_server -h, for instance, --tp to set tensor parallelism, --session-len to specify the max length of the context window, and --cache-max-entry-count to adjust the GPU memory ratio for the k/v cache.
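For example, a launch combining these options might look like the following (the values are illustrative, not recommendations):
# 2-way tensor parallelism, 16k context window, half of free GPU memory for the k/v cache
lmdeploy serve api_server internlm/internlm-xcomposer2-4khd-7b --tp 2 --session-len 16384 --cache-max-entry-count 0.5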
For more details, including service startup with Docker, RESTful API information, and OpenAI integration methods, please refer to this guide.
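Once the server is running, it can be queried with the standard openai Python client; below is a minimal sketch, assuming the server's default port 23333 and a placeholder image URL:
from openai import OpenAI
# a local api_server listens on port 23333 by default and needs no real API key
client = OpenAI(api_key='none', base_url='http://0.0.0.0:23333/v1')
# discover the served model name from the /v1/models endpoint
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'Describe this image'},
            {'type': 'image_url', 'image_url': {'url': 'https://example.com/sample.jpg'}},  # placeholder URL
        ],
    }])
print(response.choices[0].message.content)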
We provide 4-bit quantized models to ease the memory requirements of the models. To run the 4-bit models (GPU memory >= 12GB), first install the corresponding dependency, then execute the following script for chat.
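As a sketch, the 4-bit checkpoints appear to be GPTQ-quantized, so the dependency would be the auto_gptq package; the chat script name below is an assumption following the repo's example naming:
# install the 4-bit runtime dependency (assumption: GPTQ backend)
pip install auto_gptq
# chat with the 4-bit model (hypothetical script name)
python examples/example_chat_4bit.py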
For finetuning, please refer to our finetune scripts.
Thanks to the community for the third-party HuggingFace Demo.
We provide code for users to build a web UI demo.
Please run the command below for Composition / Chat:
# For Free-form Text-Image Composition
python examples/gradio_demo_composition.py
# For Multimodal Chat
python examples/gradio_demo_chat.py
User guidance for the UI demo is given HERE. If you wish to change the default folder of the model, please use the --folder=new_folder option.
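For instance, to launch the chat demo with weights loaded from a local directory (the path is a placeholder):
# load model weights from a custom folder instead of the default
python examples/gradio_demo_chat.py --folder=new_folder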
If you find our models / code / papers useful in your research, please consider giving ⭐ and citations 📝, thx :)
@article{internlmxcomposer2_4khd,
title={InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD},
author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Songyang Zhang and Haodong Duan and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Zhe Chen and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Kai Chen and Conghui He and Xingcheng Zhang and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2404.06512},
year={2024}
}
@article{internlmxcomposer2,
title={InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model},
author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Xilin Wei and Songyang Zhang and Haodong Duan and Maosong Cao and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2401.16420},
year={2024}
}
@article{internlmxcomposer,
title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition},
author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
journal={arXiv preprint arXiv:2309.15112},
year={2023}
}
The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English) / application form (Chinese). For other questions or collaborations, please contact internlm@pjlab.org.cn.