InternLM/InternLM-XComposer

InternLM-XComposer-2.5

InternLM-XComposer2.5 🤗

｜ XComposer2.5 Technical Report 📄 [English](./README.md) | [简体中文](./README_CN.md)

Thanks the community for HuggingFace Demo | OpenXLab Demo of InternLM-XComposer-2.5.

👋 join us on Discord and WeChat

Multimodal Projects of Our Team

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternLM-XComposer2-: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

ShareGPT4V: Improving Large Multi-modal Models with Better Captions

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models

InternLM-XComposer-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. IXC-2.5 is trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to perform exceptionally well in tasks requiring extensive input and output contexts.

Ultra-High Resolution Understanding: IXC-2.5 enhances the dynamic resolution solution proposed in IXC2-4KHD with a native 560 × 560 ViT vision encoder, supporting high-resolution images with any aspect ratio.
Fine-Grained Video Understanding: IXC-2.5 treats videos as a ultra-high-resolution composite picture consisting of tens to hundreds of frames, allowing it to capture fine details through dense sampling and higher resolution for each frame.
Multi-Turn Multi-Image Dialogue: IXC-2.5 supports free-form multi-turn multi-image dialogue, allowing it to naturally interact with humans in multi-round conversations.
Webpage Crafting: IXC-2.5 can be readily applied to create webpages by composing source code (HTML, CSS, and JavaScript) following text-image instructions.
Composing High-Quality Text-Image Articles: IXC-2.5 leverages specially designed Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques to significantly enhance the quality of its written content.
Awesome performance: IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

Please refer to Technical Report for more details.

Demo Video

🔥 For the best experience, please keep the audio on while enjoying the video.

https://github.com/InternLM/InternLM-XComposer/assets/147793160/8206f07f-3166-461e-a631-9cbcdec6ae75

Youtube Video

Please refer to Chinese Demo for the demo of the Chinese version.

News and Updates

2024.07.15 🎉🎉🎉 ModelScope Swift supports InternLM-XComposer2.5-7B for finetuning and inference.
2024.07.15 🎉🎉🎉 LMDepoly supports InternLM-XComposer2.5-7B for 4 bit quantization and inference.
2024.07.15 🎉🎉🎉 InternLM-XComposer2.5-7B-4bit is publicly available.
2024.07.03 🎉🎉🎉 InternLM-XComposer2.5-7B is publicly available.
2024.07.01 🎉🎉🎉 ShareGPT4V is accepted by ECCV2024.
2024.04.22 🎉🎉🎉 The finetune code of InternLM-XComposer2-VL-7B-4KHD-7B are publicly available.
2024.04.09 🎉🎉🎉 InternLM-XComposer2-4KHD-7B and evaluation code are publicly available.
2024.04.09 🎉🎉🎉 InternLM-XComposer2-VL-1.8B is publicly available.
2024.02.22 🎉🎉🎉 We release DualFocus, a framework for integrating macro and micro perspectives within MLLMs to enhance vision-language task performance.
2024.02.06 🎉🎉🎉 InternLM-XComposer2-7B-4bit and InternLM-XComposer-VL2-7B-4bit are publicly available on Hugging Face and ModelScope.
2024.02.02 🎉🎉🎉 The finetune code of InternLM-XComposer2-VL-7B are publicly available.
2024.01.26 🎉🎉🎉 The evaluation code of InternLM-XComposer2-VL-7B are publicly available.
2024.01.26 🎉🎉🎉 InternLM-XComposer2-7B and InternLM-XComposer-VL2-7B are publicly available on Hugging Face and ModelScope.
2024.01.26 🎉🎉🎉 We release a technical report for more details of InternLM-XComposer2 series.
2023.11.22 🎉🎉🎉 We release the ShareGPT4V, a large-scale highly descriptive image-text dataset generated by GPT4-Vision and a superior large multimodal model, ShareGPT4V-7B.
2023.10.30 🎉🎉🎉 InternLM-XComposer-VL achieved the top 1 ranking in both Q-Bench and Tiny LVLM.
2023.10.19 🎉🎉🎉 Support for inference on multiple GPUs. Two 4090 GPUs are sufficient for deploying our demo.
2023.10.12 🎉🎉🎉 4-bit demo is supported, model files are available in Hugging Face and ModelScope.
2023.10.8 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on ModelScope.
2023.9.27 🎉🎉🎉 The evaluation code of InternLM-XComposer-VL-7B are publicly available.
2023.9.27 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on Hugging Face.
2023.9.27 🎉🎉🎉 We release a technical report for more details of our model series.

Model Zoo

Model	Usage	Transformers(HF)	ModelScope(HF)	Release Date
InternLM-XComposer-2.5	Video Understanding, Multi-image Multi-tune Dialog, 4K Resolution Understanding, Web Craft, Article creation, Benchmark	🤗internlm-xcomposer2.5	internlm-xcomposer2.5	2024-07-03
InternLM-XComposer2-4KHD	4K Resolution Understanding, Benchmark, VL-Chat	🤗internlm-xcomposer2-4khd-7b	internlm-xcomposer2-4khd-7b	2024-04-09
InternLM-XComposer2-VL-1.8B	Benchmark, VL-Chat	🤗internlm-xcomposer2-vl-1_8b	internlm-xcomposer2-vl-1_8b	2024-04-09
InternLM-XComposer2	Text-Image Composition	🤗internlm-xcomposer2-7b	internlm-xcomposer2-7b	2024-01-26
InternLM-XComposer2-VL	Benchmark, VL-Chat	🤗internlm-xcomposer2-vl-7b	internlm-xcomposer2-vl-7b	2024-01-26
InternLM-XComposer2-4bit	Text-Image Composition	🤗internlm-xcomposer2-7b-4bit	internlm-xcomposer2-7b-4bit	2024-02-06
InternLM-XComposer2-VL-4bit	Benchmark, VL-Chat	🤗internlm-xcomposer2-vl-7b-4bit	internlm-xcomposer2-vl-7b-4bit	2024-02-06
InternLM-XComposer	Text-Image Composition, VL-Chat	🤗internlm-xcomposer-7b	internlm-xcomposer-7b	2023-09-26
InternLM-XComposer-4bit	Text-Image Composition, VL-Chat	🤗internlm-xcomposer-7b-4bit	internlm-xcomposer-7b-4bit	2023-09-26
InternLM-XComposer-VL	Benchmark	🤗internlm-xcomposer-vl-7b	internlm-xcomposer-vl-7b	2023-09-26

Evaluation

We evaluate InternLM-XComposer-2.5 on 28 multimodal benchmarks, including image benchmarks MMDU, MMStar, RealWorldQA, Design2Code, DocVQA, Infographics VQA, TextVQA, ChartQA, OCRBench, DeepFrom, WTQ, VisualMRC, TabFact, MathVista, MMMU, AI2D, MME, MMBench, MMBench-CN, SEED-Bench, HallusionBench, MM-Vet, and video benchmarks MVBench, MLVU, Video-MME, MMBench-Video, TempCompass

See Evaluation Details here.

Compared with closed-source APIs and previous SOTAs on Video and Structural High-resolution images.

	MVBench	MLVU	MME-Video	MMBench-Video	TempCompass	DocVQA	ChartVQA	InfoVQA	TextVQA	OCRBench	DeepForm	WTQ	VisualMRC	TabFact
	VideoChat2	InternVL1.5	LIVA	InternVL1.5	Qwen-VL	InternVL1.5	InternVL1.5	InternVL1.5	InternVL1.5	GLM-4v	DocOwl 1.5	DocOwl 1.5	DocOwl 1.5	DocOwl 1.5
	7B	26B	34B	26B	7B	26B	26B	26B	26B	9B	8B	8B	8B	8B
	60.4	50.4	59.0	42.0	52.9	90.9	83.8	72.5	80.6	77.6	68.8	40.6	246.4	80.2

GPT-4V	43.5	49.2	59.9	56.0	---	88.4	78.5	75.1	78.0	51.6	---	---	---	---
Gemini-Pro	---	---	75.0	49.3	67.1	88.1	74.1	75.2	74.6	68.0	---	---	---	---
Ours	69.1	58.8	55.8	46.9		90.9	82.2	69.9	78.2	69.0	71.2	53.6	307.5	85.2

Compared with closed-source APIs and previous SOTAs on Multi-Image dialog and General Visual QA Benchmarks.

	MVBench	MLVU	MME-Video	MMBench-Video	TempCompass	DocVQA	ChartVQA	InfoVQA	TextVQA	OCRBench	DeepForm	WTQ	VisualMRC	TabFact
	VideoChat2	InternVL1.5	LIVA	InternVL1.5	Qwen-VL	InternVL1.5	InternVL1.5	InternVL1.5	InternVL1.5	GLM-4v	DocOwl 1.5	DocOwl 1.5	DocOwl 1.5	DocOwl 1.5
	7B	26B	34B	26B	7B	26B	26B	26B	26B	9B	8B	8B	8B	8B
	60.4	50.4	59.0	42.0	58.4	90.9	83.8	72.5	80.6	77.6	68.8	40.6	246.4	80.2

GPT-4V	43.5	49.2	59.9	56.0	---	88.4	78.5	75.1	78.0	51.6	---	---	---	---
Gemini-Pro	---	---	75.0	49.3	70.6	88.1	74.1	75.2	74.6	68.0	---	---	---	---
Ours	69.1	58.8	55.8	46.9	67.1	90.9	82.2	69.9	78.2	69.0	71.2	53.6	307.5	85.2

Requirements

python 3.8 and above
pytorch 1.12 and above, 2.0 and above are recommended
CUDA 11.4 and above are recommended (this is for GPU users)
flash-attention2 is required for high-resolution usage of InternLM-XComposer2.5.

Installation

Before running the code, make sure you have setup the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries. Please refer to the installation instructions

Quickstart

We provide a simple example to show how to use InternLM-XComposer-2.5 with 🤗 Transformers.

Video Understanding

Multi-Image Mutli-Tune Dialog

High Resolution Image Understanding

Instruction to Webpage

Resume to Webpage

Screenshot to Webpage

Write Article

```python import torch from transformers import AutoModel, AutoTokenizer torch.set_grad_enabled(False) # init model and tokenizer model = AutoModel.from_pretrained('internlm/internlm-xcomposer2d5-7b', torch_dtype=torch.bfloat16, trust_remote_code=True).cuda().eval().half() tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer2d5-7b', trust_remote_code=True) model.tokenizer = tokenizer query = '阅读下面的材料，根据要求写作。电影《长安三万里》的出现让人感慨，影片并未将重点全落在大唐风华上，也展现了恢弘气象的阴暗面，即旧门阀的资源垄断、朝政的日益衰败与青年才俊的壮志难酬。高适仕进无门，只能回乡>沉潜修行。李白虽得玉真公主举荐，擢入翰林，但他只是成为唐玄宗的御用文人，不能真正实现有益于朝政的志意。然而，片中高潮部分《将进酒》一节，人至中年、挂着肚腩的李白引众人乘仙鹤上天，一路从水面、瀑布飞升至银河进入仙>宫，李白狂奔着与仙人们碰杯，最后大家纵身飞向漩涡般的九重天。肉身的微贱、世路的“天生我材必有用，坎坷，拘不住精神的高蹈。“天生我材必有用，千金散尽还复来。” 古往今来，身处闲顿、遭受挫折、被病痛折磨，很多人都曾经历>了人生的“失意”，却反而成就了他们“诗意”的人生。对正在追求人生价值的当代青年来说，如何对待人生中的缺憾和困顿?诗意人生中又有怎样的自我坚守和自我认同?请结合“失意”与“诗意”这两个关键词写一篇文章。要求:选准角度，确定>立意，明确文体，自拟标题;不要套作，不得抄袭;不得泄露个人信息;不少于 800 字。' with torch.autocast(device_type='cuda', dtype=torch.float16): response = model.write_artical(query, seed=8192) print(response) #诗意人生，贵在坚守 #《菜根谭》有云:“闲时要有吃紧的心思,忙里要留吃闲工夫。”人生在世,总有失意之时,当面对缺憾和困顿,诗意地生活着才能为人生增添一抹亮色。何谓诗意地生活? 所谓诗意地生活，便是在于坚守本心、直面遗憾、超越自我,在失意中寻找人生价值。 #诗意地生活,需坚守本心,淡然处之。 #陶渊明曾执意辞去彭泽县令,归隐田园,“采菊东篱下,悠然见南山”,在山水间寄情自娱；王维面对仕途失意,终日沉醉于诗酒之中,“兴来每独往,胜事空自知”,在诗酒中闲逸自如;李白仕途不顺,被赐金放还,但他依旧豪气干云,“天生我才必有用,千金散尽还复来”,在失意中坦然豁达。坚守本心，便能在遭遇失意之时守住自己的精神家园,让生活充满诗意。反之,若不能坚守本心,而只是一味迎合世俗以求得升迁,那纵使身居高位,亦会丧失生活的乐趣。 #诗意地生活,需直面遗憾,超越自我。 #“西塞山前白鹭飞,桃花流水鳜鱼肥。青箬笠,绿柳枝,半斤酒,一纶丝。五湖四海皆如此,何妨到此处归。”白居易的《渔歌子》写出了多少人的愿望:没有权势纷扰,没有贫困凄凉,只有青山绿水、白鹭鸥鸟作伴,如此自由自在的生活令人神往。然而,白居易却并没有因此真的归隐山林,而是直面人生,超越自我,写下了一首首诗意而富有现实关怀的作品。如果白居易只顾逃避人生,那又怎会拥有“大弦嘈嘈如急雨,小弦切切如私语”的绝美比喻呢?如果白居易只顾归隐山林,那又怎会写出“此曲只应天上有,人间哪得配白居易”这样的诗句呢? #诗意地生活,需直面遗憾,坚守本心。 #李文波患有渐冻症,医生说他活不过五年,但他没有因此放弃对音乐的热爱,而是与病魔作斗争,演奏出美妙的乐曲;孙家林自幼患有脑瘫,但他不甘于命运的捉弄,终成全国最美教师;史铁生饱受疾病折磨,但他仍能发出“我常常在我的心头清点,我有什么?”的叩问,并由此走上文学道路,为后世留下丰厚的文化遗产。这些人没有逃避,而是选择直面人生的缺憾,在坚守本心的同时超越自我,最终实现了自己的价值。 #诗意地生活,是于失意中坚守本心,于缺憾中超越自我。当面对人生的缺憾与挫折,坚守本心、超越自我的同时,也必将书写属于自己的辉煌篇章。 #愿你我都能诗意地生活着! query = 'Please write a blog based on the title: French Pastries: A Sweet Indulgence' with torch.autocast(device_type='cuda', dtype=torch.float16): response = model.write_artical(query, seed=8192) print(response) #French Pastries: A Sweet Indulgence #The French are well known for their love of pastries, and it’s a love that is passed down through generations. When one visits France, they are treated to an assortment of baked goods that can range from the delicate macaron to the rich and decadent chocolate mousse. While there are many delicious types of pastries found in France, five stand out as being the most iconic. Each of these pastries has its own unique qualities that make it special. #1. Croissant #One of the most famous pastries from France is the croissant. It is a buttery, flaky pastry that is best enjoyed fresh from the bakery. The dough is laminated with butter, giving it its signature layers. Croissants are typically eaten for breakfast or brunch, often accompanied by coffee or hot chocolate. #2. Macaron #The macaron is a small, delicate French confection made from almond flour, powdered sugar, and egg whites. The macaron itself is sandwiched with a ganache or jam filling. They come in a variety of colors and flavors, making them a popular choice for both casual snacking and upscale desserts. #3. Madeleine #The madeleine is a small shell-shaped cake that is light and sponge-like. It is often flavored with lemon or orange zest and sometimes dipped in chocolate. Madeleines are perfect for an afternoon snack with tea or coffee. #4. Éclair #The éclair is a long, thin pastry filled with cream and topped with chocolate glaze. It is a classic French treat that is both sweet and satisfying. Éclairs can be found in bakeries all over France and are often enjoyed with a cup of hot chocolate. #5. Tarte Tatin #The tarte Tatin is an apple tart that is known for its caramelized apples and puff pastry crust. It is named after the Tatin sisters who created the recipe in the late 19th century. Tarte Tatin is best served warm with a scoop of vanilla ice cream. #These pastries are just a few of the many delicious treats that France has to offer. Whether you are a seasoned traveler or a first-time visitor, indulging in French pastries is a must-do activity. So go ahead, treat yourself—you deserve it! ```

Inference on Multiple GPUs

If you have multiple GPUs, but the memory size of each GPU is not enough to accommodate the entire model, you can split the model across multiple GPUs. First, install accelerate using the command: pip install accelerate. Then, execute the follows scripts for chat:

# chat with 2 GPUs
python example_code/example_chat.py --num_gpus 2

Inference Acceleration by LMDeploy

If InternLM-XComposer2d5 model inference optimization is required, we recommend using LMDeploy.

In the following subsections, we will introduce the usage of LMDeploy with the internlm-xcomposer2d5-7b model as an example.

First of all, please install the pypi package with pip install lmdeploy. By default, it depends on CUDA 12.x. For a CUDA 11.x environment, please refer to the installation guide.

Offline Inference Pipeline

from lmdeploy import pipeline
from lmdeploy.vl import load_image
pipe = pipeline('internlm/internlm-xcomposer2d5-7b')
image = load_image('examples/dubai.png')
response = pipe(('describe this image', image))
print(response.text)

For more on using the VLM pipeline, including multi-image inference or multi-turn chat, please overview this guide.

4-Bit Model

We offer 4-bit quantized models via LMDeploy to reduce memory requirements. For a memory usage comparison, please refer to here.

from lmdeploy import TurbomindEngineConfig, pipeline
from lmdeploy.vl import load_image
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline('internlm/internlm-xcomposer2d5-7b-4bit', backend_config=engine_config)
image = load_image('examples/dubai.png')
response = pipe(('describe this image', image))
print(response.text)

Finetune

Please refer to our finetune scripts.
Inference and finetune support from ModelScope Swift

Gradio Deploy

We provide code for users to build a web UI demo. Please use gradio==4.13.0

Please run the command below for Chat / Composition:

# For Multimodal Chat
python gradio_demo/gradio_demo_chat.py

# For Free-form Text-Image Composition
python gradio_demo/gradio_demo_composition.py

The user guidance of UI demo is given in HERE. If you wish to change the default folder of the model, please use the --code_path=new_folder option.

Citation

If you find our models / code / papers useful in your research, please consider giving ⭐ and citations 📝, thx :)

@article{internlmxcomposer2_5,
      title={InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output}, 
      author={Pan Zhang and Xiaoyi Dong and Yuhang Zang and Yuhang Cao and Rui Qian and Lin Chen and Qipeng Guo and Haodong Duan and Bin Wang and Linke Ouyang and Songyang Zhang and Wenwei Zhang and Yining Li and Yang Gao and Peng Sun and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Hang Yan and Conghui He and Xingcheng Zhang and Kai Chen and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2407.03320},
      year={2024}
}

@article{internlmxcomposer2_4khd,
      title={InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD},
      author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Songyang Zhang and Haodong Duan and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Zhe Chen and Xinyue Zhang and Wei Li and Jingwen Li and Wenhai Wang and Kai Chen and Conghui He and Xingcheng Zhang and Jifeng Dai and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2404.06512},
      year={2024}
}

@article{internlmxcomposer2,
      title={InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model},
      author={Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Bin Wang and Linke Ouyang and Xilin Wei and Songyang Zhang and Haodong Duan and Maosong Cao and Wenwei Zhang and Yining Li and Hang Yan and Yang Gao and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2401.16420},
      year={2024}
}

@article{internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition},
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      journal={arXiv preprint arXiv:2309.15112},
      year={2023}
}

License & Contact Us

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表（中文）. For other questions or collaborations, please contact internlm@pjlab.org.cn.