MiniCPM-V is a series of end-side multimodal LLMs (MLLMs) designed for vision-language understanding. The models take images, videos and text as inputs and provide high-quality text outputs. Since February 2024, we have released 5 versions of the model, aiming to achieve strong performance and efficient deployment. The most notable models in this series currently include:
- **MiniCPM-V 2.6**: The latest and most capable model in the MiniCPM-V series. With a total of 8B parameters, the model surpasses GPT-4V in single image, multi-image and video understanding. It outperforms GPT-4o mini, Gemini 1.5 Pro and Claude 3.5 Sonnet in single image understanding, and advances MiniCPM-Llama3-V 2.5's features such as strong OCR capability, trustworthy behavior, multilingual support, and end-side deployment. Due to its superior token density, MiniCPM-V 2.6 can for the first time support real-time video understanding on end-side devices such as iPad.
- **MiniCPM-V 2.0**: The lightest model in the MiniCPM-V series. With 2B parameters, it surpasses larger models such as Yi-VL 34B, CogVLM-Chat 17B, and Qwen-VL-Chat 10B in overall performance. It can accept image inputs of any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving performance comparable to Gemini Pro in scene-text understanding and matching GPT-4V in low hallucination rates.
News
Pinned
- [2024.08.17] MiniCPM-V 2.6 is now fully supported by official llama.cpp! GGUF models of various sizes are available here.
- [2024.08.15] We now also support multi-image SFT. For more details, please refer to the document.
- [2024.08.14] MiniCPM-V 2.6 now also supports fine-tuning with the SWIFT framework!
- [2024.08.10] MiniCPM-Llama3-V 2.5 is now fully supported by official llama.cpp! GGUF models of various sizes are available here.
- [2024.08.06] We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now!
- [2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See here.
- [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See here.
- [2024.05.28] We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics here.
- [2024.05.23] We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency. Click here to view more details.
- [2024.05.23] MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio's official account, is available here. Come and try it out!
Click to view more news.
* [2024.06.03] Now you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across multiple GPUs. For more details, check this [link](https://github.com/OpenBMB/MiniCPM-V/blob/main/docs/inference_on_multiple_gpus.md).
* [2024.05.28] MiniCPM-Llama3-V 2.5 is now fully supported in llama.cpp and ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md), [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5)). GGUF models in various sizes are available [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf/tree/main). The MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned!
* [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage)!
* [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides a 6~8 token/s smooth decoding on mobile phones. Try it now!
* [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability and supports 30+ languages, representing the first end-side MLLM to achieve GPT-4V-level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](./finetune/readme.md). Try it now!
* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details.
* [2024.04.18] We create a HuggingFace Space to host the demo of MiniCPM-V 2.0 at [here](https://huggingface.co/spaces/openbmb/MiniCPM-V-2)!
* [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now!
* [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v-2最佳实践.md) with the SWIFT framework!
* [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. Click here to view the MiniCPM-V 2.0 technical blog.
* [2024.03.14] MiniCPM-V now supports [fine-tuning](https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/minicpm-v最佳实践.md) with the SWIFT framework. Thanks to [Jintao](https://github.com/Jintao-Huang) for the contribution!
* [2024.03.01] MiniCPM-V now can be deployed on Mac!
* [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.
Contents
MiniCPM-V 2.6
MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. The model is built on SigLip-400M and Qwen2-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-Llama3-V 2.5, and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:
- **Leading Performance.** MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single image understanding.
- **Multi-image Understanding and In-context Learning.** MiniCPM-V 2.6 can also perform conversation and reasoning over multiple images. It achieves state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv and Sciverse mv, and also shows promising in-context learning capability.
- **Video Understanding.** MiniCPM-V 2.6 can also accept video inputs, performing conversation and providing dense captions for spatial-temporal information. It outperforms GPT-4V, Claude 3.5 Sonnet and LLaVA-NeXT-Video-34B on Video-MME with/without subtitles.
- **Strong OCR Capability and Others.** MiniCPM-V 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behavior, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and supports multilingual capabilities in English, Chinese, German, French, Italian, Korean, and other languages.
- **Superior Efficiency.** In addition to its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, which is 75% fewer than most models. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.
- **Easy Usage.** MiniCPM-V 2.6 can be easily used in various ways: (1) llama.cpp and ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks, (5) quick local WebUI demo setup with Gradio, and (6) online web demo.
Evaluation
Click to view single image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, Object HalBench.
| Model | Size | Token Density+ | OpenCompass | MME | MMVet | OCRBench | MMMU val | MathVista mini | MMB1.1 test | AI2D | TextVQA val | DocVQA test | HallusionBench | Object HalBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GPT-4o | - | 1088 | 69.9 | 2328.7 | 69.1 | 736 | 69.2 | 61.3 | 82.2 | 84.6 | - | 92.8 | 55.0 | 17.6 |
| Claude 3.5 Sonnet | - | 750 | 67.9 | 1920.0 | 66.0 | 788 | 65.9 | 61.6 | 78.5 | 80.2 | - | 95.2 | 49.9 | 13.8 |
| Gemini 1.5 Pro | - | - | 64.4 | 2110.6 | 64.0 | 754 | 60.6 | 57.7 | 73.9 | 79.1 | 73.5 | 86.5 | 45.6 | - |
| GPT-4o mini | - | 1088 | 64.1 | 2003.4 | 66.9 | 785 | 60.0 | 52.4 | 76.0 | 77.8 | - | - | 46.1 | 12.4 |
| GPT-4V | - | 1088 | 63.5 | 2070.2 | 67.5 | 656 | 61.7 | 54.7 | 79.8 | 78.6 | 78.0 | 87.2 | 43.9 | 14.2 |
| Step-1V | - | - | 59.5 | 2206.4 | 63.3 | 625 | 49.9 | 44.8 | 78.0 | 79.2 | 71.6 | - | 48.4 | - |
| Qwen-VL-Max | - | 784 | 58.3 | 2281.7 | 61.8 | 684 | 52.0 | 43.4 | 74.6 | 75.7 | 79.5 | 93.1 | 41.2 | 13.4 |
| **Open-source** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| LLaVA-NeXT-Yi-34B | 34B | 157 | 55.0 | 2006.5 | 50.7 | 574 | 48.8 | 40.4 | 77.8 | 78.9 | 69.3 | - | 34.8 | 12.6 |
| Mini-Gemini-HD-34B | 34B | 157 | - | 2141.0 | 59.3 | 518 | 48.0 | 43.3 | - | 80.5 | 74.1 | 78.9 | - | - |
| Cambrian-34B | 34B | 1820 | 58.3 | 2049.9 | 53.2 | 591 | 50.4 | 50.3 | 77.8 | 79.5 | 76.7 | 75.5 | 41.6 | 14.7 |
| GLM-4V-9B | 13B | 784 | 59.1 | 2018.8 | 58.0 | 776 | 46.9 | 51.1 | 67.9 | 71.2 | - | - | 45.0 | - |
| InternVL2-8B | 8B | 706 | 64.1 | 2215.1 | 54.3 | 794 | 51.2 | 58.3 | 79.4 | 83.6 | 77.4 | 91.6 | 45.0 | 21.3 |
| MiniCPM-Llama3-V 2.5 | 8B | 1882 | 58.8 | 2024.6 | 52.8 | 725 | 45.8 | 54.3 | 72.0 | 78.4 | 76.6 | 84.8 | 42.4 | 10.3 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 2348.4* | 60.0 | 852* | 49.8* | 60.6 | 78.0 | 82.1 | 80.1 | 90.8 | 48.1* | 8.2 |
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
+ Token Density: number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
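As a quick worked example of this formula, using only the numbers quoted in the Superior Efficiency bullet above (a 1.8M-pixel maximum-resolution image encoded into 640 visual tokens):

```python
# Token density = (# pixels at maximum resolution) / (# visual tokens)
max_pixels = 1344 * 1344     # maximum supported resolution, about 1.8M pixels
visual_tokens = 640          # visual tokens MiniCPM-V 2.6 produces for such an image
token_density = max_pixels / visual_tokens
print(round(token_density))  # 2822, matching the MiniCPM-V 2.6 entry in the table above
```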
Click to view multi-image results on Mantis Eval, BLINK, Mathverse mv, Sciverse mv, MIRB.
| Model | Size | Mantis Eval | BLINK val | Mathverse mv | Sciverse mv | MIRB |
| --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** |  |  |  |  |  |  |
| GPT-4V | - | 62.7 | 54.6 | 60.3 | 66.9 | 53.1 |
| LLaVA-NeXT-Interleave-14B | 14B | 66.4 | 52.6 | 32.7 | 30.2 | - |
| **Open-source** |  |  |  |  |  |  |
| Emu2-Chat | 37B | 37.8 | 36.2 | - | 27.2 | - |
| CogVLM | 17B | 45.2 | 41.1 | - | - | - |
| VPG-C | 7B | 52.4 | 43.1 | 24.3 | 23.1 | - |
| VILA 8B | 8B | 51.2 | 39.3 | - | 36.5 | - |
| InternLM-XComposer-2.5 | 8B | 53.1* | 48.9 | 32.1* | - | 42.5 |
| InternVL2-8B | 8B | 59.0* | 50.9 | 30.5* | 34.4* | 56.9* |
| MiniCPM-V 2.6 | 8B | 69.1 | 53.0 | 84.9 | 74.9 | 53.8 |
* We evaluate the officially released checkpoint by ourselves.
Click to view video results on Video-MME and Video-ChatGPT.
| Model | Size | Video-MME (w/o subs) | Video-MME (w subs) | Video-ChatGPT Correctness | Video-ChatGPT Detail | Video-ChatGPT Context | Video-ChatGPT Temporal | Video-ChatGPT Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** |  |  |  |  |  |  |  |  |
| Claude 3.5 Sonnet | - | 60.0 | 62.9 | - | - | - | - | - |
| GPT-4V | - | 59.9 | 63.3 | - | - | - | - | - |
| **Open-source** |  |  |  |  |  |  |  |  |
| LLaVA-NeXT-7B | 7B | - | - | 3.39 | 3.29 | 3.92 | 2.60 | 3.12 |
| LLaVA-NeXT-34B | 34B | - | - | 3.29 | 3.23 | 3.83 | 2.51 | 3.47 |
| CogVLM2-Video | 12B | - | - | 3.49 | 3.46 | 3.23 | 2.98 | 3.64 |
| LongVA | 7B | 52.4 | 54.3 | 3.05 | 3.09 | 3.77 | 2.44 | 3.64 |
| InternVL2-8B | 8B | 54.0 | 56.9 | - | - | - | - | - |
| InternLM-XComposer-2.5 | 8B | 55.8 | - | - | - | - | - | - |
| LLaVA-NeXT-Video | 32B | 60.2 | 63.0 | 3.48 | 3.37 | 3.95 | 2.64 | 3.28 |
| MiniCPM-V 2.6 | 8B | 60.9 | 63.6 | 3.59 | 3.28 | 3.93 | 2.73 | 3.62 |
Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.
| Model | Size | Shot | TextVQA val | VizWiz test-dev | VQAv2 test-dev | OK-VQA val |
| --- | --- | --- | --- | --- | --- | --- |
| Flamingo | 80B | 0* | 35.0 | 31.6 | 56.3 | 40.6 |
|  |  | 4 | 36.5 | 39.6 | 63.1 | 57.4 |
|  |  | 8 | 37.3 | 44.8 | 65.6 | 57.5 |
| IDEFICS | 80B | 0* | 30.9 | 36.0 | 60.0 | 45.2 |
|  |  | 4 | 34.3 | 40.4 | 63.6 | 52.4 |
|  |  | 8 | 35.7 | 46.1 | 64.8 | 55.1 |
| OmniCorpus | 7B | 0* | 43.0 | 49.8 | 63.2 | 45.5 |
|  |  | 4 | 45.4 | 51.3 | 64.5 | 46.5 |
|  |  | 8 | 45.6 | 52.2 | 64.7 | 46.6 |
| Emu2 | 37B | 0 | 26.4 | 40.4 | 33.5 | 26.7 |
|  |  | 4 | 48.2 | 54.6 | 67.0 | 53.2 |
|  |  | 8 | 49.3 | 54.7 | 67.8 | 54.1 |
| MM1 | 30B | 0 | 26.2 | 40.4 | 48.9 | 26.7 |
|  |  | 8 | 49.3 | 54.7 | 70.9 | 54.1 |
| MiniCPM-V 2.6+ | 8B | 0 | 43.9 | 33.8 | 45.4 | 23.9 |
|  |  | 4 | 63.6 | 60.5 | 65.5 | 50.1 |
|  |  | 8 | 64.6 | 63.4 | 68.2 | 51.4 |
* denotes zero image shot and two additional text shots following Flamingo.
+ We evaluate the pretraining ckpt without SFT.
Examples
Click to view more cases.
We deploy MiniCPM-V 2.6 on end devices. The demo video is a raw screen recording on an iPad Pro, without editing.
MiniCPM-Llama3-V 2.5
Click to view more details of MiniCPM-Llama3-V 2.5
**MiniCPM-Llama3-V 2.5** is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.0. Notable features of MiniCPM-Llama3-V 2.5 include:
- **Leading Performance.**
MiniCPM-Llama3-V 2.5 has achieved an average score of 65.1 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks. **With only 8B parameters, it surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max** and greatly outperforms other Llama 3-based MLLMs.
- **Strong OCR Capabilities.**
MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a **700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro**. Based on recent user feedback, MiniCPM-Llama3-V 2.5 has now enhanced full-text OCR extraction, table-to-markdown conversion, and other high-utility capabilities, and has further strengthened its instruction-following and complex reasoning abilities, enhancing multimodal interaction experiences.
- **Trustworthy Behavior.**
Leveraging the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) method (the newest technique in the [RLHF-V](https://github.com/RLHF-V) [CVPR'24] series), MiniCPM-Llama3-V 2.5 exhibits more trustworthy behavior. It achieves a **10.3%** hallucination rate on Object HalBench, lower than GPT-4V-1106 (13.6%), achieving the best-level performance within the open-source community. [Data released](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset).
- **Multilingual Support.**
Thanks to the strong multilingual capabilities of Llama 3 and the cross-lingual generalization technique from [VisCPM](https://github.com/OpenBMB/VisCPM), MiniCPM-Llama3-V 2.5 extends its bilingual (Chinese-English) multimodal capabilities to **over 30 languages including German, French, Spanish, Italian, Korean etc.** [All Supported Languages](./assets/minicpm-llama-v-2-5_languages.md).
- **Efficient Deployment.**
MiniCPM-Llama3-V 2.5 systematically employs **model quantization, CPU optimizations, NPU optimizations and compilation optimizations**, achieving high-efficiency deployment on end-side devices. For mobile phones with Qualcomm chips, we have integrated the NPU acceleration framework QNN into llama.cpp for the first time. After systematic optimization, MiniCPM-Llama3-V 2.5 has realized a **150x acceleration in end-side MLLM image encoding** and a **3x speedup in language decoding**.
- **Easy Usage.**
MiniCPM-Llama3-V 2.5 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-v2.5/examples/minicpmv/README.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.5/examples/minicpm-v2.5) support for efficient CPU inference on local devices, (2) [GGUF](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5-gguf) format quantized models in 16 sizes, (3) efficient [LoRA](https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#lora-finetuning) fine-tuning with only 2 V100 GPUs, (4) [streaming output](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5#usage), (5) quick local WebUI demo setup with [Gradio](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_2.5.py) and [Streamlit](https://github.com/OpenBMB/MiniCPM-V/blob/main/web_demo_streamlit-2_5.py), and (6) interactive demos on [HuggingFace Spaces](https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5).
### Evaluation
Click to view results on TextVQA, DocVQA, OCRBench, OpenCompass, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, Object HalBench.
| Model | Size | OCRBench | TextVQA val | DocVQA test | OpenCompass | MME | MMB test (en) | MMB test (cn) | MMMU val | MathVista | LLaVA Bench | RealWorld QA | Object HalBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Proprietary** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Gemini Pro | - | 680 | 74.6 | 88.1 | 62.9 | 2148.9 | 73.6 | 74.3 | 48.9 | 45.8 | 79.9 | 60.4 | - |
| GPT-4V (2023.11.06) | - | 645 | 78.0 | 88.4 | 63.5 | 1771.5 | 77.0 | 74.4 | 53.8 | 47.8 | 93.1 | 63.0 | 86.4 |
| **Open-source** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mini-Gemini | 2.2B | - | 56.2 | 34.2* | - | 1653.0 | - | - | 31.7 | - | - | - | - |
| Qwen-VL-Chat | 9.6B | 488 | 61.5 | 62.6 | 51.6 | 1860.0 | 61.8 | 56.3 | 37.0 | 33.8 | 67.7 | 49.3 | 56.2 |
| DeepSeek-VL-7B | 7.3B | 435 | 64.7* | 47.0* | 54.6 | 1765.4 | 73.8 | 71.4 | 38.3 | 36.8 | 77.8 | 54.2 | - |
| Yi-VL-34B | 34B | 290 | 43.4* | 16.9* | 52.2 | 2050.2 | 72.4 | 70.7 | 45.1 | 30.7 | 62.3 | 54.8 | 79.3 |
| CogVLM-Chat | 17.4B | 590 | 70.4 | 33.3* | 54.2 | 1736.6 | 65.8 | 55.9 | 37.3 | 34.7 | 73.9 | 60.3 | 73.6 |
| TextMonkey | 9.7B | 558 | 64.3 | 66.7 | - | - | - | - | - | - | - | - | - |
| Idefics2 | 8.0B | - | 73.0 | 74.0 | 57.2 | 1847.6 | 75.7 | 68.6 | 45.2 | 52.2 | 49.1 | 60.7 | - |
| Bunny-LLama-3-8B | 8.4B | - | - | - | 54.3 | 1920.3 | 77.0 | 73.9 | 41.3 | 31.5 | 61.2 | 58.8 | - |
| LLaVA-NeXT Llama-3-8B | 8.4B | - | - | 78.2 | - | 1971.5 | - | - | 41.7 | 37.5 | 80.1 | 60.0 | - |
| Phi-3-vision-128k-instruct | 4.2B | 639* | 70.9 | - | - | 1537.5* | - | - | 40.4 | 44.5 | 64.2* | 58.8* | - |
| MiniCPM-V 1.0 | 2.8B | 366 | 60.6 | 38.2 | 47.5 | 1650.2 | 64.1 | 62.6 | 38.3 | 28.9 | 51.3 | 51.2 | 78.4 |
| MiniCPM-V 2.0 | 2.8B | 605 | 74.1 | 71.9 | 54.5 | 1808.6 | 69.1 | 66.5 | 38.2 | 38.7 | 69.2 | 55.8 | 85.5 |
| MiniCPM-Llama3-V 2.5 | 8.5B | 725 | 76.6 | 84.8 | 65.1 | 2024.6 | 77.2 | 74.2 | 45.8 | 54.3 | 86.7 | 63.5 | 89.7 |
* We evaluate the officially released checkpoint by ourselves.
Evaluation results of multilingual LLaVA Bench
### Examples
MiniCPM-V 2.0
Click to view more details of MiniCPM-V 2.0
**MiniCPM-V 2.0** is an efficient version with promising performance for deployment. The model is built on SigLip-400M and [MiniCPM-2.4B](https://github.com/OpenBMB/MiniCPM/), connected by a perceiver resampler. Our latest version, MiniCPM-V 2.0, has several notable features.
- **State-of-the-art Performance.**
MiniCPM-V 2.0 achieves **state-of-the-art performance** on multiple benchmarks (including OCRBench, TextVQA, MME, MMB, MathVista, etc) among models under 7B parameters. It even **outperforms strong Qwen-VL-Chat 9.6B, CogVLM-Chat 17.4B, and Yi-VL 34B on OpenCompass, a comprehensive evaluation over 11 popular benchmarks**. Notably, MiniCPM-V 2.0 shows **strong OCR capability**, achieving **comparable performance to Gemini Pro in scene-text understanding**, and **state-of-the-art performance on OCRBench** among open-source models.
- **Trustworthy Behavior.**
LMMs are known for suffering from hallucination, often generating text not factually grounded in images. MiniCPM-V 2.0 is **the first end-side LMM aligned via multimodal RLHF for trustworthy behavior** (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to **match GPT-4V in preventing hallucinations** on Object HalBench.
- **High-Resolution Images at Any Aspect Ratio.**
MiniCPM-V 2.0 can accept **1.8 million pixels (e.g., 1344x1344) images at any aspect ratio**. This enables better perception of fine-grained visual information such as small objects and optical characters, which is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703.pdf).
- **High Efficiency.**
MiniCPM-V 2.0 can be **efficiently deployed on most GPU cards and personal computers**, and **even on end devices such as mobile phones**. For visual encoding, we compress the image representations into much fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with **favorable memory cost and speed during inference even when dealing with high-resolution images**.
- **Bilingual Support.**
MiniCPM-V 2.0 **supports strong bilingual multimodal capabilities in both English and Chinese**. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
### Examples
We deploy MiniCPM-V 2.0 on end devices. The demo video is a raw screen recording on a Xiaomi 14 Pro, without editing.
Legacy Models
Chat with Our Demo on Gradio 🤗
We provide online and local demos powered by Hugging Face Gradio, the most popular model deployment framework. It supports streaming outputs, progress bars, queuing, alerts, and other useful features.
Online Demo
Click here to try out the online demo of MiniCPM-V 2.6 | MiniCPM-Llama3-V 2.5 | MiniCPM-V 2.0.
Local WebUI Demo
You can easily build your own local WebUI demo with Gradio using the following commands.
```shell
pip install -r requirements.txt

# For NVIDIA GPUs, run:
python web_demo_2.6.py --device cuda
```
Install
- Clone this repository and navigate to the source folder
```shell
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
```
- Create conda environment
```shell
conda create -n MiniCPM-V python=3.10 -y
conda activate MiniCPM-V
```
- Install dependencies
```shell
pip install -r requirements.txt
```
Inference
Model Zoo
| Model | Device | Memory | Description | Download |
| --- | --- | --- | --- | --- |
| MiniCPM-V 2.6 | GPU | 17 GB | The latest version, achieving state-of-the-art end-side performance for single image, multi-image and video understanding. | 🤗 |
| MiniCPM-V 2.6 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | 🤗 |
| MiniCPM-V 2.6 int4 | GPU | 7 GB | The int4 quantized version, with lower GPU memory usage. | 🤗 |
| MiniCPM-Llama3-V 2.5 | GPU | 19 GB | Strong end-side multimodal performance. | 🤗 |
| MiniCPM-Llama3-V 2.5 gguf | CPU | 6 GB | The gguf version, with lower memory usage and faster inference. | 🤗 |
| MiniCPM-Llama3-V 2.5 int4 | GPU | 8 GB | The int4 quantized version, with lower GPU memory usage. | 🤗 |
| MiniCPM-V 2.0 | GPU | 8 GB | Light version, balancing performance and computation cost. | 🤗 |
| MiniCPM-V 1.0 | GPU | 7 GB | Lightest version, achieving the fastest inference. | 🤗 |
Multi-turn Conversation
Please refer to the following code to run the model.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

torch.manual_seed(0)

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('./assets/airplane.jpeg').convert('RGB')

# First round chat
question = "Tell me the model of this aircraft."
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

# Second round chat: pass the history context of the multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": ["Introduce something about Airbus A380."]})

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
You will get the following output:
"The aircraft in the image is an Airbus A380, which can be identified by its large size, double-deck structure, and the distinctive shape of its wings and engines. The A380 is a wide-body aircraft known for being the world's largest passenger airliner, designed for long-haul flights. It has four engines, which are characteristic of large commercial aircraft. The registration number on the aircraft can also provide specific information about the model if looked up in an aviation database."
"The Airbus A380 is a double-deck, wide-body, four-engine jet airliner made by Airbus. It is the world's largest passenger airliner and is known for its long-haul capabilities. The aircraft was developed to improve efficiency and comfort for passengers traveling over long distances. It has two full-length passenger decks, which can accommodate more passengers than a typical single-aisle airplane. The A380 has been operated by airlines such as Lufthansa, Singapore Airlines, and Emirates, among others. It is widely recognized for its unique design and significant impact on the aviation industry."
Chat with multiple images
Click to view Python code running MiniCPM-V 2.6 with multiple images input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
In-context few-shot learning
Click to view Python code running MiniCPM-V 2.6 with few-shot input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
{'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
{'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
{'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer
)
print(answer)
```
Chat with video
Click to view Python code running MiniCPM-V 2.6 with video input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu # pip install decord
model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)
MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames
video_path="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
{'role': 'user', 'content': frames + [question]},
]
# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448
answer = model.chat(
image=None,
msgs=msgs,
tokenizer=tokenizer,
**params
)
print(answer)
```
Inference on Multiple GPUs
You can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs (12 GB or 16 GB) by distributing the model's layers across them. Please refer to this tutorial for detailed instructions on how to load the model and run inference using multiple low VRAM GPUs.
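For a quick impression of the idea, here is a minimal sketch using Hugging Face Accelerate's automatic device map; the `max_memory` caps are assumptions for two 12 GB cards, and the hand-tuned device map from the linked tutorial remains the authoritative approach (some modules, such as the vision encoder and embeddings, may need to stay on the same GPU).

```python
# A minimal sketch of splitting MiniCPM-Llama3-V 2.5 across two low VRAM GPUs.
# Assumes transformers + accelerate are installed; see the linked tutorial for details.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = 'openbmb/MiniCPM-Llama3-V-2_5'

model = AutoModel.from_pretrained(
    MODEL,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",                    # let Accelerate spread layers over the visible GPUs
    max_memory={0: "11GiB", 1: "11GiB"},  # assumed caps for two 12 GB cards
)
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model.eval()
# model.chat(...) can then be used exactly as in the single-GPU examples above.
```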
Inference on Mac
Click to view an example of running MiniCPM-Llama3-V 2.5 on a Mac with MPS (Apple silicon or AMD GPUs).
```python
# test.py: needs more than 16 GB of memory.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, low_cpu_mem_usage=True)
model = model.to(device='mps')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()
image = Image.open('./assets/hk_OCR.jpg').convert('RGB')
question = 'Where is this photo taken?'
msgs = [{'role': 'user', 'content': question}]
answer, context, _ = model.chat(
image=image,
msgs=msgs,
context=None,
tokenizer=tokenizer,
sampling=True
)
print(answer)
```
Run with command:
```shell
PYTORCH_ENABLE_MPS_FALLBACK=1 python test.py
```
Deployment on Mobile Phone
MiniCPM-V 2.0 can be deployed on mobile phones with Android operating systems. Click MiniCPM-V 2.0 to install the APK.
Inference with llama.cpp
MiniCPM-V 2.6 can run with llama.cpp now! See our fork of llama.cpp for more details. This implementation supports smooth inference at 16-18 tokens/s on iPad (test environment: iPad Pro with M4).
Inference with ollama
MiniCPM-V 2.6 can run with ollama now! See our fork of ollama for more details. This implementation supports smooth inference at 16-18 tokens/s on iPad (test environment: iPad Pro with M4).
Inference with vLLM
vLLM now officially supports MiniCPM-V 2.6, MiniCPM-Llama3-V 2.5 and MiniCPM-V 2.0. Click to see the details.
1. Install vLLM (>= 0.5.4):
```shell
pip install vllm
```
2. Install timm (optional; MiniCPM-V 2.0 needs timm):
```shell
pip install timm==0.9.10
```
3. Run the example (for images):
```python
from transformers import AutoTokenizer
from PIL import Image
from vllm import LLM, SamplingParams
MODEL_NAME = "openbmb/MiniCPM-V-2_6"
# Also available for previous models
# MODEL_NAME = "openbmb/MiniCPM-Llama3-V-2_5"
# MODEL_NAME = "HwwwH/MiniCPM-V-2"
image = Image.open("xxx.png").convert("RGB")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
model=MODEL_NAME,
trust_remote_code=True,
gpu_memory_utilization=1,
max_model_len=2048
)
messages = [{
"role":
"user",
"content":
# Number of images
"(<image>./</image>)" + \
"\nWhat is the content of this image?"
}]
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
# Single Inference
inputs = {
"prompt": prompt,
"multi_modal_data": {
"image": image
# Multi images: the number of images should equal the number of `(<image>./</image>)` placeholders
# "image": [image, image]
},
}
# Batch Inference
# inputs = [{
# "prompt": prompt,
# "multi_modal_data": {
# "image": image
# },
# } for _ in range(2)]
# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
sampling_params = SamplingParams(
stop_token_ids=stop_token_ids,
use_beam_search=True,
temperature=0,
best_of=3,
max_tokens=1024
)
outputs = llm.generate(inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
4. Click [here](https://modelbest.feishu.cn/wiki/C2BWw4ZP0iCDy7kkCPCcX2BHnOf?from=from_copylink) if you want to use it with *video*, or to get more details about `vLLM`.
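Besides the offline example above, vLLM also ships an OpenAI-compatible server. The sketch below is a hedged illustration only: it assumes your vLLM version serves MiniCPM-V 2.6 through that server and accepts OpenAI-style image messages, so treat the linked document as the authoritative setup.

```python
# Hedged sketch: query MiniCPM-V 2.6 through vLLM's OpenAI-compatible server.
# Start the server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model openbmb/MiniCPM-V-2_6 --trust-remote-code --max-model-len 2048
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local image as a data URL (plain http(s) image URLs also work).
with open("xxx.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "What is the content of this image?"},
        ],
    }],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```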
Fine-tuning
Simple Fine-tuning
We support simple fine-tuning with Hugging Face for MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5.
Reference Document
With the SWIFT Framework
We now support MiniCPM-V series fine-tuning with the SWIFT framework. SWIFT supports training, inference, evaluation and deployment of nearly 200 LLMs and MLLMs. It supports the lightweight training solutions provided by PEFT and a complete Adapters Library including techniques such as NEFTune, LoRA+ and LLaMA-PRO.
Best Practices: MiniCPM-V 1.0, MiniCPM-V 2.0, MiniCPM-V 2.6.
FAQs
Click here to view the FAQs
Model License
- This repository is released under the Apache-2.0 License.
- The usage of MiniCPM-V model weights must strictly follow the MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a questionnaire for registration, they are also available for free commercial use.
Statement
As LMMs, MiniCPM-V models (including OmniLMM) generate content by learning from a large amount of multimodal corpora, but they cannot comprehend, express personal opinions or make value judgements. Anything generated by MiniCPM-V models does not represent the views and positions of the model developers.
We will not be liable for any problems arising from the use of MiniCPM-V models, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misapplication of the models.
Institutions
This project is developed by the following institutions:
Star History
Key Techniques and Other Multimodal Projects
Welcome to explore key techniques of MiniCPM-V and other multimodal projects of our team:
VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V
Citation
If you find our model/code/paper helpful, please consider citing our papers and starring us!
```bibtex
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```