
MobileVLM: Vision Language Model for Mobile Devices

[![hf_space](https://img.shields.io/badge/🤗-MTGV%20HuggingFace-blue.svg)](https://huggingface.co/mtgv) [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE) [![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/Meituan-AutoML/MobileVLM.git)[![github](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social)](https://github.com/Meituan-AutoML/MobileVLM.git)

📸 Release

🦙 Model Zoo

MobileVLM Family

| Model | LLM | GQA | SQA<sup>I</sup> | VQA<sup>T</sup> | POPE | MME<sup>P</sup> | MMB<sup>dev</sup> | Avg. |
|---|---|---|---|---|---|---|---|---|
| MobileVLM-1.7B | MobileLLaMA 1.4B | 56.1 | 57.3 | 41.5 | 84.5 | 1196.2 | 53.2 | 58.7 |
| MobileVLM V2 1.7B | MobileLLaMA 1.4B | 59.3 | 66.7 | 52.1 | 84.3 | 1302.8 | 57.7 | 64.2 |
| MobileVLM-3B | MobileLLaMA 2.7B | 59.0 | 61.2 | 47.5 | 84.9 | 1288.9 | 59.6 | 62.8 |
| MobileVLM V2 3B | MobileLLaMA 2.7B | 61.1 | 70.0 | 57.5 | 84.7 | 1440.5 | 63.2 | 68.1 |
| MobileVLM V2 7B | Vicuna-7B | 62.6 | 74.8 | 62.3 | 85.3 | 1560.7 | 69.2 | 72.1 |

MobileLLaMA Family

🔔 Usage and License Notices: This project uses certain datasets and checkpoints that are subject to their respective original licenses, and users must comply with all terms and conditions of those licenses. The project itself is licensed permissively under the Apache 2.0 license and imposes no additional constraints.

🛠️ Install

  1. Clone this repository and navigate to the MobileVLM folder

    git clone https://github.com/Meituan-AutoML/MobileVLM.git
    cd MobileVLM
  2. Install the required packages (a quick import check is sketched after this list)

    conda create -n mobilevlm python=3.10 -y
    conda activate mobilevlm
    pip install --upgrade pip
    pip install -r requirements.txt
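
A quick way to confirm the environment is usable (a minimal sketch; it only assumes that `torch` and `transformers` are pulled in by `requirements.txt`):

    # Hypothetical check, not part of the repository:
    # confirms that the core dependencies resolve and reports their versions.
    import torch
    import transformers

    print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
    print("transformers:", transformers.__version__)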

🗝️ Quick Start

Example for MobileLLaMA model inference

    import torch
    from transformers import LlamaTokenizer, LlamaForCausalLM

    model_path = 'mtgv/MobileLLaMA-1.4B-Chat'

    tokenizer = LlamaTokenizer.from_pretrained(model_path)
    model = LlamaForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map='auto',
    )

    prompt = 'Q: What is the largest animal?\nA:'
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    generation_output = model.generate(
        input_ids=input_ids, max_new_tokens=32
    )
    print(tokenizer.decode(generation_output[0]))
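
If you prefer to see tokens as they are produced rather than waiting for the full completion, the same checkpoint can be run with a streamer (a hedged variant using `transformers.TextStreamer`, not part of the original example):

    import torch
    from transformers import LlamaTokenizer, LlamaForCausalLM, TextStreamer

    model_path = 'mtgv/MobileLLaMA-1.4B-Chat'

    tokenizer = LlamaTokenizer.from_pretrained(model_path)
    model = LlamaForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map='auto',
    )

    prompt = 'Q: What is the largest animal?\nA:'
    # Place the prompt tokens on the same device as the model.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    # TextStreamer prints decoded tokens to stdout as soon as they are generated.
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    model.generate(input_ids=input_ids, max_new_tokens=32, streamer=streamer)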

Example for MobileVLM/MobileVLM V2 model inference

    from scripts.inference import inference_once
    # model_path = "mtgv/MobileVLM-1.7B" # MobileVLM
    model_path = "mtgv/MobileVLM_V2-1.7B" # MobileVLM V2
    image_file = "assets/samples/demo.jpg"
    prompt_str = "Who is the author of this book?\nAnswer the question using a single word or phrase."
    # (or) What is the title of this book?
    # (or) Is this book related to Education & Teaching?

    args = type('Args', (), {
        "model_path": model_path,
        "image_file": image_file,
        "prompt": prompt_str,
        "conv_mode": "v1",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512,
        "load_8bit": False,
        "load_4bit": False,
    })()

    inference_once(args)
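
Equivalently, the arguments can be packed into an `argparse.Namespace` instead of the ad-hoc class above (a sketch that assumes `inference_once` only reads the attributes listed here):

    from argparse import Namespace
    from scripts.inference import inference_once

    args = Namespace(
        model_path="mtgv/MobileVLM_V2-1.7B",
        image_file="assets/samples/demo.jpg",
        prompt="Who is the author of this book?\nAnswer the question using a single word or phrase.",
        conv_mode="v1",
        temperature=0,
        top_p=None,
        num_beams=1,
        max_new_tokens=512,
        load_8bit=False,
        load_4bit=False,
    )
    inference_once(args)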

🪜 Step-by-step Tutorial

MobileVLM V2

The training process of MobileVLM V2 is divided into two stages, pre-training and fine-tuning; the workflow below runs both with a single script.

1️⃣ Prepare MobileLLaMA checkpoints

As with MobileVLM, first download the MobileLLaMA chatbot checkpoints from the Hugging Face website (🤗 1.4B, 2.7B). Note that this step is optional and depends on your working environment: if you simply run the training script provided below, the checkpoints will be downloaded automatically by the `transformers` library.
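
If your training machines cannot reach the Hugging Face Hub at run time, the checkpoints can also be fetched ahead of time (a minimal sketch using `huggingface_hub`; the local paths are then passed to the training script as `LANGUAGE_MODEL` and `VISION_MODEL` below):

    from huggingface_hub import snapshot_download

    # Download the language model and the CLIP vision tower to local folders
    # so that training can run without network access.
    llm_path = snapshot_download("mtgv/MobileLLaMA-1.4B-Chat",
                                 local_dir="checkpoints/MobileLLaMA-1.4B-Chat")
    vit_path = snapshot_download("openai/clip-vit-large-patch14-336",
                                 local_dir="checkpoints/clip-vit-large-patch14-336")
    print(llm_path, vit_path)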

2️⃣ Prepare data

3️⃣ Run everything with one click!

    LANGUAGE_MODEL=/path/to/your/MobileLLaMA-1.4B-Chat  # or 2.7B
    VISION_MODEL=/path/to/your/clip-vit-large-patch14-336
    bash run.sh mobilevlm_v2_1.7b pretrain-finetune-test ${LANGUAGE_MODEL} ${VISION_MODEL}

    # (test-only) bash run.sh mobilevlm_v2_1.7b test /path/to/your/own/checkpoint
    # (3B)        bash run.sh mobilevlm_v2_3b pretrain-finetune-test ${LANGUAGE_MODEL} ${VISION_MODEL}

MobileLLaMA

The SFT (supervised fine-tuning) process of MobileLLaMA:

Note: You may skip the MobileLLaMA training process and start directly with MobileVLM, using our pre-trained MobileLLaMA models from the Hugging Face website (🤗 1.4B, 2.7B).

📲 Deployment on Mobile Devices

MobileVLM is now officially supported by llama.cpp. We look forward to more collaboration with open-source communities on deploying these models to mobile devices.

✏️ Reference

If you find MobileVLM or MobileLLaMA useful in your research or applications, please consider giving the project a star ⭐ and citing it with the following BibTeX:

    @article{chu2023mobilevlm,
      title={Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices},
      author={Chu, Xiangxiang and Qiao, Limeng and Lin, Xinyang and Xu, Shuang and Yang, Yang and Hu, Yiming and Wei, Fei and Zhang, Xinyu and Zhang, Bo and Wei, Xiaolin and others},
      journal={arXiv preprint arXiv:2312.16886},
      year={2023}
    }

    @article{chu2024mobilevlm,
      title={MobileVLM V2: Faster and Stronger Baseline for Vision Language Model},
      author={Chu, Xiangxiang and Qiao, Limeng and Zhang, Xinyu and Xu, Shuang and Wei, Fei and Yang, Yang and Sun, Xiaofei and Hu, Yiming and Lin, Xinyang and Zhang, Bo and others},
      journal={arXiv preprint arXiv:2402.03766},
      year={2024}
    }

🌟 Star History

Star History Chart