[2024.03.28] 🔥 We released the MDVP-Data dataset and MDVP-Bench benchmark.
[2024.03.28] 🔥 We released the SPHINX-V-13B model and online demo.
[2024.03.28] 📄 We released the arXiv paper.
[2024.03.28] 💻 We released the training and evaluation code.
How humans interact with artificial intelligence (AI) is a crucial measure of the effectiveness of multimodal large language models (MLLMs). However, current MLLMs primarily focus on image-level comprehension and limit interaction to textual instructions, which constrains their flexibility of use and depth of response. Therefore, we introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, the model, named SPHINX-V, is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.
git clone https://github.com/AFeng-x/Draw-and-Understand.git
cd Draw-and-Understand
# Create a new conda environment named 'sphinx-v' with Python 3.10
conda create -n sphinx-v python=3.10 -y
# Activate the 'sphinx-v' environment
conda activate sphinx-v
# Install required packages from 'requirements.txt'
pip install -r requirements.txt
# Draw-and-Understand is powered by flash-attention for efficient attention computation.
pip install flash-attn --no-build-isolation
# go to the root directory of Draw-and-Understand
cd Draw-and-Understand
# install Draw-and-Understand
pip install -e .
# After this, you will be able to invoke "import SPHINX_V" regardless of the working directory.
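# Install the Segment Anything (SAM) package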
pip install git+https://github.com/facebookresearch/segment-anything.git
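As an optional sanity check, the short script below (a sketch; run it after the installation steps above) confirms that the installed packages can be imported from any directory.

# Optional sanity check: verifies that SPHINX_V, flash-attn, and
# segment-anything are importable after the installation steps above.
import importlib

for pkg in ("SPHINX_V", "flash_attn", "segment_anything"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: not found ({err})")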
SPHINX-V-13B Stage-1 Pre-training Weights: 🤗Hugging Face / Baidu
SPHINX-V-13B Stage-2 Fine-tuning Weights: 🤗Hugging Face / Baidu
Other required weights and configurations: 🤗Hugging Face
Please download them to your own machine. The file structure should appear as follows:
accessory/checkpoints/sphinx-v/stage2
├── consolidated.00-of-02.model.pth
├── consolidated.01-of-02.model.pth
├── tokenizer.model
├── config.json
└── meta.json
accessory/checkpoints/llama-2-13b
└── params.json
accessory/checkpoints/tokenizer
└── tokenizer.model
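If you want to double-check the layout, a small script such as the following (a sketch, assuming it is run from the repository root) will report any file that is still missing.

# Sketch: verify that the checkpoint files listed above are in place.
# Assumes the script is run from the repository root.
from pathlib import Path

expected_files = [
    "accessory/checkpoints/sphinx-v/stage2/consolidated.00-of-02.model.pth",
    "accessory/checkpoints/sphinx-v/stage2/consolidated.01-of-02.model.pth",
    "accessory/checkpoints/sphinx-v/stage2/tokenizer.model",
    "accessory/checkpoints/sphinx-v/stage2/config.json",
    "accessory/checkpoints/sphinx-v/stage2/meta.json",
    "accessory/checkpoints/llama-2-13b/params.json",
    "accessory/checkpoints/tokenizer/tokenizer.model",
]

missing = [f for f in expected_files if not Path(f).is_file()]
if missing:
    print("Missing files:\n  " + "\n  ".join(missing))
else:
    print("All expected checkpoint files are present.")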
MDVP-Data is a comprehensive dataset for multi-domain visual-prompt instruction tuning. This dataset encompasses data for both point-level and region-level understanding, designed to enhance a model's comprehension ability and robustness.
Based on MDVP-Data, we also introduce MDVP-Bench, a challenging benchmark designed to evaluate tasks that require a combination of detailed description referrals, inter-relationship analysis, and complex reasoning.
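For intuition only, a visual-prompt instruction record pairs an image with one or more visual prompts (e.g., points or boxes) and a dialogue about them; the field names below are purely illustrative and are not the actual MDVP-Data schema.

# Illustrative only: the actual MDVP-Data record format may differ.
example_record = {
    "image": "images/000123.jpg",
    "visual_prompts": [
        {"type": "point", "xy": [320, 180]},           # point-level prompt
        {"type": "box", "xyxy": [50, 80, 220, 300]},   # region-level prompt
    ],
    "conversation": [
        {"from": "human", "value": "What is at the first prompt, and how does it relate to the second?"},
        {"from": "gpt", "value": "..."},
    ],
}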
Prepare data
Stage 1: Image-Visual Prompt-Text Alignment Pre-training
bash scripts/train_sphinx-v_pretrain_stage1.sh
Stage 2: Multi-Task End-to-End Supervised Fine-tuning
bash scripts/train_sphinx-v_finetune_stage2.sh
See evaluation for details.
We provide a simple example of inference in inference.py.
You can launch this script with:
torchrun --master_port=1112 --nproc_per_node=1 inference.py
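If you would rather call the model from your own script, the rough shape of the call is sketched below; the class and method names are placeholders (not the real API), so please follow inference.py for the actual loading and generation code.

# Sketch only: the names below are placeholders, not the real API.
# See inference.py in this repository for the actual usage.
from SPHINX_V import SPHINXVModel  # hypothetical entry point

model = SPHINXVModel.from_pretrained("accessory/checkpoints/sphinx-v/stage2")  # hypothetical loader
answer = model.generate(
    image="examples/demo.jpg",                                      # input image path
    visual_prompts=[{"type": "box", "xyxy": [50, 80, 220, 300]}],   # one box prompt (format assumed)
    prompt="Describe the object in the highlighted region.",
)
print(answer)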
💻 Requirements: the downloaded model weights placed in the accessory/checkpoints/ directory (see the structure above).
cd accessory/demos
bash run.sh
If you find our Draw-and-Understand project useful for your research and applications, please kindly cite using this BibTeX:
@misc{lin2024drawandunderstand,
title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
year={2024},
eprint={2403.20271},
archivePrefix={arXiv},
primaryClass={cs.CV}
}