## 🎨 Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang and Hongsheng Li
[![Project Page](https://img.shields.io/badge/Project-Page-green.svg)](https://draw-and-understand.github.io/) [![arXiv Paper](https://img.shields.io/badge/arxiv-2403.20271-ECA8A7?logo=arxiv)](https://arxiv.org/abs/2403.20271) [![Static Badge](https://img.shields.io/badge/Demo-6B88E3?logo=youtubegaming&logoColor=DAE4EE)](http://106.14.2.150:10020/) [![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-blue.svg)](https://github.com/AFeng-x/Draw-and-Understand/blob/main/LICENSE)

[[🌐 Project Page](https://draw-and-understand.github.io/)] [[📖 Paper](https://arxiv.org/abs/2403.20271)] [[🤗 MDVP-Data](https://huggingface.co/datasets/Afeng-x/Draw-and-Understand/tree/main/stage_2_fine-tuning/MDVP-Data)] [[🤗 MDVP-Bench](https://huggingface.co/datasets/Afeng-x/Draw-and-Understand/tree/main/MDVP-bench)] [[🤖️ Model](https://huggingface.co/Afeng-x/SPHINX-V-Model)] [[🎮 Demo](http://106.14.2.150:10020/)]

## 💥 News

## 👀 Introduction

How humans interact with artificial intelligence (AI) is a crucial measure of the effectiveness of multimodal large language models (MLLMs). However, current MLLMs focus primarily on image-level comprehension and limit interaction to textual instructions, which constrains their flexibility of use and the depth of their responses. We therefore introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.


Specifically, the model, named SPHINX-V, is a new multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, which significantly enhances user flexibility and enables a fine-grained, open-world understanding of visual prompts.


## 🚀 Examples Show

### 🔍 Natural Image Domain

### 🔍 OCR Image Domain

### 🔍 Mobile/Website Screenshot Domain

### 🔍 Multi-panel Image Domain

## 🛠️ Install

1. Clone this repository and navigate to the Draw-and-Understand folder:

   ```bash
   git clone https://github.com/AFeng-x/Draw-and-Understand.git
   cd Draw-and-Understand
   ```

2. Install packages:

   ```bash
   # Create a new conda environment named 'sphinx-v' with Python 3.10
   conda create -n sphinx-v python=3.10 -y
   # Activate the 'sphinx-v' environment
   conda activate sphinx-v
   # Install required packages from 'requirements.txt'
   pip install -r requirements.txt
   ```

3. Optional: install Flash-Attention:

   ```bash
   # Draw-and-Understand is powered by flash-attention for efficient attention computation.
   pip install flash-attn --no-build-isolation
   ```

4. Install Draw-and-Understand as a Python package:

   ```bash
   # go to the root directory of Draw-and-Understand
   cd Draw-and-Understand
   # install Draw-and-Understand in editable mode
   pip install -e .
   ```

   After this, you can `import SPHINX_V` from any working directory (a quick sanity check is sketched after this list).

5. To enable the segmentation ability shown in our official demo, SAM is also needed:

   ```bash
   pip install git+https://github.com/facebookresearch/segment-anything.git
   ```
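
To confirm the environment is set up correctly, here is a minimal sanity-check sketch (it assumes PyTorch is installed via `requirements.txt`; `flash_attn` will only be present if you did the optional step 3):

```python
# Minimal post-install sanity check (sketch, not part of the repo).
# Verifies that CUDA is visible to PyTorch and that the optional flash-attn
# build and the SPHINX_V package from step 4 are importable.
import importlib.util

import torch

print("CUDA available:      ", torch.cuda.is_available())
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)
print("SPHINX_V importable: ", importlib.util.find_spec("SPHINX_V") is not None)
```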

## 🤖️ Checkpoints

SPHINX-V-13b Stage-1 Pre-training Weights: 🤗 Hugging Face / Baidu

SPHINX-V-13b Stage-2 Fine-tuning Weights: 🤗 Hugging Face / Baidu

Other required weights and configurations: 🤗 Hugging Face

Please download them to your own machine. The file structure should appear as follows (a scripted-download sketch is given after the tree):

```
accessory/checkpoints/sphinx-v/stage2
├── consolidated.00-of-02.model.pth
├── consolidated.01-of-02.model.pth
├── tokenizer.model
├── config.json
└── meta.json

accessory/checkpoints/llama-2-13b
├── params.json

accessory/checkpoints/tokenizer
├── tokenizer.model
```
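
If you prefer scripting the download, here is a minimal sketch using `huggingface_hub`. The repo id comes from the model link above; the `local_dir` value and the repo's internal layout are assumptions, so you may need to move files afterwards to match the tree:

```python
# Sketch: download the SPHINX-V weights from the Hugging Face Hub.
# local_dir is an assumed target; rearrange files if the downloaded layout
# does not match the expected tree above.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="Afeng-x/SPHINX-V-Model",
    local_dir="accessory/checkpoints/sphinx-v",
)
print("Downloaded to:", path)
```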

## 📁 MDVP-Dataset


## 🚀 Training

## 📈 Evaluation

See evaluation for details.

## 🛩️ Inference

We provide a simple inference example in `inference.py`.

You can launch this script with:

```bash
torchrun --master_port=1112 --nproc_per_node=1 inference.py
```
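
If you want to drive the same launch from Python (e.g., from a driver script), here is a minimal sketch using `subprocess`. It simply mirrors the `torchrun` command above and assumes `torchrun` is on your PATH and that you run it from the repository root where `inference.py` lives:

```python
# Sketch: launch the distributed inference script programmatically.
import subprocess

subprocess.run(
    [
        "torchrun",
        "--master_port=1112",
        "--nproc_per_node=1",
        "inference.py",
    ],
    check=True,  # raise if the script exits with a non-zero status
)
```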

## 🪁 Host Local Demo

💻 Requirements:

1. Prepare the SPHINX-V stage-2 checkpoints and the ViT-H SAM model, and place them in the `accessory/checkpoints/` directory (a quick check that the SAM checkpoint loads is sketched after this list).
2. Make sure you have installed Segment Anything.
3. Run:

   ```bash
   cd accessory/demos
   bash run.sh
   ```
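
To verify the SAM weights before starting the demo, here is a minimal sketch using the Segment Anything API. The checkpoint filename and its location under `accessory/checkpoints/` are assumptions; point the path at wherever you saved the ViT-H weights:

```python
# Sketch: check that the ViT-H SAM checkpoint loads and a predictor can be built.
import torch
from segment_anything import SamPredictor, sam_model_registry

checkpoint = "accessory/checkpoints/sam_vit_h_4b8939.pth"  # assumed location
sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
sam.to("cuda" if torch.cuda.is_available() else "cpu")
predictor = SamPredictor(sam)
print("SAM ViT-H loaded successfully.")
```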

## 💌 Acknowledgement

## 🖊️ Citation

If you find our Draw-and-Understand project useful for your research and applications, please kindly cite using this BibTeX:

```bibtex
@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want}, 
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```