# [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs

Project page: https://julianjuaner.github.io/projects/PromptHighlighter | MIT License

*(Logo)*

This is the official implementation of the CVPR2024 paper Prompt Highlighter: Interactive Control for Multi-Modal LLMs.

Control text generation by highlighting your prompt! Prompt Highlighter is a training-free inference pipeline that facilitates token-level user interactions for customized generation. Our method is compatible with both LLMs and VLMs.

*(Teaser figure)*

## Overview

## Milestones

## Quick Start

Basic environment setup:

```
conda create -n highlighter python=3.10 -y
conda activate highlighter
pip install -r requirements.txt
```

### LLaVA

Install the latest LLaVA (version 2023-11-30) in base_models. If you already have LLaVA installed, you can use the existing installation in your own environment.

```
# You may also use your existing LLaVA installation instead.
cd base_models
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

Model Download: Please refer to the LLaVAv1.5 Model Zoo to get the base pretrained model.

Partial Highlighting task: We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

```
python examples/llava_test.py
```

Descriptive task (highlighting all input contexts): We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

```
python examples/llava_descriptions.py
```

We will also provide a script for descriptive COCO caption generation (TODO).

If you want to add your own customized data, provide a square image in which the highlighted region is marked with darker pixels (uint8 value < 128), then add your case to the JSON file.
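As an illustration of this convention (not the repo's actual data loader; the helper name and default image size below are hypothetical), a darker-marked square image can be turned into a binary highlight mask like this:

```
import numpy as np
from PIL import Image

def load_highlight_mask(path, size=336):
    """Return a boolean mask that is True where the user darkened the image."""
    # Square, grayscale copy of the user-provided image; `size` is a placeholder.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = np.asarray(img, dtype=np.uint8)
    # Follow the convention above: darker pixels (value < 128) mark the highlight.
    return pixels < 128
```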

Benchmark Test: Please refer to the evaluation data instructions to get the benchmark datasets (MMBench & MME). Benchmark results:

| Method | MME-perception | MMBench-dev | MMBench-test |
| --- | --- | --- | --- |
| Baseline (LLaVAv1.5-13B) | 1531.3 | 67.7 | 67.0 |
| Ours (Official Reported) | 1552.5 | 69.7 | 69.5 |
| Ours (This Repo) | 1552.5 | 70.1 | 70.7 |

For MMBench, you may change the hyper-parameters in the following scripts and run:

```
bash examples/eval_scripts/mmbench_dev_hl.sh
bash examples/eval_scripts/mmbench_test_hl.sh
```

For MME:

```
bash examples/eval_scripts/mme_hl.sh
```

You can find the evaluation metrics at base_models/LLaVA/playground/data/eval/MME/eval_tool/answers/llava-v1.5-13b-hl-1.3-2.0-0.01/eval.log.

### Vicuna (LLaMA-based LLMs)

We provide a script to test partial highlighting on pure-language input. Download the Vicuna model; we use Vicuna-13B-v1.1. You may switch to any LLaMA-based LLM, in which case you will also need to change the conversation prompt template. Please follow the instructions above to install LLaVA in base_models. If you have already installed LLaVA, you can test directly with the script:

```
python examples/llama_test.py \
    --txt "Please write a summary of A Mid-Summer Nights' Dream, make it compact." \
    --hl "make it compact."
```

Here you can change your input prompt and highlighted segments by passing --txt and --hl, respectively. If you want to pass multiple highlighted segments, use `<s>` to separate them. For example, you can pass `--hl "write a summary<s>make it compact."` to highlight multiple requirements.
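Putting the two flags together, the multi-segment example above would be run as:

```
python examples/llama_test.py \
    --txt "Please write a summary of A Mid-Summer Nights' Dream, make it compact." \
    --hl "write a summary<s>make it compact."
```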

### InstructBLIP

Install the latest LAVIS (version 2023-11-30) in base_models. If you already have LAVIS installed, you can use the existing installation in your own environment.

To run InstructBLIP-Vicuna, you need to set the llm_model key in the configuration file base_models/LAVIS/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml to your LLM path (Vicuna-13B v1.1).
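For reference, the entry to edit looks roughly like this (a sketch only: the path is a placeholder for your local Vicuna-13B-v1.1 weights, and the file's other keys are omitted):

```
model:
  # ... other keys unchanged ...
  llm_model: "/path/to/vicuna-13b-v1.1"
```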

```
# Please install with your highlighter env activated.
cd base_models
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```

Partial Highlighting task: Run the examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

Note: here we only implement the highlighting mechanism in the QFormer. We may add a hybrid highlighting (visual & text tokens) version in the future.

```
python examples/instructblip_test.py
```

### InternLM-VLComposer

TBD.

## Method

*(Pipeline figure)*

An abstract pipeline of Prompt Highlighter. Users control the focus of generation by marking out specific image regions or text spans. A token-level mask $\mathbf{m}$ is then created to guide the language model's inference. Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on the highlighted tokens, demonstrating that autoregressive generation in these models can be guided in a classifier-free way. Notably, we find that guiding the models with highlighted tokens through the attention weights during inference leads to more desirable outputs. (A minimal sketch of this guidance step is provided at the end of this README.)

## Cite Prompt Highlighter

If you find this repo useful for your research, please consider citing the paper:

```
@inproceedings{zhang2024prompt,
  title={Prompt Highlighter: Interactive Control for Multi-Modal LLMs},
  author={Zhang, Yuechen and Qian, Shengju and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13215--13224},
  year={2024}
}
```

## Acknowledgement

We would like to thank the following repos for their great work:

- This work utilizes multi-modal LLMs with base models in [LLaVA](https://github.com/haotian-liu/LLaVA), [Vicuna](https://github.com/lm-sys/FastChat), [InstructBLIP](https://github.com/salesforce/LAVIS), and [InternLM-VLComposer](https://github.com/InternLM/InternLM-XComposer).
- This work utilizes the logit processor referenced in [CFG-LLM](https://github.com/huggingface/transformers/issues/24536).
- Part of the logo at the top of this page is generated with [Bing Image Creator](http://bing.com/images/create).
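For readers who want a concrete picture of the guidance step described in the Method section, here is a minimal, hypothetical sketch. It is not the repo's implementation: the Hugging Face-style model interface, the way the unconditional context is built (simply masking highlighted tokens out of attention), and the guidance weight gamma are all illustrative assumptions.

```
import torch

@torch.no_grad()
def guided_next_token_logits(model, input_ids, highlight_mask, gamma=2.0):
    """Combine predictions from the regular and highlight-suppressed contexts.

    input_ids:      (1, seq_len) token ids of the full prompt.
    highlight_mask: (1, seq_len) bool tensor, True where the user highlighted.
    gamma:          guidance strength; values > 1 push generation toward the
                    highlighted content.
    """
    # Regular pass: the full prompt, highlighted tokens included as usual.
    logits_cond = model(input_ids=input_ids).logits[:, -1, :]

    # "Unconditional" pass: suppress the highlighted tokens. Here we simply
    # mask them out of attention; the paper uses a more careful construction.
    attn_mask = (~highlight_mask).long()
    logits_uncond = model(input_ids=input_ids,
                          attention_mask=attn_mask).logits[:, -1, :]

    # Classifier-free combination of the two next-token predictions.
    return logits_uncond + gamma * (logits_cond - logits_uncond)
```

In the actual pipeline, the highlighted tokens additionally receive re-weighted attention scores during inference, as noted in the Method section.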