
# FlexAttention for Efficient High-Resolution Vision-Language Models

[Project Page] [Paper]

## Overview


This repository contains the official code for FlexAttention for Efficient High-Resolution Vision-Language Models.

## News

## Installation

```bash
conda create -n flexattention python=3.9
conda activate flexattention
pip install -e .
pip install -e ".[train]"
pip install -e ./transformers
```
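
A quick, hedged sanity check that the environment is wired up (assuming the package installs under the name `llava`, as in the upstream LLaVA codebase this project builds on):

```bash
# Hypothetical check: confirm the editable install and the bundled transformers fork are importable.
python -c "import llava; import transformers; print(transformers.__version__)"
```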

## Checkpoint

You can download our 7B model checkpoint from [huggingface]() and put it into the `checkpoints` folder.
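
For example, with the `huggingface_hub` CLI (the repository ID below is a placeholder; substitute the one from the link above):

```bash
# Hypothetical: download the released checkpoint into checkpoints/.
# Replace <org>/<repo> with the actual Hugging Face repo ID linked above.
huggingface-cli download <org>/<repo> --local-dir checkpoints/llava-v1.5-7b-flexattn
```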

## Evaluation

### TextVQA

1. Follow this instruction to download the TextVQA evaluation images and annotations, and extract them to `datasets/eval/textvqa`.

2. Run the multi-GPU inference:

   ```bash
   torchrun --nproc_per_node 3 scripts/evaluation/eval_textvqa.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn
   ```

   It will generate a file similar to `answer_textvqa_llava-v1.5-7b-flexattn_xxx.jsonl` in the repository root (see the convenience sketch after this list).

3. Run the evaluation script:

   ```bash
   bash scripts/evaluation/get_textvqa_score.sh ANSWER_FILE
   ```
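
A minimal sketch that chains steps 2 and 3 by picking the newest answer file (assumes the default output naming shown above):

```bash
# Score the most recently generated TextVQA answer file.
ANSWER_FILE=$(ls -t answer_textvqa_llava-v1.5-7b-flexattn_*.jsonl | head -n 1)
bash scripts/evaluation/get_textvqa_score.sh "$ANSWER_FILE"
```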

### V* Bench

1. Download the dataset from huggingface:

   ```bash
   git lfs install
   git clone https://huggingface.co/datasets/craigwu/vstar_bench
   ```

2. Run the multi-GPU inference on each subset (a loop over both subsets is sketched after this list):

   ```bash
   # Attribute
   torchrun --nproc_per_node 3 scripts/evaluation/eval_vbench.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn --subset direct_attributes

   # Spatial
   torchrun --nproc_per_node 3 scripts/evaluation/eval_vbench.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn --subset relative_position
   ```
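
To run both subsets back to back, a small wrapper using the same flags as above:

```bash
# Evaluate both V* Bench subsets with the same model and settings.
for SUBSET in direct_attributes relative_position; do
  torchrun --nproc_per_node 3 scripts/evaluation/eval_vbench.py --dist \
    --model-path checkpoints/llava-v1.5-7b-flexattn \
    --id llava-v1.5-7b-flexattn --subset "$SUBSET"
done
```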


### MagnifierBench

1. Download the dataset from [here](https://drive.google.com/file/d/1DE5PBkhHMdVNOpDg6GtfzO73ZFrK9ltZ/view?usp=sharing), and extract it to `datasets/eval/` (a command-line download sketch follows this list).

2. Run the multi-GPU inference:

   ```bash
   torchrun --nproc_per_node 3 scripts/evaluation/eval_magnifier.py --dist --model-path checkpoints/llava-v1.5-7b-flexattn --id llava-v1.5-7b-flexattn
   ```
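
If you prefer the shell, a hedged sketch using `gdown` (not part of this repo's requirements; the file ID comes from the Google Drive link above, and the archive is assumed to be a zip):

```bash
pip install gdown
# Fetch the MagnifierBench archive by its Google Drive file ID and extract it.
gdown 1DE5PBkhHMdVNOpDg6GtfzO73ZFrK9ltZ -O magnifierbench.zip
unzip magnifierbench.zip -d datasets/eval/
```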

## Training

Coming soon.

## Acknowledgement

LLaVA: the codebase our project builds on. Thanks for their amazing code and model.

## Citation

If our work is useful or relevant to your research, please kindly recognize our contributions by citing our paper:

```bibtex
@misc{li2024flexattention,
      title={FlexAttention for Efficient High-Resolution Vision-Language Models},
      author={Junyan Li and Delin Chen and Tianle Cai and Peihao Chen and Yining Hong and Zhenfang Chen and Yikang Shen and Chuang Gan},
      year={2024},
      eprint={2407.20228},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.20228},
}
```