
EVE: Unveiling Encoder-Free Vision-Language Models

PyTorch implementation of the NeurIPS 2024 (Spotlight) paper Unveiling Encoder-Free Vision-Language Models.

πŸ“œ News

[2024/09/26] πŸ”₯πŸ”₯πŸ”₯ EVE has been accepted by NeurIPS 2024 as a Spotlight paper! πŸ’₯
[2024/07/01] We release the training code and EVE-7B weights! πŸš€
[2024/06/23] We release the evaluation code, EVE-7B-Pretrain, and EVE-7B-HD weights! πŸš€
[2024/06/18] The paper is released! πŸ’₯

πŸ’‘ Motivation

πŸ›Έ Architecture

πŸ’‘ Highlights

πŸ€– Model Zoo

Usage of the EVE checkpoints must comply with the base LLM's model license: Llama 2.

| Model | LLM | Weight | VQAv2 | GQA | VizWiz | SQA_I | TextVQA | POPE | MME_P | MMBench | SEED / SEED_I | MM_Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EVE_7B_Pretrain | Vicuna-7B | HF_link | -- | -- | -- | -- | -- | -- | -- | -- | -- / -- | -- |
| EVE_7B | Vicuna-7B | HF_link | 75.4 | 60.8 | 41.8 | 63.0 | 51.9 | 83.6 | 1217.3 | 49.5 | 54.3 / 61.3 | 25.6 |
| EVE_7B_HD | Vicuna-7B | HF_link | 78.6 | 62.6 | 51.1 | 64.9 | 56.8 | 85.0 | 1305.7 | 52.3 | 56.8 / 64.6 | 25.7 |
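
To fetch a released checkpoint locally, here is a minimal sketch using huggingface_hub; the repo id BAAI/EVE-7B-HD-v1.0 matches the path referenced in Quick Usage below, and the target directory is an arbitrary example:

# Minimal sketch: download an EVE checkpoint from the Hugging Face Hub.
# The repo id matches the path used in Quick Usage below; the target
# directory ("checkpoints/...") is just an example location.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="BAAI/EVE-7B-HD-v1.0",
    local_dir="checkpoints/EVE-7B-HD-v1.0",
)
print(f"Checkpoint downloaded to: {local_dir}")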

πŸ‘¨β€πŸ’» Todo List

Contents

Install

Environment

git clone https://github.com/baaivision/EVE.git
cd EVE
conda create -n eve_envs python=3.10 -y
conda activate eve_envs

pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
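
Once the environment is set up, a quick sanity check can catch missing dependencies early (a sketch; it only verifies that the core packages import and that a GPU is visible):

# Sanity-check the freshly created environment.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")

try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} installed")
except ImportError:
    print("flash-attn missing; re-run: pip install flash-attn --no-build-isolation")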

Preparation

Download the Vicuna model and place it under the lmsys/ directory:

Download the preprocessors and place them under the openai/ directory:

lmsys
└── vicuna-7b-v1.5
    β”œβ”€β”€ config.json
    └── ...
openai
β”œβ”€β”€ clip-vit-large-patch14-336
β”‚   β”œβ”€β”€ config.json
β”‚   └── ...
β”œβ”€β”€ eve-patch14-anypixel-672
β”‚   β”œβ”€β”€ preprocessor_config.json
β”‚   └── ...
└── eve-patch14-anypixel-1344
    β”œβ”€β”€ preprocessor_config.json
    └── ...
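
Before moving on, you can verify the layout with a small check (a sketch; the paths mirror the tree above):

# Verify that the base LLM and preprocessor files are in place.
from pathlib import Path

required = [
    "lmsys/vicuna-7b-v1.5/config.json",
    "openai/clip-vit-large-patch14-336/config.json",
    "openai/eve-patch14-anypixel-672/preprocessor_config.json",
    "openai/eve-patch14-anypixel-1344/preprocessor_config.json",
]
missing = [p for p in required if not Path(p).exists()]
assert not missing, f"Missing files: {missing}"
print("All required weights and preprocessor configs found.")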

Quick Usage

Example Code
from eve.model.builder import load_pretrained_model
from eve.mm_utils import get_model_name_from_path
from eve.eval.run_eve import eval_model

# Point this at a local copy of the checkpoint (see Model Zoo above).
model_path = "Absolute Path of BAAI/EVE-7B-HD-v1.0"

# Load the tokenizer, model, and image processor from the checkpoint.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

See the load_pretrained_model function in eve/model/builder.py for details.
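
To run a single query end to end, eval_model can be called with an argument namespace. The exact fields are defined in eve/eval/run_eve.py; the sketch below assumes a LLaVA-style interface, and the query and image path are placeholders:

# Illustrative sketch only: the exact argument set expected by eval_model
# is defined in eve/eval/run_eve.py; this assumes a LLaVA-style interface.
from types import SimpleNamespace

args = SimpleNamespace(
    model_path=model_path,                            # set above
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    query="Describe this image.",                     # placeholder prompt
    image_file="path/to/your/image.jpg",              # placeholder image
    conv_mode=None,
    temperature=0.0,                                  # greedy decoding (see Evaluation)
    top_p=None,
    num_beams=1,
    max_new_tokens=512,
)
eval_model(args)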

You can also use eve/eval/eval_one_sample.py to obtain an output with a single command; after downloading this repository, the script runs as-is (for example, on Colab).

# run script
CUDA_VISIBLE_DEVICES=0 python eve/eval/eval_one_sample.py

Demo

You can also build up your local demo using the following script:

# run script
python tools/app.py

Data

Follow the instructions in Data.md to prepare and manage the datasets. Currently, we provide direct download access to the web data; however, to avoid potential disputes, we plan to release URLs for these datasets rather than the raw data in the near future.

Train

(1) LLM-guided Pre-aligning Stage: we use only 16M of the 33M image-text pairs (EVE-cap16/33M) to train the patch embedding and aligning layers. This stage is essential for efficient training: it prevents model collapse and accelerates convergence throughout the entire process.

| Model | Epoch | Batch Size | Learning Rate | LR Schedule | Warmup Ratio | Max Length | Weight Decay | Optimizer | DeepSpeed |
|---|---|---|---|---|---|---|---|---|---|
| EVE_Prealign | 1 | 512 | 4e-4 | cosine decay | 0.03 | 2048 | 0 | AdamW | zero3 |

The training script for EVE_Prealign is as follows (${node_rank} is the index of the current node; ${master_addr} is the IP address of the rank-0 node):

bash scripts/eve/eve7b_prealign.sh ${node_rank} ${master_addr}

(2) Generative Pre-training Stage: we use all 33M image-text pairs (EVE-cap33M) to train the patch embedding and aligning layers together with the full LLM modules.

| Model | Epoch | Batch Size | Learning Rate | LR Schedule | Warmup Ratio | Max Length | Weight Decay | Optimizer | DeepSpeed |
|---|---|---|---|---|---|---|---|---|---|
| EVE_Pretrain | 1 | 512 | 4e-5 | cosine decay | 0.01 | 2048 | 0 | AdamW | zero3 |

The training script for EVE_Pretrain is as follows:

bash scripts/eve/eve7b_pretrain.sh ${node_rank} ${master_addr}

(3) Supervised Fine-tuning Stage: we fine-tune the entire architecture with LLaVA-mix-665K for EVE-7B, plus an extra 1.2M SFT conversation samples for EVE-7B (HD).

| Model | Epoch | Batch Size | Learning Rate | LR Schedule | Warmup Ratio | Max Length | Weight Decay | Optimizer | DeepSpeed |
|---|---|---|---|---|---|---|---|---|---|
| EVE_Finetune | 1 | 128 | 2e-5 | cosine decay | 0.01 | 2048 / 4096 | 0 | AdamW | zero3 |

The training scripts for EVE_7B and EVE_7B_HD are as follows:

bash scripts/eve/eve7b_finetune.sh ${node_rank} ${master_addr}
bash scripts/eve/eve7b_finetune_hd.sh ${node_rank} ${master_addr}

[NOTE]:
To train on fewer GPUs, reduce per_device_train_batch_size and increase gradient_accumulation_steps accordingly, always keeping the global batch size unchanged: per_device_train_batch_size x gradient_accumulation_steps x num_gpus. A worked example follows.
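
As a concrete sketch of that arithmetic (the global batch sizes come from the tables above; the GPU counts are examples):

# Keep the global batch size fixed when changing the GPU count:
# global = per_device_train_batch_size x gradient_accumulation_steps x num_gpus
def grad_accum_steps(per_device_bs: int, num_gpus: int, global_bs: int) -> int:
    steps, rem = divmod(global_bs, per_device_bs * num_gpus)
    assert rem == 0, "global batch must divide evenly"
    return steps

# Pre-training uses a global batch of 512: on 8 GPUs with per-device batch 8,
# accumulate over 8 steps (8 x 8 x 8 = 512).
print(grad_accum_steps(per_device_bs=8, num_gpus=8, global_bs=512))   # -> 8
# Fine-tuning uses a global batch of 128: on 4 GPUs with per-device batch 4,
# accumulate over 8 steps (4 x 8 x 4 = 128).
print(grad_accum_steps(per_device_bs=4, num_gpus=4, global_bs=128))   # -> 8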

Evaluation

To ensure reproducibility, we evaluate the models with greedy decoding rather than beam search, keeping inference consistent with the real-time outputs of the chat demo.

See Evaluation.md.
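
In Hugging Face terms, greedy decoding simply disables sampling and beam search. A minimal sketch, assuming model and tokenizer are loaded as in Quick Usage (a text-only prompt is used here purely for illustration):

# Greedy decoding: deterministic, no sampling, no beam search.
# Assumes `model` and `tokenizer` from the Quick Usage snippet above.
inputs = tokenizer("Describe the scene.", return_tensors="pt").to(model.device)
output_ids = model.generate(
    inputs.input_ids,
    do_sample=False,   # disable sampling
    num_beams=1,       # a single beam, i.e. greedy search
    max_new_tokens=128,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))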

❀️ Acknowledgments

βœ’οΈ Citation

If EVE is helpful for your research, please consider giving it a star ⭐ and a citation πŸ“:

@article{diao2024EVE,
  title={Unveiling Encoder-Free Vision-Language Models},
  author={Diao, Haiwen and Cui, Yufeng and Li, Xiaotong and Wang, Yueze and Lu, Huchuan and Wang, Xinlong},
  journal={arXiv preprint arXiv:2406.11832},
  year={2024}
}

πŸ“„ License

The content of this project is licensed under the terms in LICENSE.