bronyayang / Law_of_Vision_Representation_in_MLLMs

Official implementation of the Law of Vision Representation in MLLMs

Law of Vision Representation in MLLMs

arXiv / HuggingFace / More Thoughts (Blog in English) / More Thoughts (Blog in Chinese)

Visualization of the law



Clone This Repository

git clone
cd Law_of_Vision_Representation_in_MLLMs

Train LLaVA with Custom Vision Representation

1. Install the LLaVA Environment: Ensure that the environment is compatible with your custom vision module.

conda create -n ac_llava python=3.10 -y
conda activate ac_llava
pip install --upgrade pip

This training environment has been tested on CUDA 12.2 and is compatible with all the encoders mentioned in the paper, except for OpenCLIP (refer to environment record for details on OpenCLIP compatibility).

To run the SD3 vision representation, install the diffusers package from the repository:

cd diffusers
pip install -e .

Important Note:

To accommodate diffusion model encoders, this environment includes the diffusers, xformers, and transformers packages. However, these packages may conflict with each other. It is strongly advised to modify pyproject.toml and install only the packages required for your custom vision encoder, rather than all 10 encoders simultaneously.

2. Stage 1 Training

Prepare LLaVA Stage 1 Data: Follow the instructions in LLaVA's tutorial to prepare the data for Stage 1 training.

Start Training: Use the following command to start training:

bash llava/scripts/v1_5/train/

However, before running the command, ensure that you modify the following parameters in the script:

Available Vision Towers:

Note: To combine features from multiple vision towers, join their names with a dot (.). For example: openai/clip-vit-large-patch14.facebook/dinov2-large
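To illustrate the naming convention above, a combined specification can be split back into individual tower names. This is a minimal sketch, not the repository's own parsing code: `split_vision_towers` is a hypothetical helper, and it assumes that no individual tower name contains a dot.

```python
def split_vision_towers(spec: str) -> list[str]:
    """Split a dot-separated vision tower specification into tower names.

    Hypothetical helper for illustration; assumes no tower name contains a dot.
    """
    towers = [name for name in spec.split(".") if name]
    if not towers:
        raise ValueError("empty vision tower specification")
    return towers


combined = "openai/clip-vit-large-patch14.facebook/dinov2-large"
print(split_vision_towers(combined))
```

A single tower name without a dot passes through unchanged as a one-element list.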

3. Stage 2 Training

Prepare LLaVA Stage 2 Data: Follow the instructions in LLaVA's tutorial to prepare the data for Stage 2 training.

Start Training: Use the following command to start training:

bash llava/scripts/v1_5/train/

However, before running the command, ensure that you modify the following parameters in the script:

Pretrained Weights

If you prefer to use the same vision representations that we tested in our paper, we have released pretrained weights on Hugging Face for your convenience, so you can skip the training steps above and proceed directly to the next sections.


We use lmms-eval to evaluate the benchmark performance for MLLMs on various vision representations and to extract features from benchmark images for calculating the A score.

1. Install the lmms-eval Environment

cd llava/eval/lmms-eval
pip install -e .

2. Evaluate

To evaluate the model, use the following command:

accelerate launch --num_processes=8 -m lmms_eval --model llava   --model_args pretrained="path-to-stage-2-checkpoint"   --tasks task1 --batch_size 1 --log_samples --log_samples_suffix llava_custom_task1 --output_path ./logs/

For more information, refer to the original lmms-eval repository or the README in this repository.

3. Visual Embedding Extraction from Benchmark Data

Our A score calculation requires visual embeddings extracted from benchmark data. This process also requires the stage 1 checkpoint to be loaded. The following method is suggested as a starting point and is not intended to encourage hardcoding:

1. Set the Random Seed: Uncomment lines 38-51 in lmms-eval/lmms_eval/models/

2. Enable Stage 1 Loading: Uncomment lines 105 and 111 in lmms-eval/lmms_eval/models/

3. Save the Visual Embeddings: Uncomment line 476 in llava/model/

4. Run the Eval Command:

Once everything is set up, run the evaluation command:

accelerate launch --num_processes=8 -m lmms_eval --model llava   --model_args pretrained="path-to-stage-1-checkpoint"   --tasks task1 --batch_size 1 --log_samples --log_samples_suffix llava_custom_task1 --output_path ./logs/

To extract different vision representations across various benchmarks, refer to this script.
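If you would rather drive the extraction from Python, the command above can be assembled once per benchmark. The sketch below is not the script referenced above; `build_eval_command` is a hypothetical helper that reuses only the flags from the command shown in this section, and the task names are placeholders.

```python
import shlex


def build_eval_command(checkpoint: str, task: str, num_processes: int = 8) -> list[str]:
    """Assemble the lmms-eval command shown above for a single task."""
    return [
        "accelerate", "launch", f"--num_processes={num_processes}",
        "-m", "lmms_eval",
        "--model", "llava",
        "--model_args", f"pretrained={checkpoint}",
        "--tasks", task,
        "--batch_size", "1",
        "--log_samples",
        "--log_samples_suffix", f"llava_custom_{task}",
        "--output_path", "./logs/",
    ]


# Placeholder task names; replace them with real lmms-eval task names.
for task in ["task1", "task2"]:
    command = build_eval_command("path-to-stage-1-checkpoint", task)
    print(shlex.join(command))  # pass `command` to subprocess.run to execute it
```

Building the command as a list keeps each argument separate, which avoids shell quoting problems if a checkpoint path contains spaces.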

AC Compute

1. Install the Environment for Computing AC Score

The environment setup is adapted from Telling Left from Right. If you encounter any issues, refer to the original repository and their issue tracker.

conda create -n ac_score python=3.9
conda activate ac_score
conda install pytorch=1.13.1 torchvision=0.14.1 pytorch-cuda=11.6 -c pytorch -c nvidia
conda install -c "nvidia/label/cuda-11.6.1" libcusolver-dev
cd C_score
pip install -e .

A Score

Prepare Vision Embeddings: Ensure that you have the vision embeddings for CLIP@224, CLIP@336, and your target vision embeddings stored in the path /any/path/benchmark. You can find more details on how to extract these embeddings here.

Change Base Folder and Target Vision Representation Settings: Modify the base_folder variable on line 7 in A_score/ to point to the folder where you saved the vision embeddings. Also, update the subfolders variable to reflect the subfolder names that correspond to the vision representations for which you want to compute the A score.
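Before running the computation, it may help to confirm that every configured subfolder actually exists. The following is a minimal sketch under assumptions: `missing_subfolders` is a hypothetical helper that is not part of the repository, and the subfolder names are made up for illustration. Only `base_folder` and `subfolders` mirror the variables named above.

```python
from pathlib import Path


def missing_subfolders(base_folder: str, subfolders: list[str]) -> list[str]:
    """Return the entries of `subfolders` that are not directories under `base_folder`."""
    base = Path(base_folder)
    return [name for name in subfolders if not (base / name).is_dir()]


# Example values only; the subfolder names are hypothetical.
base_folder = "/any/path/benchmark"
subfolders = ["clip224", "clip336", "my_encoder"]
print(missing_subfolders(base_folder, subfolders))
```

An empty list means every listed vision representation has a folder under `base_folder`.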

Run the A Score Computation:

cd A_score

The A score will be printed to the console. Optionally, you can save the output to a CSV file for use in the AC policy section.
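One way to keep the printed scores is to write them to a CSV file yourself. This is a sketch under assumptions rather than the repository's code: `save_a_scores` and the column names are hypothetical, and the AC policy section may expect a different layout.

```python
import csv


def save_a_scores(rows: list[tuple[str, float]], csv_path: str) -> None:
    """Write (vision representation, A score) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["vision_representation", "a_score"])
        writer.writerows(rows)


# Placeholder value for illustration only, not a result from the paper.
save_a_scores([("my_encoder", 0.0)], "a_scores.csv")
```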

C Score

Prepare Vision Features on SPair-71k: First, download the SPair-71k dataset:

cd C_score
bash data/

Next, modify the input_path and output_path variables starting at line 16 in C_score/

Additionally, modify the feature variable at line 23 to specify the vision representation you want to extract.

Run the following command to extract the features:


Run the C Score Computation: Once the features are extracted, you can compute the C score with the following command:

python --config configs/eval_zero_shot_spair.yaml

The results will be logged.

If you wish to run feature combination, use the script and the configs/eval_zero_shot_spair_two.yaml configuration file, which concatenates features along the channel dimension.
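To make "concatenates features along the channel dimension" concrete, here is a small, dependency-free sketch. It is an illustration and not the repository's implementation: `concat_channels` is a hypothetical helper, and features are modelled as plain lists (one list of channel values per spatial location) instead of tensors.

```python
def concat_channels(feat_a: list[list[float]], feat_b: list[list[float]]) -> list[list[float]]:
    """Concatenate two per-location feature lists along the channel dimension.

    Each input is a list of locations, and each location is a list of channel values.
    Both inputs must cover the same number of locations.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("features must have the same number of spatial locations")
    return [a + b for a, b in zip(feat_a, feat_b)]


# Two locations: 2 channels from one encoder, 3 from another -> 5 channels each.
combined = concat_channels([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]])
print(len(combined), len(combined[0]))
```

The number of locations stays the same while the channel count becomes the sum of the two inputs, which is why both feature maps must share the same spatial layout.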

AC Policy

Under Reconstruction...


I aim to provide and maintain this repository in an easy-to-use form for everyone. However, please note that I am the sole maintainer of this codebase and have limited bandwidth. I lost access to compute clusters and GPUs before I could clean up the code, so some parts of the tutorial, such as environment setup and feature extraction, may be hardcoded or less than ideal, and the overall structure could be improved.

Make sure to reproduce the AC scores in the Appendix before you compute your own, and report any issues on GitHub. I would greatly appreciate any pull requests (PRs) to help enhance this repository. Your contributions are highly valued! Many thanks! ☺️


If you find this project useful, please cite our work:

  title={Law of Vision Representation in MLLMs},
  author={Yang, Shijia and Zhai, Bohan and You, Quanzeng and Yuan, Jianbo and Yang, Hongxia and Xu, Chenfeng},
  journal={arXiv preprint arXiv:2408.16357},
