Accelerating the development of large multimodal models (LMMs) with lmms-eval
LMMs-Lab Homepage | Blog | Documentation | Huggingface Datasets | discord/lmms-eval
[2024-06] lmms-eval/v0.2 has been upgraded to support video evaluations for video models like LLaVA-NeXT Video and Gemini 1.5 Pro across tasks such as EgoSchema, PerceptionTest, VideoMME, and more. Please refer to the blog for more details.
[2024-03] We have released the first version of lmms-eval. Please refer to the blog for more details.
Why lmms-eval?
In today's world, we're on an exciting journey toward creating Artificial General Intelligence (AGI), much like the enthusiasm of the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.
To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks are tools that help us understand the capabilities of these models, showing us how close we are to achieving AGI.
However, finding and using these benchmarks is a big challenge. The necessary benchmarks and datasets are spread out and hidden in various places like Google Drive, Dropbox, and different school and research lab websites. It feels like we're on a treasure hunt, but the maps are scattered everywhere.
In the field of language models, lm-evaluation-harness has set a valuable precedent. It offers integrated data and model interfaces, enables rapid evaluation of language models, serves as the backend framework for the open-llm-leaderboard, and has gradually become the underlying ecosystem of the foundation-model era.
We humbly absorbed the exquisite and efficient design of lm-evaluation-harness and introduce lmms-eval, an evaluation framework meticulously crafted for consistent and efficient evaluation of LMMs.
For formal usage, you can install the package from PyPI by running the following command:
pip install lmms-eval
For development, you can install the package by cloning the repository and running the following command:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
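After either install path, a quick import check can confirm that the package is visible to the interpreter you plan to use. This is a minimal sketch: the module name `lmms_eval` is taken from the `-m lmms_eval` invocations in this README, and the function name `check_lmms_eval` is a hypothetical helper introduced only for illustration.

```shell
# Hypothetical helper: report whether the lmms_eval module can be imported
# by the python3 found on PATH. It never fails; it only prints a status line.
check_lmms_eval() {
  if python3 -c "import lmms_eval" >/dev/null 2>&1; then
    echo "lmms_eval is importable"
  else
    echo "lmms_eval is not importable in this environment"
  fi
}

check_lmms_eval
```

If the second message appears, the package was probably installed into a different Python environment than the one `python3` resolves to.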
If you want to test LLaVA, you will have to clone their repo from LLaVA and install it:
# for llava 1.5
# git clone https://github.com/haotian-liu/LLaVA
# cd LLaVA
# pip install -e .
# for llava-next (1.6)
git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT
pip install -e .
If you want to test on caption datasets such as coco, refcoco, and nocaps, you will need java==1.8.0 for the pycocoeval API to work. If you don't have it, you can install it with conda:
conda install openjdk=8
You can then check your Java version by running java -version.
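The Java check can also be wrapped so that it does not error out on machines where Java is missing. The sketch below is an illustration only; `check_java` is a hypothetical helper name. A Java 8 runtime typically reports a version string beginning with 1.8.0.

```shell
# Hypothetical helper: print the first line of `java -version` if Java is
# on PATH, otherwise print the conda install hint from this README.
check_java() {
  if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so merge it into stdout first.
    java -version 2>&1 | head -n 1
  else
    echo "java not found; try: conda install openjdk=8"
  fi
}

check_java
```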
Development will continue on the main branch. We encourage you to give us feedback on which features you would like and how to improve the library further, or to ask questions, in either issues or PRs on GitHub.
Evaluation of LLaVA on MME
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-7b" \
--tasks mme \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme \
--output_path ./logs/
Evaluation of LLaVA on multiple datasets
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.5-7b" \
--tasks mme,mmbench_en \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme_mmbenchen \
--output_path ./logs/
For other variants of LLaVA, please change the conv_template in the model_args. conv_template is an argument of the init function of llava in lmms_eval/models/llava.py; you can find the corresponding value in LLaVA's code, probably in the dict variable conv_templates in llava/conversations.py.
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
--tasks mme,mmbench_en \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme_mmbenchen \
--output_path ./logs/
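Because `--model_args` is a single comma-separated string of `key=value` pairs, it can be convenient to assemble it from shell variables when switching between checkpoints and conversation templates. The sketch below reuses the values from the command above; the variable names `PRETRAINED`, `CONV_TEMPLATE`, and `MODEL_ARGS` are hypothetical, and the command is only printed with `echo` rather than executed.

```shell
# Hypothetical variable names; the values come from the example above.
PRETRAINED="liuhaotian/llava-v1.6-mistral-7b"
CONV_TEMPLATE="mistral_instruct"

# Join the key=value pairs with commas, as --model_args expects.
MODEL_ARGS="pretrained=${PRETRAINED},conv_template=${CONV_TEMPLATE}"

# Print the command instead of running it, so the sketch works without GPUs.
echo python3 -m accelerate.commands.launch \
  --num_processes=8 \
  -m lmms_eval \
  --model llava \
  --model_args "${MODEL_ARGS}" \
  --tasks mme,mmbench_en \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```

Dropping the `echo` runs the evaluation; changing the two variables is then enough to target another LLaVA variant.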
Evaluation of larger lmms (llava-v1.6-34b)
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model llava \
--model_args pretrained="liuhaotian/llava-v1.6-34b,conv_template=mistral_direct" \
--tasks mme,mmbench_en \
--batch_size 1 \
--log_samples \
--log_samples_suffix llava_v1.5_mme_mmbenchen \
--output_path ./logs/
Evaluation with a set of configurations, supporting evaluation of multiple models and datasets
python3 -m accelerate.commands.launch --num_processes=8 -m lmms_eval --config ./miscs/example_eval.yaml
Evaluation with naive model sharding for a bigger model (llava-next-72b)
python3 -m lmms_eval \
--model=llava \
--model_args=pretrained=lmms-lab/llava-next-72b,conv_template=qwen_1_5,device_map=auto,model_name=llava_qwen \
--tasks=pope,vizwiz_vqa_val,scienceqa_img \
--batch_size=1 \
--log_samples \
--log_samples_suffix=llava_qwen \
--output_path="./logs/" \
--wandb_args=project=lmms-eval,job_type=eval,entity=llava-vl
Evaluation with SGLang for a bigger model (llava-next-72b)
python3 -m lmms_eval \
--model=llava_sglang \
--model_args=pretrained=lmms-lab/llava-next-72b,tokenizer=lmms-lab/llavanext-qwen-tokenizer,conv_template=chatml-llava,tp_size=8,parallel=8 \
--tasks=mme \
--batch_size=1 \
--log_samples \
--log_samples_suffix=llava_qwen \
--output_path=./logs/ \
--verbosity=INFO
Please check supported models for more details.
Please check supported tasks for more details.
Please refer to our documentation.
lmms_eval is a fork of lm-eval-harness. We recommend reading through the lm-eval-harness docs for relevant information.
Below are the changes we made to the original API:
During the initial stage of our project, we thank:
From v0.1 to v0.2, we thank the community for its support through pull requests (PRs):
Details are in lmms-eval/v0.2.0 release notes
Datasets:
Models:
@misc{lmms_eval2024,
title={LMMs-Eval: Accelerating the Development of Large Multimodal Models},
url={https://github.com/EvolvingLMMs-Lab/lmms-eval},
author={Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li and Ziwei Liu},
publisher = {Zenodo},
version = {v0.1.0},
month={March},
year={2024}
}