EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Add `llava` model for 🤗 Transformers #47

Closed lewtun closed 5 months ago

lewtun commented 5 months ago

This PR adds the modelling code needed to evaluate llava models in the transformers format: https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0

Example command to run:

accelerate launch --num_processes=8 -m lmms_eval --model llava_hf --model_args pretrained="llava-hf/llava-1.5-7b-hf" --tasks mme --batch_size 1 --output_path ./logs/ --log_samples

I will share some benchmark numbers shortly, but the code can be reviewed in any case :)
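For anyone trying these checkpoints directly, here is a minimal transformers-only sketch of loading and prompting one of them, independent of the lmms-eval wrapper (the sample image URL and generation settings are just placeholders):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# llava-1.5 chat format: the <image> placeholder marks where image features are inserted
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
image = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))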

kcz358 commented 5 months ago

Looks quite good to me. I think after adding the `device_map=auto` option and loglikelihood support, this should be ready to merge.

lewtun commented 5 months ago

> Looks quite good to me. I think after adding the `device_map=auto` option and loglikelihood support, this should be ready to merge.

I'm working on loglikelihood support - is there a way to test that it works as expected? If you can point me to a benchmark or command to run, that would be very helpful!

kcz358 commented 5 months ago

> Looks quite good to me. I think after adding the `device_map=auto` option and loglikelihood support, this should be ready to merge.

> I'm working on loglikelihood support - is there a way to test that it works as expected? If you can point me to a benchmark or command to run, that would be very helpful!

Hi, I think you can try `seedbench_ppl`, which uses the `multiple_choice` output type and appends the options one by one to the context to calculate the loglikelihood.
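To make the scoring concrete, here is a rough text-only sketch of that pattern with a placeholder model (the real llava_hf path would also condition on the image): each option is appended to the context, and the candidate whose continuation tokens receive the highest summed log-probability wins.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder; llava_hf would also feed image features to the model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

context = "Question: What is the capital of France?\nAnswer:"
options = [" Paris", " Berlin", " Madrid"]

scores = []
for option in options:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits[0]
    cont_len = full_ids.shape[1] - ctx_len
    # logits at position i predict token i+1, so drop the last position
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    target_ids = full_ids[0, -cont_len:]
    cont_logprob = log_probs[-cont_len:].gather(1, target_ids.unsqueeze(-1)).sum()
    scores.append(cont_logprob.item())

print(options[scores.index(max(scores))])  # option with the highest total loglikelihood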

Or you can use this yaml file, which is adapted from llava_in_the_wild:

dataset_path: lmms-lab/llava-bench-coco
dataset_kwargs:
  token: True
task: "llava_in_the_wild_ppl"
test_split: train
output_type: loglikelihood
doc_to_visual: !function utils.llava_doc_to_visual
doc_to_text: !function utils.llava_doc_to_text
doc_to_target: "gpt_answer"
metric_list:
  - metric: perplexity
    higher_is_better: true
metadata:
  version: 0.0
model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""

This will test the model's perplexity on a generation task.
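Assuming that yaml is saved as a registered task (named llava_in_the_wild_ppl above), it could presumably be run with the same style of command as before:

accelerate launch --num_processes=8 -m lmms_eval --model llava_hf --model_args pretrained="llava-hf/llava-1.5-7b-hf" --tasks llava_in_the_wild_ppl --batch_size 1 --output_path ./logs/ --log_samples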

lewtun commented 5 months ago

Hello @kcz358 @jzhang38, I've now tidied up the code and pushed support for the features discussed above (`device_map=auto` and loglikelihood).

I also ran the 7B model over several benchmarks to compare against the original llava implementation. On some we have good agreement, while on others there are significant differences. One possible reason is that the image processing differs across implementations (see here), and/or there are slight differences in how the inputs are formatted.

(Screenshot, 2024-04-09: benchmark comparison between the original llava implementation and llava_hf.)

Spreadsheet: https://docs.google.com/spreadsheets/d/1CbV-SOSVNl1S60Ns8B0-DhHBH5k5zPAm9M6XcpwFG5w/edit?usp=sharing

Do you have any ideas why e.g. mme can be so different, given that other benchmarks like mmbench and mmmu are quite similar?

For the loglikelihood benchmarks, here's the chat template that is being applied (inspired by the llava code):

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image> <image> <image> <image> <image> <image> <image> <image>
Please identify the sequence of actions in this video and record them in sequence. Answer :  ASSISTANT:  scoop sugar, pour milk, carry milk, reach cup, carry cup, reach cup</s>

Please let me know if this is not correct, e.g. should the EOS token be omitted?

Edit: I double checked the prompt template for the llava implementation of loglikelihood and spotted a bug in llava.py. Fixed in https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/47/commits/7c7b9699af057ca0d750e7cb1bc8e3731c22e852

kcz358 commented 5 months ago

Wow, your work is amazing @lewtun! Everything currently looks quite good to me, and thank you very much for spotting the loglikelihood issue for us.

For the mme disagreement, have you checked that the prompts are exactly the same for the hf version and the llava version?

Also, as you mentioned, a different image processing implementation will also affect the score. Based on some tests during our development, this can cause a significant shift in the score. I checked the eval scripts of llava, and it seems the image processing llava 1.5 uses on mme is to pad the image to a square:

https://github.com/haotian-liu/LLaVA/blob/4e2277a060da264c4f21b364c867cc622c945874/llava/mm_utils.py#L152-L163
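For reference, the padding logic at that link is roughly the following: the shorter side is padded so the image becomes a square filled with the image processor's mean color.

from PIL import Image

def expand2square(pil_img, background_color):
    # Pad the shorter side so the image becomes square, keeping the content centered
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result

# In llava 1.5's pipeline the fill color comes from the image processor's mean, e.g.:
# image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))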

Another factor that may affect the final score is the torch version you use. We provide a reproducible environment here that can exactly reproduce the mme score for llava. Whether or not you use flash attn may also affect the score slightly, but not by much, and can be ignored.

lewtun commented 5 months ago

> For the mme disagreement, have you checked that the prompts are exactly the same for the hf version and the llava version?

Yes, I've checked that they are exactly the same, which suggests image processing is the culprit.

> Another factor that may affect the final score is the torch version you use. We provide a reproducible environment here that can exactly reproduce the mme score for llava. Whether or not you use flash attn may also affect the score slightly, but not by much, and can be ignored.

Thanks, I am using torch==2.1.2, which produces an MME score of 1513.673 for llava, consistent with the paper. I know there are plans to enable the same padding logic for llava_hf models, so perhaps we can merge this as-is and revisit MME at a later date?

kcz358 commented 5 months ago

> Thanks, I am using torch==2.1.2, which produces an MME score of 1513.673 for llava, consistent with the paper. I know there are plans to enable the same padding logic for llava_hf models, so perhaps we can merge this as-is and revisit MME at a later date?

Yeah, I think this is okay for now, since the scores are similar for most of the benchmarks.

lewtun commented 5 months ago

> Yeah, I think this is okay for now, since the scores are similar for most of the benchmarks.

Great! Any chance we could merge this soon? We are working on VLM integration in trl and would like to point the community to lmms-eval for the release :)

kcz358 commented 5 months ago

> Great! Any chance we could merge this soon? We are working on VLM integration in trl and would like to point the community to lmms-eval for the release :)

Hi @Luodian, most parts of this PR LGTM. Do you think we can merge it now or wait until the next release? You might also want to review the changes and see whether anything needs to change.

Luodian commented 5 months ago

> Great! Any chance we could merge this soon? We are working on VLM integration in trl and would like to point the community to lmms-eval for the release :)

> Hi @Luodian, most parts of this PR LGTM. Do you think we can merge it now or wait until the next release? You might also want to review the changes and see whether anything needs to change.

Hi, I think it can be merged directly, but let me look over the changes, and after checking I will merge it~