Closed: lewtun closed this 5 months ago
Looks quite good to me. I think after adding the `device_map=auto` option and loglikelihood support, this should be ready to merge.
I'm working on loglikelihood support - is there a way to test it works as expected? If you can point me to a benchmark or command to run, that would be very helpful!
Hi, I think you can try `seedbench_ppl`, which is a `multiple_choice` output type that appends the options one by one to the context and calculates the loglikelihood of each.
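The option-appending idea can be sketched as follows. This is a minimal illustration of scoring multiple-choice options by total log-likelihood, not the lmms-eval implementation; the helper names are invented for the example:

```python
def total_loglikelihood(token_logprobs):
    # Sum of per-token log-probabilities for one option's tokens,
    # conditioned on the shared context (already computed by the model).
    return sum(token_logprobs)

def pick_answer(option_logprobs):
    # option_logprobs: one list of token log-probs per candidate option.
    # The predicted answer is the option whose continuation has the
    # highest total log-likelihood when appended to the context.
    scores = [total_loglikelihood(lp) for lp in option_logprobs]
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy example: option 1 has the highest total log-likelihood (-1.1).
options = [[-2.3, -1.1], [-0.4, -0.7], [-3.0, -0.2]]
print(pick_answer(options))  # -> 1
```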
Or you can use this YAML file, which is adapted from `llava_in_the_wild`:
```yaml
dataset_path: lmms-lab/llava-bench-coco
dataset_kwargs:
  token: True
task: "llava_in_the_wild_ppl"
test_split: train
output_type: loglikelihood
doc_to_visual: !function utils.llava_doc_to_visual
doc_to_text: !function utils.llava_doc_to_text
doc_to_target: "gpt_answer"
metric_list:
  - metric: perplexity
    higher_is_better: true
metadata:
  version: 0.0
model_specific_prompt_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""
```
This will test the model's perplexity on a generation task.
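For reference, the perplexity metric used here is the exponential of the average negative token log-likelihood. A small sketch (an illustration of the formula, not lmms-eval code):

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(-mean per-token log-probability).
    # Lower is better: a perfectly confident model (logprob 0) scores 1.0.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model assigning probability e^-1 to every token has perplexity e.
print(round(perplexity([-1.0, -1.0, -1.0]), 4))  # -> 2.7183
```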
Hello @kcz358 @jzhang38, I've now tidied up the code and pushed support for loading llava models in the transformers format, so that users don't have to install the GitHub Llava repo.

I also ran the 7B model over several benchmarks to compare against the original llava implementation. In some we have good agreement, while in others there is a significant difference. One possible reason is that the image processing differs across implementations (see here) and/or some slight differences in how the inputs are formatted.

Spreadsheet: https://docs.google.com/spreadsheets/d/1CbV-SOSVNl1S60Ns8B0-DhHBH5k5zPAm9M6XcpwFG5w/edit?usp=sharing
Do you have some ideas about why e.g. `mme` can be so different, given that other benchmarks like `mmbench` and `mmmu` are quite similar?
For the loglikelihood benchmarks, here's the chat template that is being applied (inspired by the llava code):

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image> <image> <image> <image> <image> <image> <image> <image>
Please identify the sequence of actions in this video and record them in sequence. Answer : ASSISTANT: scoop sugar, pour milk, carry milk, reach cup, carry cup, reach cup</s>
```
Please let me know if this is not correct, e.g. should the EOS token be omitted?
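For concreteness, the string above is assembled roughly like this. This is only a sketch of the formatting convention; the function name and signature are invented for illustration and are not the actual implementation:

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_prompt(num_images, question, answer, eos="</s>"):
    # One <image> placeholder per video frame, then the question in the
    # USER turn, then the target answer in the ASSISTANT turn, terminated
    # by the EOS token (the point under discussion above).
    image_tokens = " ".join(["<image>"] * num_images)
    return f"{SYSTEM} USER: {image_tokens}\n{question} ASSISTANT: {answer}{eos}"

print(build_prompt(2, "What is shown?", "a cup"))
```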
Edit: I double-checked the prompt template for the llava implementation of loglikelihood and spotted a bug in `llava.py`. Fixed in https://github.com/EvolvingLMMs-Lab/lmms-eval/pull/47/commits/7c7b9699af057ca0d750e7cb1bc8e3731c22e852
Wow, your work is amazing @lewtun! Everything looks quite good to me, and thank you very much for spotting the loglikelihood issue for us.
For the `mme` disagreement, have you checked that the prompts are exactly the same for the hf version and the llava version?
Also, just as you mentioned, a different image processing implementation will also affect the score. Based on some of the tests in our development, this can cause a significant shift in the score. I checked the eval scripts of llava, and it seems the image processing implementation llava 1.5 uses on `mme` pads the image to a square.
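The pad-to-square preprocessing can be sketched as below. To keep the example dependency-free, it operates on a 2D list of pixel values rather than a PIL image; it illustrates the technique only and is not the evaluation code:

```python
def pad_to_square(img, fill=0):
    # img: 2D list of pixel values (height x width).
    # Pads the shorter dimension with `fill` so the output is square,
    # placing the original content in the middle, which preserves the
    # aspect ratio instead of stretching the image.
    h, w = len(img), len(img[0])
    size = max(h, w)
    out = [[fill] * size for _ in range(size)]
    top = (size - h) // 2
    left = (size - w) // 2
    for r in range(h):
        for c in range(w):
            out[top + r][left + c] = img[r][c]
    return out

# A 1x3 "image" becomes 3x3 with the original row centered.
print(pad_to_square([[1, 2, 3]]))  # -> [[0, 0, 0], [1, 2, 3], [0, 0, 0]]
```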
Another factor that may affect the final score is the torch version you use. We provide a reproduce environment here that can exactly reproduce the `mme` score on llava. Whether or not you use flash-attn may also affect the score slightly, but not by much, and can be ignored.
> For the `mme` disagreement, have you checked that the prompts are exactly the same for the hf version and the llava version?
Yes, I've checked they are exactly the same, which suggests image processing is the culprit.
> Another factor that may affect the final score is the torch version you use. We provide a reproduce environment here that can exactly reproduce the `mme` score on llava. Whether or not you use flash-attn may also affect the score slightly, but not by much, and can be ignored.
Thanks, I am using `torch==2.1.2`, which produces an MME score of 1513.673 for `llava`, which is compatible with the paper. I know there are plans to enable the same padding logic for `llava_hf` models, so perhaps we can merge this as-is and revisit MME at a future date?
Yeah, I think this is okay for now, since for most of the benchmarks the scores are similar.
Great! Any chance we could merge this soon? We are working on VLM integration in `trl` and would like to point the community to `lmms-eval` for the release :)
Hi @Luodian, most parts of this PR LGTM. Do you think we can merge it now, or wait until the next release? You might also want to review the changes and see whether there is anything that needs to change.
Hi, I think it can be merged directly, but let me look over the changes; after checking I will merge it~
This PR adds the modelling code needed to evaluate `llava` models in the `transformers` format: https://huggingface.co/collections/llava-hf/llava-15-65f762d5b6941db5c2ba07e0

Example command to run:
I will share some benchmark numbers shortly, but the code can be reviewed in any case :)