EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Extremely Low Performance with LLaVA-1.5-7B on MM-Vet Benchmark #80

Open lambert-x opened 4 months ago

lambert-x commented 4 months ago

Hi, thanks for your great work. I am reproducing the evaluation results with the latest codebase and also the latest LLaVA codebase. The results on other benchmarks match or have only minor differences. However, the performance on MM-Vet is very low. Could you please check the MM-Vet evaluation on your side, or tell me what I should be careful of? Thank you!

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| mmvet | Yaml | none | 0 | gpt_eval_score | 1.3761 | ± N/A |

jeffhernandez1995 commented 4 months ago

I can also confirm that I get a low result using the command `lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/`. I get:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| mmvet | Yaml | none | 0 | gpt_eval_score | 25.1376 | ± N/A |

This is significantly lower than the 37 and 37.6 you get with the Mistral and Vicuna versions. The same also seems to happen with the LLaVA-W benchmark:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| llava_in_the_wild | Yaml | none | 0 | gpt_eval_llava_conv | 57.4 | ± N/A |
| | | none | 0 | gpt_eval_llava_detail | 80.8 | ± N/A |
| | | none | 0 | gpt_eval_llava_complex | 86.5 | ± N/A |
| | | none | 0 | gpt_eval_llava_all | 76.5 | ± N/A |

whereas the reported result here is 80.1.

kcz358 commented 4 months ago

Hi @jeffhernandez1995, I ran the same command you used, except with multiple processes.

Command

`accelerate launch --main_process_port 12345 --num_processes 8 -m lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,device_map=""" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/`

and here are the results I get

(screenshot of results)

Can you check whether, during result processing, GPT parses every prediction correctly?
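
A minimal sketch of that kind of check (not lmms-eval's own code), assuming the `--log_samples` output is a JSON file whose per-sample records carry a `gpt_eval_score` field; the path handling and field names are assumptions, so adjust them to whatever your log actually contains:

```python
# check_gpt_scores.py -- rough sanity check, not part of lmms-eval itself.
# Scans a --log_samples JSON file and flags samples whose GPT grading looks
# like it failed (score missing, empty, or zero). The "logs"/"gpt_eval_score"
# field names are assumptions; adjust to the actual structure of your log.
import json
import sys


def find_failed_gpt_evals(log_path):
    with open(log_path, "r") as f:
        data = json.load(f)
    # Some logs wrap the per-sample list under a key; fall back to the raw list.
    samples = data.get("logs", data) if isinstance(data, dict) else data
    failed = []
    for idx, sample in enumerate(samples):
        score = sample.get("gpt_eval_score")  # assumed field name
        if score in (None, "", 0, 0.0):
            failed.append(idx)
    return failed


if __name__ == "__main__":
    bad = find_failed_gpt_evals(sys.argv[1])
    print(f"{len(bad)} samples look unscored: {bad[:20]}")
```

If a large fraction of samples show up as unscored, the near-zero aggregate is almost certainly a judging/postprocessing problem rather than a model problem.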

For llava in the wild, I have to say that gpt-4 may sometimes give different scores. Here are the original result logs we obtained when reporting mmvet and llava in the wild for llama3_llava:

llava_in_the_wild.json mmvet.json results.json

kcz358 commented 4 months ago

Hi @lambert-x, can you also check whether the results are being processed by GPT correctly? I can't reproduce this using llava-1.5-7b, also with the newest LLaVA-NeXT repo.

Here are the results I get:

(screenshot of results)
jeffhernandez1995 commented 4 months ago

I'll check with multiprocessing and the original LLaVA results and let you know.

kcz358 commented 4 months ago

I don't think the main cause is the multiprocessing. More likely, some errors occurred during postprocessing, such as failing to get a GPT response.
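
As a rough illustration (not the actual postprocess code), a judge request without retries can silently turn a transient API error into a zero score; the kind of guard worth checking for looks like this:

```python
# Illustrative only: retry a GPT judge request a few times with backoff
# instead of silently falling back to a 0 score. `send_request` stands in
# for whatever function actually calls the judge API in your setup.
import time


def query_judge_with_retries(send_request, max_retries=5, base_delay=2.0):
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception as err:  # e.g. rate limit, timeout, connection error
            wait = base_delay * (2 ** attempt)
            print(f"judge request failed ({err!r}); retrying in {wait:.0f}s")
            time.sleep(wait)
    # Surface the failure so the sample is marked as ungraded rather than
    # quietly scored as 0.
    raise RuntimeError("GPT judge request failed after all retries")
```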

jeffhernandez1995 commented 3 months ago

Sorry, my bad. For some reason I changed the evaluator to gpt-4-turbo instead of the default one and must have forgotten about it. After reverting the change, the scores are normal. Thank you for the great work you have put into this library!
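
For anyone who hits the same thing: a quick way to confirm which judge model a task is configured to use is to print the task YAML's metadata. A rough sketch, assuming the config sits at `lmms_eval/tasks/mmvet/mmvet.yaml` and keeps the judge name under a `metadata` block (check your checkout for the exact path and key names):

```python
# Rough sketch: print the metadata block of the mmvet task config to see
# which GPT judge it points at. The path and key names are assumptions;
# check the task YAML in your own checkout.
import yaml


class LenientLoader(yaml.SafeLoader):
    """SafeLoader that tolerates custom tags such as !function in task YAMLs."""


def _ignore_unknown(loader, tag_suffix, node):
    return None


LenientLoader.add_multi_constructor("!", _ignore_unknown)

with open("lmms_eval/tasks/mmvet/mmvet.yaml") as f:
    cfg = yaml.load(f, Loader=LenientLoader)

print(cfg.get("metadata", {}))  # should name the default judge, not gpt-4-turbo
```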