EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Extremely Low Performance with LLaVA-1.5-7B on MM-Vet Benchmark #80

Open lambert-x opened 4 months ago

lambert-x commented 4 months ago

Hi, thanks for your great work. I am reproducing the evaluation results with the latest codebase and also the latest LLaVA codebase. The results on other benchmarks match or have only minor differences. However, the performance on MM-Vet is very low. Could you please check the MM-Vet evaluation on your side, or tell me what I should be careful of? Thank you!

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| mmvet | Yaml | none | 0 | gpt_eval_score | 1.3761 | ± N/A |

jeffhernandez1995 commented 4 months ago

I can also confirm that I get a low result using the command `lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/`. I get:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| mmvet | Yaml | none | 0 | gpt_eval_score | 25.1376 | ± N/A |

This is significantly lower than the 37 and 37.6 you get with the Mistral and Vicuna versions. The same also seems to happen with the LLaVA-W benchmark:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| llava_in_the_wild | Yaml | none | 0 | gpt_eval_llava_conv | 57.4 | ± N/A |
| | | none | 0 | gpt_eval_llava_detail | 80.8 | ± N/A |
| | | none | 0 | gpt_eval_llava_complex | 86.5 | ± N/A |
| | | none | 0 | gpt_eval_llava_all | 76.5 | ± N/A |

whereas the reported result here is 80.1.

kcz358 commented 4 months ago

Hi @jeffhernandez1995, I ran the same command you used, except with multiple processes.

Command

`accelerate launch --main_process_port 12345 --num_processes 8 -m lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,device_map=""" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/`

and here are the results I get

(screenshot of results)

Can you check whether, during result processing, GPT parses every prediction correctly?
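
A minimal sketch of that kind of check (not lmms-eval's own code), assuming the `--log_samples` output is a JSON file whose per-sample records carry a `gpt_eval_score` field; the path handling and field names are assumptions, so adjust them to whatever your log actually contains:

```python
# check_gpt_scores.py -- rough sanity check, not part of lmms-eval itself.
# Scans a --log_samples JSON file and flags samples whose GPT grading looks
# like it failed (score missing, empty, or zero). The "logs"/"gpt_eval_score"
# field names are assumptions; adjust to the actual structure of your log.
import json
import sys


def find_failed_gpt_evals(log_path):
    with open(log_path, "r") as f:
        data = json.load(f)
    # Some logs wrap the per-sample list under a key; fall back to the raw list.
    samples = data.get("logs", data) if isinstance(data, dict) else data
    failed = []
    for idx, sample in enumerate(samples):
        score = sample.get("gpt_eval_score")  # assumed field name
        if score in (None, "", 0, 0.0):
            failed.append(idx)
    return failed


if __name__ == "__main__":
    bad = find_failed_gpt_evals(sys.argv[1])
    print(f"{len(bad)} samples look unscored: {bad[:20]}")
```

If a large fraction of samples show up as unscored, the near-zero aggregate is almost certainly a judging/postprocessing problem rather than a model problem.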

For llava in the wild, I have to say that gpt-4 may sometimes give different scores. Here are the original result logs we obtained when reporting mmvet and llava in the wild for llama3_llava:

llava_in_the_wild.json mmvet.json results.json

kcz358 commented 4 months ago

Hi @lambert-x, can you also check whether the results are being processed by GPT correctly? I can't reproduce this using llava-1.5-7b, also with the newest LLaVA-NeXT repo.

Here are the results I get:

(screenshot of results)
jeffhernandez1995 commented 4 months ago

I'll check with multiprocessing and the original LLaVA results and let you know.

kcz358 commented 4 months ago

I don't think the main cause is the multiprocessing. More likely, some errors occurred during postprocessing, such as failing to get a GPT response.
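
As a rough illustration (not the actual postprocess code), a judge request without retries can silently turn a transient API error into a zero score; the kind of guard worth checking for looks like this:

```python
# Illustrative only: retry a GPT judge request a few times with backoff
# instead of silently falling back to a 0 score. `send_request` stands in
# for whatever function actually calls the judge API in your setup.
import time


def query_judge_with_retries(send_request, max_retries=5, base_delay=2.0):
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception as err:  # e.g. rate limit, timeout, connection error
            wait = base_delay * (2 ** attempt)
            print(f"judge request failed ({err!r}); retrying in {wait:.0f}s")
            time.sleep(wait)
    # Surface the failure so the sample is marked as ungraded rather than
    # quietly scored as 0.
    raise RuntimeError("GPT judge request failed after all retries")
```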

jeffhernandez1995 commented 3 months ago

Sorry, my bad. For some reason I changed the evaluator to gpt-4-turbo instead of the default one and must have forgotten about it. After reverting the change, the scores are normal. Thank you for the great work you have put into this library!
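
For anyone who hits the same thing: a quick way to confirm which judge model a task is configured to use is to print the task YAML's metadata. A rough sketch, assuming the config sits at `lmms_eval/tasks/mmvet/mmvet.yaml` and keeps the judge name under a `metadata` block (check your checkout for the exact path and key names):

```python
# Rough sketch: print the metadata block of the mmvet task config to see
# which GPT judge it points at. The path and key names are assumptions;
# check the task YAML in your own checkout.
import yaml


class LenientLoader(yaml.SafeLoader):
    """SafeLoader that tolerates custom tags such as !function in task YAMLs."""


def _ignore_unknown(loader, tag_suffix, node):
    return None


LenientLoader.add_multi_constructor("!", _ignore_unknown)

with open("lmms_eval/tasks/mmvet/mmvet.yaml") as f:
    cfg = yaml.load(f, Loader=LenientLoader)

print(cfg.get("metadata", {}))  # should name the default judge, not gpt-4-turbo
```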