lambert-x opened this issue 6 months ago
I can also confirm that I get a low result using the command:
lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/
I get:
| Tasks | Version | Filter | n-shot | Metric         |   Value |   | Stderr |
|-------|---------|--------|--------|----------------|--------:|---|--------|
| mmvet | Yaml    | none   | 0      | gpt_eval_score | 25.1376 | ± | N/A    |
This is significantly lower than the 37 and 37.6 you get with the Mistral and Vicuna versions. The same also seems to happen with the LLaVA-W benchmark:
| Tasks             | Version | Filter | n-shot | Metric                 | Value |   | Stderr |
|--------------------|---------|--------|--------|------------------------|------:|---|--------|
| llava_in_the_wild  | Yaml    | none   | 0      | gpt_eval_llava_conv    |  57.4 | ± | N/A    |
|                    |         | none   | 0      | gpt_eval_llava_detail  |  80.8 | ± | N/A    |
|                    |         | none   | 0      | gpt_eval_llava_complex |  86.5 | ± | N/A    |
|                    |         | none   | 0      | gpt_eval_llava_all     |  76.5 | ± | N/A    |
whereas the result reported here is 80.1.
Hi @jeffhernandez1995, I ran the same command you used, except that I used multiple processes:
accelerate launch --main_process_port 12345 --num_processes 8 -m lmms_eval --model llava --model_args pretrained="lmms-lab/llama3-llava-next-8b,conv_template=llava_llama_3,device_map=""" --tasks mmvet --batch_size 1 --log_samples --log_samples_suffix llava_next.mmvet --output_path ./logs/
and here are the results I get.
Can you check whether, when processing the results, GPT parses every prediction correctly?
For LLaVA-in-the-Wild, I have to say that GPT-4 may sometimes give different scores. Here are the original result logs we obtained when reporting MM-Vet and LLaVA-in-the-Wild for llama3_llava.
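One quick way to check for parse failures could be to scan the `--log_samples` output for entries the judge never scored. Below is a minimal sketch, assuming the logged samples form a JSON list and that the per-sample score/review fields are named roughly as shown; the path and field names are assumptions, so adjust them to your actual log file:

```python
# Sketch: scan an lmms_eval --log_samples JSON for samples where the GPT judge
# appears to have returned nothing usable (empty review or missing/zero score).
# The file path and the field names ("gpt_eval_score", "review") are assumptions,
# not the library's documented schema.
import json

LOG_PATH = "./logs/llava_next.mmvet_samples.json"  # hypothetical path

with open(LOG_PATH) as f:
    samples = json.load(f)

suspicious = []
for i, sample in enumerate(samples):
    score = sample.get("gpt_eval_score")   # assumed per-sample score field
    review = sample.get("review", "")      # assumed raw judge response field
    if score in (None, 0, 0.0) or not str(review).strip():
        suspicious.append(i)

print(f"{len(suspicious)} / {len(samples)} samples look unscored or unparsed")
print("first few indices:", suspicious[:20])
```

If a noticeable fraction of samples show up as unscored, the averaged benchmark number will be dragged down even though the model predictions themselves are fine.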
Hi @lambert-x, can you also check whether you are processing the results with GPT correctly? I can't reproduce this with llava-1.5-7b, also using the newest LLaVA-NeXT repo.
Here is the result I get.
I'll check with multiprocessing and the original LLaVA results and let you know.
I don't think the main cause is the multiprocessing. Possibly some errors occur during postprocessing, such as failing to get a GPT response.
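If failed judge calls are the problem, retrying with backoff during postprocessing usually helps. Here is a generic sketch of that idea, not lmms_eval's actual postprocessing code; the endpoint, default model name, and helper function are assumptions:

```python
# Minimal retry sketch for the "failed to get gpt response" case: wrap the judge
# request in retries with exponential backoff and return an empty string on
# persistent failure so the caller can flag the sample instead of crashing.
import os
import time
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

def judge_with_retries(prompt: str, model: str = "gpt-4", max_retries: int = 5) -> str:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    for attempt in range(max_retries):
        try:
            resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except (requests.RequestException, KeyError) as err:
            wait = 2 ** attempt
            print(f"judge call failed ({err!r}); retrying in {wait}s")
            time.sleep(wait)
    # Empty string lets downstream code score the sample as 0 and report it,
    # rather than silently dropping it from the average.
    return ""
```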
Sorry, my bad. For some reason I had changed the evaluator to gpt-4-turbo instead of the default one and must have forgotten. After reverting the change, the scores are normal. Thank you for the great work you have put into this library!
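For anyone hitting the same thing: one way to double-check which judge model a task is configured with, before a long run, might be to grep the task config. A small sketch, assuming the MM-Vet task YAML sits at the path below and records the judge under a `gpt_eval_model_name`-style key; both the path and the key name are assumptions:

```python
# Sketch: print any lines in the task config that mention the judge model,
# to catch an accidental override (e.g. gpt-4-turbo instead of the default).
from pathlib import Path

TASK_YAML = Path("lmms_eval/tasks/mmvet/mmvet.yaml")  # assumed location in the repo

for line in TASK_YAML.read_text().splitlines():
    if "gpt_eval_model_name" in line or "gpt-4" in line:
        print(line.strip())
```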
Hi, thanks for your great work. I am reproducing the evaluation results with the latest codebase and also the latest LLaVA codebase. The results on other benchmarks match or differ only slightly. However, the performance on MM-Vet is very low. Could you please check the MM-Vet evaluation on your side, or tell me what I should be careful about? Thank you!