EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

[Reproduce] Unable to reproduce AI2D, ChartQA and InfoVQA results for llava-1.6-mistral-7b #122

Open GoGoJoestar opened 4 months ago

GoGoJoestar commented 4 months ago

I am trying to reproduce some tasks' results on llava-1.6-mistral-7b, but I found a large gap on AI2D, ChartQA, and InfoVQA. The lmms-eval version I use is 0.2.0.

My script:

python3 -m accelerate.commands.launch \
    --num_processes=3 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=/My/Downloaded/llava-v1.6-mistral-7b,conv_template=mistral_instruct,attn_implementation=flash_attention_2 \
    --tasks ai2d,chartqa,docvqa_val,scienceqa_full,infovqa_val \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_mistral \
    --output_path ./logs/

The results from my run, the lmms-eval 0.1 results spreadsheet, and the LLaVA official blog are shown below.

                      AI2D    ChartQA  InfoVQA  DocVQA  ScienceQA-full  ScienceQA-img
llava official        60.8    38.8     -        72.2    -               72.8
lmms-eval 0.1 blog    60.75   38.76    43.77    72.16   0.23            0
my reproduction       67.42   52.92    36.74    70.18   76.80           72.83

The DocVQA and ScienceQA-img results are similar, but the AI2D, ChartQA, and InfoVQA results differ by 7 to 14 points. What could cause such a gap? Did any task configs change between lmms-eval 0.1.0 and 0.2.0?
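
As a sanity check, I plan to diff the task configs between the two releases to see whether prompts or metrics changed. This is only a rough sketch: the v0.1.0/v0.2.0 tag names and the lmms_eval/tasks/ layout are my assumptions, so adjust to whatever refs and paths actually exist in the repo.

# Sketch: diff the task definitions of the affected benchmarks between releases.
# Tag names (v0.1.0, v0.2.0) and the lmms_eval/tasks/ layout are assumptions.
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git lmms-eval-src
cd lmms-eval-src
git diff v0.1.0 v0.2.0 -- lmms_eval/tasks/ai2d lmms_eval/tasks/chartqa lmms_eval/tasks/infovqa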

gordonhu608 commented 3 months ago

I got the same result for ChartQA, also 52.92. I'll check whether lmms-eval version 0.1.0 works. But if we are all getting this, I think it is a bug. Please give it a [bug] tag, and someone should fix it.
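
A rough sketch of how I plan to rerun under the older code (the v0.1.0 tag name is an assumption; substitute whatever ref or commit marks the 0.1.0 release):

# Sketch: install the older release and rerun only ChartQA for comparison.
# The v0.1.0 tag is an assumption; use the actual 0.1.0 ref if it differs.
pip install "git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@v0.1.0"
python3 -m accelerate.commands.launch \
    --num_processes=3 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=/My/Downloaded/llava-v1.6-mistral-7b,conv_template=mistral_instruct,attn_implementation=flash_attention_2 \
    --tasks chartqa \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_mistral_v010 \
    --output_path ./logs/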