EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

[Reproduce] Unable to reproduce AI2D, ChartQA and InfoVQA results for llava-1.6-mistral-7b #122

Open GoGoJoestar opened 4 months ago

GoGoJoestar commented 4 months ago

I am trying to reproduce some tasks' results on llava-1.6-mistral-7b, but I found a large gap on AI2D, ChartQA, and InfoVQA. The lmms-eval version I use is 0.2.0.

My script:

python3 -m accelerate.commands.launch \
    --num_processes=3 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=/My/Downloaded/llava-v1.6-mistral-7b,conv_template=mistral_instruct,attn_implementation=flash_attention_2 \
    --tasks ai2d,chartqa,docvqa_val,scienceqa_full,infovqa_val \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_mistral \
    --output_path ./logs/

The results from my run, the lmms-eval 0.1 results spreadsheet, and the LLaVA official blog are shown below.

                      AI2D    ChartQA  InfoVQA  DocVQA  ScienceQA-full  ScienceQA-img
llava official        60.8    38.8     -        72.2    -               72.8
lmms-eval 0.1 blog    60.75   38.76    43.77    72.16   0.23            0
my reproduction       67.42   52.92    36.74    70.18   76.80           72.83

The DocVQA and ScienceQA-img results are similar, but the AI2D, ChartQA, and InfoVQA results differ by 7 to 14 points. What could cause such a gap? Did any task configs change between lmms-eval 0.1.0 and 0.2.0?
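
As a sanity check, I plan to diff the task configs between the two releases to see whether prompts or metrics changed. This is only a rough sketch: the v0.1.0/v0.2.0 tag names and the lmms_eval/tasks/ layout are my assumptions, so adjust to whatever refs and paths actually exist in the repo.

# Sketch: diff the task definitions of the affected benchmarks between releases.
# Tag names (v0.1.0, v0.2.0) and the lmms_eval/tasks/ layout are assumptions.
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval.git lmms-eval-src
cd lmms-eval-src
git diff v0.1.0 v0.2.0 -- lmms_eval/tasks/ai2d lmms_eval/tasks/chartqa lmms_eval/tasks/infovqa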

gordonhu608 commented 3 months ago

I got the same result for ChartQA, also 52.92. I'll check whether lmms-eval version 0.1.0 works. But if we are all getting this, I think it is a bug. Please give it a [bug] tag, and someone should fix it.
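
A rough sketch of how I plan to rerun under the older code (the v0.1.0 tag name is an assumption; substitute whatever ref or commit marks the 0.1.0 release):

# Sketch: install the older release and rerun only ChartQA for comparison.
# The v0.1.0 tag is an assumption; use the actual 0.1.0 ref if it differs.
pip install "git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@v0.1.0"
python3 -m accelerate.commands.launch \
    --num_processes=3 \
    -m lmms_eval \
    --model llava \
    --model_args pretrained=/My/Downloaded/llava-v1.6-mistral-7b,conv_template=mistral_instruct,attn_implementation=flash_attention_2 \
    --tasks chartqa \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_mistral_v010 \
    --output_path ./logs/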