I am trying to reproduce the results for some tasks on llava-1.6-mistral-7b, but found a large gap on AI2D, ChartQA and InfoVQA. The lmms-eval version I use is 0.2.0.
The results from my run, the lmms-eval 0.1 results spreadsheet, and the LLaVA official blog are shown below.
| Task | llava official | lmms-eval 0.1 blog | my reproduction |
|---|---|---|---|
| AI2D | 60.8 | 60.75 | 67.42 |
| ChartQA | 38.8 | 38.76 | 52.92 |
| InfoVQA | - | 43.77 | 36.74 |
| DocVQA | 72.2 | 72.16 | 70.18 |
| ScienceQA-full | - | 0.23 | 76.80 |
| ScienceQA-img | 72.8 | 0 | 72.83 |
The results for DocVQA and ScienceQA-img are similar, but the results for AI2D, ChartQA and InfoVQA differ by 7 to 14 points.
What could cause such a gap? Did any task configs change between lmms-eval 0.1.0 and 0.2.0?
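For reference, a typical lmms-eval 0.2.x launch for this model looks roughly like the command below. The flags follow the lmms-eval README; the exact task identifiers and the `conv_template` value are assumptions and may need adjusting to your installed version.

```bash
# Sketch of a standard lmms-eval launch for llava-v1.6-mistral-7b.
# Task names (ai2d, chartqa, infovqa, docvqa_val, scienceqa_img) and
# conv_template=mistral_instruct are assumptions -- check lmms_eval/tasks
# for the identifiers shipped with your installed version.
accelerate launch --num_processes=8 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.6-mistral-7b,conv_template=mistral_instruct" \
    --tasks ai2d,chartqa,infovqa,docvqa_val,scienceqa_img \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```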
I got the same result for ChartQA, also 52.92. I'll try whether lmms-eval 0.1.0 works. If we are all seeing this, I think it is a bug. Please give it a [bug] tag so someone can fix it.
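A quick way to compare is installing the older release in a separate environment, for example from the GitHub tag (the tag name below is assumed; check the repo's releases page):

```bash
# Install lmms-eval 0.1.0 for a side-by-side run against 0.2.0.
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git@v0.1.0
```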