As the title says, this PR adds the TinyLLaVA model family.

## Reproducing original reported results (help needed from maintainers)
Since the `tinyllava` package may have some dependency conflicts with `lmms-eval`, I followed the existing example of `llava`: build `tinyllava` and `lmms-eval` without dependencies, then install the dependencies from a requirements file, `miscs/tinyllava_repr_requirements.txt`. The setup script with the described steps is `miscs/tinyllava_repr_scripts.sh`.
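For readers skimming this PR, a minimal sketch of what the setup amounts to (the canonical steps are in `miscs/tinyllava_repr_scripts.sh`; the repo paths below are placeholders):

```bash
# Sketch of the dependency-conflict workaround: install both packages
# without their declared dependencies, then install a pinned set of
# dependencies afterwards. Paths are placeholders for local checkouts.
pip install --no-deps -e /path/to/TinyLLaVA   # tinyllava without its deps
pip install --no-deps -e /path/to/lmms-eval   # lmms-eval without its deps
pip install -r miscs/tinyllava_repr_requirements.txt  # resolved deps, installed last
```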
|            | MME    | MMMU_val | MMVet | POPE | ScienceQA_img | TextVQA | GQA  | VQAv2       |
|------------|--------|----------|-------|------|---------------|---------|------|-------------|
| reported   | 1466.4 | 38.4     | 37.5  | 87.2 | 73.0          | 60.3    | 62.1 | 80.1 (test) |
| reproduced | 1467.0 | 38.6     | 34.5  | 87.3 | 72.9          | 55.8    | 62.2 | 78.2 (val)  |
The table above compares the official results reported in the last row of this table with those I reproduced using `lmms-eval`.

There are noticeable discrepancies on MMVet and TextVQA. I'm not sure whether they come from differences in the evaluation setup (I'm new to this field), and I would appreciate it if experienced people (e.g. the maintainers) could take a look.
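For context, the reproduced row comes from runs along these lines. This is a sketch only: the model name `tinyllava`, the `pretrained=` key in `--model_args`, and the checkpoint ID are assumptions following the existing `llava` integration, not verified values.

```bash
# Hedged sketch of one evaluation run; adjust --tasks per benchmark.
# `tinyllava` / `pretrained=...` are assumed to mirror the llava example.
python -m lmms_eval \
    --model tinyllava \
    --model_args pretrained=tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B \
    --tasks mmvet \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```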
For MMVet, `lmms-eval` seems to use GPT for evaluation, as the final `results.json` has a `gpt_eval_score,none` field for the MMVet result (this is where I'm getting the 34.5 number). However, per TinyLLaVA's evaluation instructions, MMVet results need to be submitted to an evaluation server.
For TextVQA (val), I'm taking the number from the `exact_match,none` field under `textvqa_val` in `results.json`. However, I also see a `submission,none` field there, and I'm not sure whether that means TextVQA results need to be submitted somewhere. Meanwhile, it's unclear which metric TinyLLaVA reports for TextVQA (val), due to limited documentation.
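For concreteness, this is where I'm reading the numbers from; a sketch assuming `results.json` follows the lm-eval-harness-style layout (a top-level `results` object keyed by task name):

```bash
# Sketch: pull the two metrics discussed above out of results.json,
# assuming a top-level "results" object keyed by task name.
jq '.results.mmvet["gpt_eval_score,none"]' results.json       # -> 34.5 in my run
jq '.results.textvqa_val["exact_match,none"]' results.json    # the number I report
jq '.results.textvqa_val["submission,none"]' results.json     # purpose unclear to me
```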