EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Unable to reproduce SQA results for llava-1.5 #115

Open clairez-cerebras opened 3 months ago

clairez-cerebras commented 3 months ago

I was attempting to reproduce llava-1.5's results on ScienceQA but was not able to match the reported numbers. Command:

python -m accelerate.commands.launch --num_processes=1 -m lmms_eval --config ./configs/eval_scienceqa_llava1.5.yaml

Config:

- model: llava
  model_args: pretrained=liuhaotian/llava-v1.5-7b,use_flash_attention_2=False,model_name=llava
  tasks: scienceqa_full
  batch_size: 1
  log_samples: true
  log_samples_suffix: llava1.5_sqa
  output_path: "./logs/"

The results I got:

|     Tasks      |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|----------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full  |N/A    |none  |     0|exact_match|0.3699|±  |0.0097|
| - scienceqa    |Yaml   |none  |     0|exact_match|0.3744|±  |0.0074|
| - scienceqa_img|Yaml   |none  |     0|exact_match|0.3604|±  |0.0107|

|    Groups    |Version|Filter|n-shot|  Metric   |Value |   |Stderr|
|--------------|-------|------|-----:|-----------|-----:|---|-----:|
|scienceqa_full|N/A    |none  |     0|exact_match|0.3699|±  |0.0097|

which is far from what's reported in the paper. For example, SQA-IMG is reported as 71.6 in the llava-1.5 paper, and SQA in general is around 70.4 in the Excel sheet. What could be wrong?

kcz358 commented 3 months ago

Thank you for reporting the issue. I will try to look into this error later.

GoGoJoestar commented 3 months ago

I encountered the same problem when reproducing llava-1.6-mistral-7b results on ScienceQA. I found that the cause may be the following lines in models/llava.py.

https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/efb529552c5e4ba039a4cba8e9aa5cb7ba65bf90/lmms_eval/models/llava.py#L361-L371

Although the comment there says "The above for loop has bug" when the input has no visuals, the loop actually runs normally and appends a prompt_question to the question_input list, and then these lines append a prompt_question again. As a result, inputs without visuals generate two answers, leading to an order mismatch between questions and answers.

After removing these lines of code, the scienceqa_full result changes from 36.3 to 76.8.
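To make the mismatch concrete, here is a minimal, hypothetical sketch of the control flow described above. The helper `make_prompt` and the exact variable names are invented for illustration only and are not the actual `lmms_eval/models/llava.py` code:

```python
# Hypothetical, simplified sketch of the duplication described above;
# names only mirror the linked snippet, not the real implementation.

def make_prompt(context, visuals):
    # Stand-in for the real prompt construction in llava.py.
    return ("<image>\n" if visuals else "") + context

def build_question_inputs(contexts, batched_visuals):
    question_input = []
    for context, visuals in zip(contexts, batched_visuals):
        # The main loop already handles the no-visual case and appends a prompt.
        question_input.append(make_prompt(context, visuals))

    # The lines flagged in the issue effectively do this as well: append
    # another prompt whenever a context has no visuals, so that context
    # yields two generations and all later answers shift by one position.
    for context, visuals in zip(contexts, batched_visuals):
        if not visuals:
            question_input.append(make_prompt(context, None))

    return question_input

# With one text-only question in the batch, the buggy version returns three
# prompts for two questions, which is the question/answer order mismatch
# described above.
print(build_question_inputs(["Q1 (has image)", "Q2 (text only)"],
                            [["img"], []]))
```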

kcz358 commented 3 months ago

Hi @GoGoJoestar , I think your fix is correct. We previously used flattened visuals instead of batched visuals in that loop, which caused errors when handling inputs with no visuals. I will remove these lines.
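For readers following along, a rough illustration of why flattening loses the no-visual case (the data shapes here are assumptions, not the real lmms-eval structures):

```python
# Hypothetical illustration: with batched visuals, an empty list marks a
# text-only request; after flattening, that information is lost, so a loop
# over the flattened list cannot tell which request had no visuals.
batched_visuals = [["img_a"], [], ["img_b", "img_c"]]  # one entry per request
flattened_visuals = [v for per_request in batched_visuals for v in per_request]
print(flattened_visuals)  # ['img_a', 'img_b', 'img_c'] -- no trace of request 2
```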