TRI-ML / vlm-evaluation

VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning

Error when evaluating on POPE-full #2

Closed: djghosh13 closed this issue 2 months ago

djghosh13 commented 3 months ago

When evaluating LLaVA on the pope-full dataset, I'm getting the following error:

Traceback (most recent call last):
  File "/weka/home-djghosh/code/llava_ft/vlm-evaluation/scripts/score.py", line 145, in <module>
    score()
  File "/weka/home-djghosh/miniconda3/envs/vlm/lib/python3.10/site-packages/draccus/argparsing.py", line 203, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/weka/home-djghosh/code/llava_ft/vlm-evaluation/scripts/score.py", line 104, in score
    len(full_results) == cfg.dataset.expected_examples
AssertionError: Expected 9000 model outputs, only found 8910!

It seems like POPE expects 3000 samples each from the popular, random, and adversarial splits, but the random split contains only 2910 samples. Looking into it, the download source for POPE in https://github.com/TRI-ML/vlm-evaluation/tree/main/vlm_eval/tasks/registry.py points to an older commit of the POPE dataset in which the random split does indeed have only 2910 examples rather than 3000.

Which of these is the intended dataset for comparison with the Prismatic VLMs? That is, should pope-full expect 8910 examples, or does the dataset download source need to be updated? Thanks!
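
For reference, this is roughly how I checked the split sizes after the prepare step. The directory and file names below are assumptions based on the upstream POPE layout (coco_pope_<split>.json, one JSON object per line), so adjust them to wherever the prepare script actually puts the annotations:

```python
# Quick sanity check (not part of the repo): count examples per POPE split.
# Paths/filenames are assumptions; point them at your prepared POPE directory.
from pathlib import Path

pope_dir = Path("datasets/pope")  # hypothetical download/prepare location

for split in ("popular", "random", "adversarial"):
    ann_file = pope_dir / f"coco_pope_{split}.json"  # JSON-lines, one example per line
    n_examples = sum(1 for line in ann_file.open() if line.strip())
    print(f"{split}: {n_examples} examples")

# With the older pinned commit, "random" prints 2910; the newer upstream data has 3000.
```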

djghosh13 commented 3 months ago

It also seems like your code at https://github.com/TRI-ML/vlm-evaluation/blob/098224f2a49bcfe2d3e1d21e75967d1c1cedcca8/vlm_eval/tasks/harnesses/pope.py#L59 does indeed expect 2910 examples in the random split, so the dataset preparation scripts disagree with the scoring script (score.py) about how many examples there should be.
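
To spell out the mismatch: the per-split counts the harness expects sum to 8910, while the scoring config expects 9000 for pope-full. A minimal illustration (the variable names are mine, not the repo's actual constants):

```python
# Hypothetical constants mirroring the mismatch described above; the real values
# live in vlm_eval/tasks/harnesses/pope.py and the pope-full dataset config.
expected_per_split = {"popular": 3000, "random": 2910, "adversarial": 3000}
expected_full = 9000  # what score.py asserts via cfg.dataset.expected_examples

total = sum(expected_per_split.values())  # 3000 + 2910 + 3000 = 8910
assert total == expected_full, f"per-split total {total} != expected full {expected_full}"
# AssertionError: per-split total 8910 != expected full 9000
```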

siddk commented 3 months ago

Our current code (to reproduce the results in the paper) expects the older version of the POPE dataset with the 2910 examples.

However, happy to update this moving forward — what do you think @djghosh13?

djghosh13 commented 3 months ago

Thanks for clarifying!

The newer commits to POPE fix a number of spelling errors in addition to adding the 90 missing examples, so it would probably be preferable to run on that version. The changes to your code should be minimal (just updating 8910 to 9000 and replacing the commit ID in the download script); it's mostly a question of whether it's worth re-running the POPE evals for the models you've already tested.
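
Roughly, the forward-looking fix would amount to something like the sketch below (the names are illustrative, not the actual identifiers in the repo, and the new commit SHA is left as a placeholder):

```python
# Illustrative sketch of the two changes discussed above, not actual repo code.
# 1) Repoint the POPE download in registry.py to the newer upstream commit.
POPE_COMMIT = "<newer-pope-commit-sha>"  # placeholder for the updated pin

# 2) Raise the expected counts so dataset prep and scoring agree on the new data.
EXPECTED_PER_SPLIT = {"popular": 3000, "random": 3000, "adversarial": 3000}
EXPECTED_FULL = sum(EXPECTED_PER_SPLIT.values())  # 9000 for pope-full
```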

siddk commented 2 months ago

@djghosh13 - closing this for now as we're switching over to the newest version of POPE. Will respond to this issue once we get new results on all the models (as part of our next release!)