EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens #63

Closed. hunterheiden closed this 2 months ago.

hunterheiden commented 2 months ago

This PR adds the ScreenSpot benchmark from the SeeClick paper. This benchmark was designed for grounding (i.e., referring expression comprehension (REC)), but can also be inverted to yield an instruction generation (i.e., referring expression generation (REG)) benchmark. The data domain is screens: mobile, desktop, and web screenshots.

This PR simply adds ScreenSpot, mirroring the RefCOCO-style datasets for REC/REG.
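Roughly, the grounding check looks like the sketch below. This is illustrative only: the normalized-coordinate convention, the point-vs-box handling, and the function names are my assumptions here, and the metric actually wired into the task (mirroring RefCOCO) may instead be IoU-based rather than the click-in-box accuracy used in the SeeClick paper.

```python
# Hedged sketch of ScreenSpot-style grounding (REC) scoring.
# Assumption: predictions are normalized [x, y] click points or
# [x1, y1, x2, y2] boxes; ground truth is a normalized element box.
from typing import Sequence


def point_in_box(point: Sequence[float], box: Sequence[float]) -> bool:
    """Return True if (x, y) falls inside box = (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def box_center(box: Sequence[float]) -> tuple[float, float]:
    """Center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2


def grounding_accuracy(preds, gt_boxes) -> float:
    """Fraction of predictions whose point (or box center) lands in the target box."""
    hits = 0
    for pred, gt in zip(preds, gt_boxes):
        point = pred if len(pred) == 2 else box_center(pred)
        hits += point_in_box(point, gt)
    return hits / len(gt_boxes) if gt_boxes else 0.0


# Example: one click prediction and one box prediction, both hitting their targets.
preds = [[0.42, 0.18], [0.10, 0.70, 0.30, 0.80]]
gt_boxes = [[0.35, 0.10, 0.50, 0.25], [0.05, 0.65, 0.35, 0.85]]
print(grounding_accuracy(preds, gt_boxes))  # -> 1.0
```

The REG direction simply inverts the pairing: the element box becomes the input and the instruction text becomes the target, so it is scored with text-generation metrics rather than a localization check.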

The attached results.json shows the scores LLaVA-v1.5-7B achieves on these tasks.

Luodian commented 2 months ago

Hi! Thanks for this PR. This looks clean and well-formed.

But I have a suggestion: it would be better to add a README.md file inside the lmms_eval/tasks/screenspot folder. Since we are adding more and more tasks, we will gradually need a brief introduction for each task~

The format can follow: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/arc

hunterheiden commented 2 months ago

Makes sense! I've now generated a README for this task. It provides the information from that template and adds a bit about the metrics used as well. Let me know if the formatting should be different or if this suffices.