EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

[New Task] WebSRC (multimodal Q&A on web screenshots) #69

Closed hunterheiden closed 1 month ago

hunterheiden commented 1 month ago

This PR adds WebSRC as an additional benchmark.

WebSRC is a Q&A dataset over website screenshots. Although it was originally designed to be compatible with multimodal models that ingest HTML, text, and images, this evaluation formats the task as Q&A over the images alone (neither HTML nor OCR output is provided to the model).

The benchmark is scored via token-level F1 score. The validation split uses this metric because ground truth answers are available. For the test split, this code compiles a submission file which may be emailed to the original researchers and officially scored.
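For reference, token-level F1 is the standard SQuAD-style answer metric: precision and recall are computed over the multiset of overlapping tokens between the predicted and gold answer strings. A minimal sketch (the exact normalization used in this PR may differ, e.g. in punctuation or article handling):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted and a gold answer string.

    Tokens are lowercased whitespace-split words; overlap is counted
    as a multiset intersection, SQuAD-style.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the red button", "red button")` gives precision 2/3 and recall 1, so F1 = 0.8. The validation-split score is the mean of this value over all examples.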

I've validated this code using liuhaotian/llava-v1.5-7b, which scores an overall F1 of 51.8. The domain-specific crosstabs are:

| Model | auto | book | camera | game | jobs | movie | phone | restaurant | sports | university | hotel |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| liuhaotian/llava-v1.5-7b | 57.8 | 64.9 | - | 58.8 | 39.2 | 68.5 | 58.8 | - | 36.6 | - | - |

(Unscored domains have no samples in the validation split; the dataset was not balanced with respect to domain when the splits were sampled.)

Full JSON results: WebSRC_val_results.json

Luodian commented 1 month ago

Thanks! The format of this PR is quite good. I will test it and merge it soon.