This PR adds WebSRC as an additional benchmark.
WebSRC is a Q&A dataset over website screenshots. Although originally designed for multimodal models that ingest HTML, text, and images, this evaluation formats the task as Q&A over the images alone (neither HTML nor OCR text is provided to the model).
The benchmark is scored via token-level F1. The validation split uses this metric because ground-truth answers are available. For the test split, this code compiles a submission file that can be emailed to the original researchers for official scoring.
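Token-level F1 here is the standard SQuAD-style metric: normalize both answers, then compute precision and recall over the overlapping bag of tokens. A minimal sketch of that computation (function names are illustrative, not the PR's actual helpers):

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    # Bag-of-tokens overlap: multiset intersection counts shared tokens.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "cat sat down")` yields 0.8: after normalization the prediction is `["cat", "sat"]` (precision 1.0) against a three-token gold answer (recall 2/3).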
I've validated this code with liuhaotian/llava-v1.5-7b, which scores an overall F1 of 51.8. The domain-specific crosstabs are:
(Unscored domains have no samples in the validation split; the dataset was not balanced by domain when the splits were sampled.)
Full JSON results: WebSRC_val_results.json