This PR adds WebSRC as an additional benchmark.
WebSRC is a Q&A dataset over website screenshots. Although originally designed for multimodal models that ingest HTML, text, and images, this evaluation formats the task as Q&A over the images alone (neither HTML nor OCR text is provided to the model).
The benchmark is scored via token-level F1. The validation split uses this metric because ground-truth answers are available. For the test split, this code compiles a submission file that can be emailed to the original researchers for official scoring.
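Token-level F1 here is the standard SQuAD-style metric: normalize both answers, then compute precision and recall over the overlapping bag of tokens. A minimal sketch of that computation (function names are illustrative, not the PR's actual helpers):

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, drop articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(ground_truth)
    # Bag-of-tokens overlap: multiset intersection counts shared tokens.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "cat sat down")` yields 0.8: after normalization the prediction is `["cat", "sat"]` (precision 1.0) against a three-token gold answer (recall 2/3).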
I've validated this code with liuhaotian/llava-v1.5-7b, which scores an overall F1 of 51.8. The domain-specific crosstabs are:
(Unscored domains have no samples in the validation split; the dataset was not balanced by domain when the splits were sampled.)
Full JSON results: WebSRC_val_results.json