EvolvingLMMs-Lab / lmms-eval

Accelerating the development of large multimodal models (LMMs) with lmms-eval
https://lmms-lab.github.io/

Bugfix: WebSRC should be token-level F1 NOT character-level #70

Closed hunterheiden closed 1 month ago

hunterheiden commented 1 month ago

Apologies for another rapid PR on this; I was doing additional validation with this benchmark yesterday and realized the F1 scores were incorrectly computed. Instead of token-level F1, WebSRC was being scored at the character level.

This PR is a small bugfix that rectifies that (and a KeyError in submission compilation).
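For context, token-level F1 here refers to the SQuAD-style metric: tokenize both strings on whitespace and compute F1 over the token overlap. The sketch below is illustrative, not the exact lmms-eval implementation; the function name and normalization are assumptions.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-level F1 between a predicted and a gold answer.

    Illustrative sketch; lmms-eval's actual normalization may differ.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# The character-level bug amounts to iterating over the strings directly
# (every character becomes a "token"), so near-miss answers still share
# many characters and the score is inflated.
```

With the character-level version, a wrong answer like "12" vs "21" would score a perfect F1 because the character multisets match, whereas token-level F1 correctly scores it 0.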

Unfortunately, the previous metric gave an inflated sense of how well LLaVA performs. Token-level F1 now shows a score of 30.9, with the following per-domain breakdown:

| Model | auto | book | camera | game | jobs | movie | phone | restaurant | sports | university | hotel |
|---|---|---|---|---|---|---|---|---|---|---|---|
| liuhaotian/llava-v1.5-7b | 48.0 | 60.7 | - | 30.8 | 10.9 | 60.0 | 27.2 | - | 12.8 | - | - |