Apologies for another rapid PR on this; I was doing additional validation with this benchmark yesterday and realized the F1 scores were incorrectly computed. Instead of token-level F1, WebSRC was scoring at the character level.
This PR is a small bugfix that rectifies that (and a KeyError in submission compilation).
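For reference, token-level F1 here means the standard SQuAD-style formulation over whitespace tokens rather than individual characters. A minimal sketch of that computation (not the exact code in this PR, just the intended behavior):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a gold answer.

    The earlier bug was equivalent to comparing list(prediction) to
    list(ground_truth), i.e. scoring character overlap instead of
    token overlap, which inflates scores on partially-correct answers.
    """
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Multiset intersection of tokens shared by prediction and gold.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```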
Unfortunately, the previous metric gave an inflated sense of how LLaVA performs. Token-level F1 now comes out to 30.9, with the following cross-tabs: