Apologies for another rapid PR on this; I was doing additional validation with this benchmark yesterday and realized the F1 scores were incorrectly computed. Instead of token-level F1, WebSRC was scoring at the character level.
This PR is a small bugfix that rectifies that (and a KeyError in submission compilation).
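For reference, token-level F1 here means the standard SQuAD-style formulation over whitespace tokens rather than individual characters. A minimal sketch of that computation (not the exact code in this PR, just the intended behavior):

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-level F1 between a prediction and a gold answer.

    The earlier bug was equivalent to comparing list(prediction) to
    list(ground_truth), i.e. scoring character overlap instead of
    token overlap, which inflates scores on partially-correct answers.
    """
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Multiset intersection of tokens shared by prediction and gold.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```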
Unfortunately, the previous metric gave an inflated sense of how LLaVA performs. Token-level F1 now comes out to 30.9, with the following cross-tabs: