The forward pass computes F1 against each reference answer, and adds that F1 score to the reference answer type's bucket for computing type-wise F1: here
Whereas, the official evaluator finds the reference closest to the prediction and adds the F1 to that reference answer type's bucket: here
The implementation in the forward pass is more sensible because the set of questions that contribute to each type's F1 will be independent of the model output. The evaluator needs to be fixed.
The forward pass computes F1 against each reference answer, and adds that F1 score to the reference answer type's bucket for computing type-wise F1: here
Whereas, the official evaluator finds the reference closest to the prediction and adds the F1 to that reference answer type's bucket: here
The implementation in the forward pass is more sensible because the set of questions that contribute to each type's F1 will be independent of the model output. The evaluator needs to be fixed.