Answer type F1 calculation in the evaluator is different from the one on the model's forward pass

The forward pass computes F1 against every reference answer and adds each score to the bucket for that reference's answer type when computing type-wise F1: here

The official evaluator, by contrast, finds the single reference closest to the prediction and adds that F1 only to the bucket for that reference's answer type: here
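To make the difference concrete, here is a minimal sketch of the two bucketing strategies. It is not the repository's code: `token_f1`, `bucket_every_reference`, `bucket_closest_reference`, the reference dict layout, and the example answers are all hypothetical, and "closest" is assumed to mean the reference with the highest token-level F1.

```python
from collections import Counter, defaultdict


def token_f1(prediction: str, reference: str) -> float:
    """Simplified whitespace-token F1 (hypothetical stand-in for the real metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, exactly one empty as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def bucket_every_reference(prediction, references):
    """Forward-pass style: every reference contributes to its own type's bucket."""
    buckets = defaultdict(list)
    for ref in references:
        buckets[ref["type"]].append(token_f1(prediction, ref["text"]))
    return dict(buckets)


def bucket_closest_reference(prediction, references):
    """Evaluator style: only the best-matching reference's type gets a score."""
    best = max(references, key=lambda ref: token_f1(prediction, ref["text"]))
    return {best["type"]: [token_f1(prediction, best["text"])]}


# Hypothetical question with two references of different answer types.
references = [
    {"text": "BERT base", "type": "extractive"},
    {"text": "Yes", "type": "boolean"},
]

for prediction in ("yes", "BERT base model"):
    print(prediction, bucket_every_reference(prediction, references))
    print(prediction, bucket_closest_reference(prediction, references))
```

With the first strategy both predictions touch the same two buckets; with the second, the bucket that receives the score changes with the prediction.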

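The forward-pass implementation is more sensible, because the set of questions that contributes to each type's F1 is then independent of the model output. In the evaluator, the bucket a question falls into depends on which reference the prediction happens to match best. The evaluator needs to be fixed.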

allenai/qasper-led-baseline, issue #14