ivo-1 opened 1 year ago
I've additionally validated that the evaluation miscalculates when the first two rows contain the wrong answer, but not when the first and third rows contain the wrong answer and the second row is correct. It is also correct when all rows contain exactly one wrong answer.
In the case of the first and third rows each containing exactly one wrong answer (for the post town), the evaluation is correct (although the issue with the micro-averaged F1 persists):
          F1        P         R
(UC)      89.6±6.2  89.6±6.2  89.6±6.2
address   73±16     73±16     73±16
money     100±0     100±0     100±0
town      33±33     33±33     33±33
postcode  100±0     100±0     100±0
street    100±0     100±0     100±0
name      100±0     100±0     100±0
number    100±0     100±0     100±0
income    100±0     100±0     100±0
spending  100±0     100±0     100±0
date      100±0     100±0     100±0

F1        89.6±6.2
Accuracy  33±33
Mean-F1   89.2±6.7
The evaluation is also correct when the first key is wrong in all three documents. So maybe it's just an edge case where it goes completely wrong.
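(For reference, here is a minimal sketch of how such perturbed out.tsv variants can be generated. It assumes a Kleister-style format of one document per line with space-separated key=value pairs; both that format and the flip_answer helper are illustrative assumptions, not part of the official tooling.)

```python
# Minimal sketch: flip one key's value in selected documents of an out.tsv.
# Assumes one document per line with space-separated key=value pairs
# (Kleister-style); this format is an assumption, not confirmed here.
def flip_answer(line: str, key: str, bogus: str = "WRONG_TOWN") -> str:
    pairs = line.split(" ")
    flipped = [f"{key}={bogus}" if p.startswith(key + "=") else p for p in pairs]
    return " ".join(flipped)

with open("out.tsv") as f:
    lines = [l.rstrip("\n") for l in f]

# e.g. wrong answer in the first and third document only
for i in (0, 2):
    lines[i] = flip_answer(lines[i], "address__post_town")

with open("out_perturbed.tsv", "w") as f:
    f.write("\n".join(lines) + "\n")
```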
I think there is a bug with the evaluation.
Consider this minimal example:
- expected.tsv
- out_1.tsv (1 wrong answer for address__post_town in the first document)
- out_2.tsv (2 wrong answers for address__post_town, in the first and the second document)
These two out.tsv files yield the same evaluation result for me with the official evaluation script used in the README.
The evaluations yield:
It seems like this evaluation would be correct for out_1.tsv, as then the Mean-F1 (macro-average over documents) is (4/5 + 8/8 + 8/8)/3 = 0.933. For out_2.tsv it should instead be (4/5 + 7/8 + 8/8)/3 = 0.892. Let me know whether you can reproduce this.
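(For comparison, a minimal sketch of the macro-averaged Mean-F1 I would expect, treating each document's annotations as a set of key=value pairs; the set-based matching is an assumption about how documents should be scored, not the script's actual code.)

```python
def f1(expected: set, predicted: set) -> float:
    # Per-document F1 over key=value pairs.
    tp = len(expected & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

def mean_f1(expected_docs, predicted_docs) -> float:
    # Macro-average: the mean of the per-document F1 scores.
    scores = [f1(e, p) for e, p in zip(expected_docs, predicted_docs)]
    return sum(scores) / len(scores)

# With the counts above, out_2.tsv has per-document F1s of 4/5, 7/8, 8/8,
# giving (0.8 + 0.875 + 1.0) / 3 = 0.892.
```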
A separate but related issue: as mentioned in #5, the F1 score at the bottom should be the micro-averaged F1 over all predictions. This also doesn't work out for either out_1.tsv or out_2.tsv, as it would be 20 (correct key-value pairs) / 21 (key-value pairs in the solution) = 0.952 and 19/21 = 0.905, respectively.
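(And a minimal sketch of that micro-averaged F1, pooling all key=value pairs across documents before computing the score, under the same set-based matching assumption as above.)

```python
def micro_f1(expected_docs, predicted_docs) -> float:
    # Micro-average: pool true positives and totals over all documents.
    tp = sum(len(e & p) for e, p in zip(expected_docs, predicted_docs))
    n_pred = sum(len(p) for p in predicted_docs)
    n_gold = sum(len(e) for e in expected_docs)
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)

# With 21 gold pairs and 21 predictions: 20 correct -> 20/21 = 0.952 (out_1.tsv),
# 19 correct -> 19/21 = 0.905 (out_2.tsv).
```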