Evaluation gives wrong results

I think there is a bug in the evaluation.

Consider this minimal example, starting with expected.tsv:

address__post_town=BROADWAY address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_1.tsv (1 wrong answer for address__post_town in the first document)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_2.tsv (2 wrong answers for address__post_town, one in the first document and one in the second)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=Wrong address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

Both out_1.tsv and out_2.tsv yield the same evaluation result when I run the official evaluation script from the README.

Both evaluations yield:

            F1          P           R
(UC)        94.4±5.6    94.4±5.6    94.4±5.6
address     86±14       86±14       86±14
money       100±0       100±0       100±0
town        67±33       67±33       67±33
postcode    100±0       100±0       100±0
street      100±0       100±0       100±0
name        100±0       100±0       100±0
number      100±0       100±0       100±0
income      100±0       100±0       100±0
spending    100±0       100±0       100±0
date        100±0       100±0       100±0

F1          94.4±5.6
Accuracy    67±33
Mean-F1     93.3±6.7

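This result looks correct for out_1.tsv. Every document has as many predicted key-value pairs as expected ones, so the per-document F1 equals the fraction of correct pairs, and the Mean-F1 (macro-average over documents) is (4/5 + 8/8 + 8/8)/3 = 0.933. For out_2.tsv, however, it should be (4/5 + 7/8 + 8/8)/3 = 0.892.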

Let me know if you can reproduce this or not.

A separate but related issue: as mentioned in #5, the F1 score at the bottom should be the micro-averaged F1 over all predictions. The reported 94.4 does not match this for either file. It would be 20 (correct key-value pairs) / 21 (key-value pairs in the solution) = 0.952 for out_1.tsv and 19/21 = 0.905 for out_2.tsv.
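The micro-averaged figures can be checked the same way. This is a rough sketch, not the official script: it pools the pair counts from the example above across all three documents before computing precision and recall. The `micro_f1` helper and the count tuples are written by hand for this illustration.

```python
# Hypothetical check of the micro-averaged F1; NOT the official evaluation script.
# Each tuple is (correct, predicted, expected) key-value pairs for one document,
# counted by hand from expected.tsv, out_1.tsv and out_2.tsv.
OUT_1_COUNTS = [(4, 5, 5), (8, 8, 8), (8, 8, 8)]
OUT_2_COUNTS = [(4, 5, 5), (7, 8, 8), (8, 8, 8)]


def micro_f1(counts):
    """Pool the counts over all documents, then compute a single F1."""
    correct = sum(c for c, _, _ in counts)
    predicted = sum(p for _, p, _ in counts)
    expected = sum(e for _, _, e in counts)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / expected
    return 2 * precision * recall / (precision + recall)


print(round(micro_f1(OUT_1_COUNTS), 3))  # 0.952
print(round(micro_f1(OUT_2_COUNTS), 3))  # 0.905
```

Because the number of predicted pairs equals the number of expected pairs here, precision, recall and F1 all reduce to correct/21.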
