applicaai / kleister-charity


Evaluation gives wrong results #6

Open ivo-1 opened 1 year ago

ivo-1 commented 1 year ago

I think there is a bug with the evaluation.

Consider this minimal example:

expected.tsv

address__post_town=BROADWAY address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_1.tsv (1 wrong answer for address__post_town in the first document)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_2.tsv (2 wrong answers for address__post_town in the first and the second document)

address__post_town=Wrong address__postcode=WR12_7NL charity_name=Wormington_Village_Society charity_number=1155074 report_date=2018-07-31
address__post_town=Wrong address__postcode=SS0_8HX address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice charity_number=1022119 income_annually_in_british_pounds=10348000.00 report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00
address__post_town=CHELTENHAM address__postcode=GL50_3EP address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College charity_number=311722 income_annually_in_british_pounds=32168000.00 report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00

out_1.tsv and out_2.tsv give me the same evaluation result with the official evaluation script referenced in the README, even though out_2.tsv contains one more wrong answer.

Both evaluations yield:

           F1         P          R
(UC)       94.4±5.6   94.4±5.6   94.4±5.6
address    86±14      86±14      86±14
money      100±0      100±0      100±0
town       67±33      67±33      67±33
postcode   100±0      100±0      100±0
street     100±0      100±0      100±0
name       100±0      100±0      100±0
number     100±0      100±0      100±0
income     100±0      100±0      100±0
spending   100±0      100±0      100±0
date       100±0      100±0      100±0

F1         94.4±5.6
Accuracy   67±33
Mean-F1    93.3±6.7

This result would be correct for out_1.tsv, where the Mean-F1 (macro-average over documents) is (4/5 + 8/8 + 8/8)/3 = 0.933. For out_2.tsv, however, it should be (4/5 + 7/8 + 8/8)/3 = 0.892.
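The hand calculation of the Mean-F1 can be reproduced with a small Python sketch. This is not the official evaluation script. It assumes each row is a whitespace-separated list of key=value pairs and that a predicted pair counts as correct only when it matches the expected pair exactly. The helper names (`parse_row`, `doc_f1`, `mean_f1`, `with_wrong_town`) are made up for illustration.

```python
# Sketch only: NOT the official evaluation script; all helper names are hypothetical.
EXPECTED = [
    "address__post_town=BROADWAY address__postcode=WR12_7NL "
    "charity_name=Wormington_Village_Society charity_number=1155074 "
    "report_date=2018-07-31",
    "address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX "
    "address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice "
    "charity_number=1022119 income_annually_in_british_pounds=10348000.00 "
    "report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00",
    "address__post_town=CHELTENHAM address__postcode=GL50_3EP "
    "address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College "
    "charity_number=311722 income_annually_in_british_pounds=32168000.00 "
    "report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00",
]


def with_wrong_town(row):
    """Replace the first pair (address__post_town) with the wrong value used in the issue."""
    _, _, rest = row.partition(" ")
    return "address__post_town=Wrong " + rest


OUT_1 = [with_wrong_town(EXPECTED[0]), EXPECTED[1], EXPECTED[2]]
OUT_2 = [with_wrong_town(EXPECTED[0]), with_wrong_town(EXPECTED[1]), EXPECTED[2]]


def parse_row(row):
    """Turn 'k1=v1 k2=v2' into a set of (key, value) pairs."""
    return {tuple(item.split("=", 1)) for item in row.split()}


def doc_f1(expected_row, predicted_row):
    """F1 for one document, assuming exact match on key-value pairs."""
    expected, predicted = parse_row(expected_row), parse_row(predicted_row)
    true_positives = len(expected & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def mean_f1(expected_rows, predicted_rows):
    """Macro-average: mean of the per-document F1 scores."""
    scores = [doc_f1(e, p) for e, p in zip(expected_rows, predicted_rows)]
    return sum(scores) / len(scores)


print(round(mean_f1(EXPECTED, OUT_1), 3))  # 0.933
print(round(mean_f1(EXPECTED, OUT_2), 3))  # 0.892
```

Under these assumptions the two outputs get different scores, which is what the reported table fails to show.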

Let me know if you can reproduce this or not.

A separate but related issue: as mentioned in #5, the F1 score at the bottom should be the micro-averaged F1 over all predictions. The reported 94.4 doesn't work out for either out_1.tsv or out_2.tsv: it should be 20 (correct key-value pairs) / 21 (number of key-value pairs in the solution) = 0.952 and 19/21 = 0.905 respectively.
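The micro-averaged numbers can be checked the same way. The sketch below is not the official evaluation script. It assumes exact matching of whitespace-separated key=value pairs and pools the counts over all documents before computing precision and recall. The helper names (`parse_row`, `micro_f1`, `with_wrong_town`) are hypothetical.

```python
# Sketch only: NOT the official evaluation script; all helper names are hypothetical.
EXPECTED = [
    "address__post_town=BROADWAY address__postcode=WR12_7NL "
    "charity_name=Wormington_Village_Society charity_number=1155074 "
    "report_date=2018-07-31",
    "address__post_town=WESTCLIFF-ON-SEA address__postcode=SS0_8HX "
    "address__street_line=47_SECOND_AVENUE charity_name=Havens_Christian_Hospice "
    "charity_number=1022119 income_annually_in_british_pounds=10348000.00 "
    "report_date=2016-03-31 spending_annually_in_british_pounds=9415000.00",
    "address__post_town=CHELTENHAM address__postcode=GL50_3EP "
    "address__street_line=BAYSHILL_ROAD charity_name=Cheltenham_Ladies_College "
    "charity_number=311722 income_annually_in_british_pounds=32168000.00 "
    "report_date=2016-07-31 spending_annually_in_british_pounds=27972000.00",
]


def with_wrong_town(row):
    """Replace the first pair (address__post_town) with the wrong value used in the issue."""
    _, _, rest = row.partition(" ")
    return "address__post_town=Wrong " + rest


OUT_1 = [with_wrong_town(EXPECTED[0]), EXPECTED[1], EXPECTED[2]]
OUT_2 = [with_wrong_town(EXPECTED[0]), with_wrong_town(EXPECTED[1]), EXPECTED[2]]


def parse_row(row):
    """Turn 'k1=v1 k2=v2' into a set of (key, value) pairs."""
    return {tuple(item.split("=", 1)) for item in row.split()}


def micro_f1(expected_rows, predicted_rows):
    """Micro-average: pool the counts over all documents, then compute F1 once."""
    true_positives = n_predicted = n_expected = 0
    for expected_row, predicted_row in zip(expected_rows, predicted_rows):
        expected, predicted = parse_row(expected_row), parse_row(predicted_row)
        true_positives += len(expected & predicted)
        n_predicted += len(predicted)
        n_expected += len(expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_expected
    return 2 * precision * recall / (precision + recall)


print(round(micro_f1(EXPECTED, OUT_1), 3))  # 0.952 (20/21)
print(round(micro_f1(EXPECTED, OUT_2), 3))  # 0.905 (19/21)
```

Because every prediction here has the same number of pairs as the solution, precision, recall and F1 coincide, so the micro-F1 reduces to correct pairs divided by 21.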

ivo-1 commented 1 year ago

I've additionally verified that the evaluation miscalculates when the first two rows each contain a wrong answer, but not when the first and third rows each contain a wrong answer and the second row is correct. It's also correct when all rows contain exactly one wrong answer.

When the first and third rows each contain exactly one wrong answer (for the post town), the Mean-F1 is correct: (4/5 + 8/8 + 7/8)/3 = 0.892. The issue with the micro-averaged F1 persists, though, since it is reported as 89.6 rather than 19/21 = 0.905:

           F1         P          R
(UC)       89.6±6.2   89.6±6.2   89.6±6.2
address    73±16      73±16      73±16
money      100±0      100±0      100±0
town       33±33      33±33      33±33
postcode   100±0      100±0      100±0
street     100±0      100±0      100±0
name       100±0      100±0      100±0
number     100±0      100±0      100±0
income     100±0      100±0      100±0
spending   100±0      100±0      100±0
date       100±0      100±0      100±0

F1         89.6±6.2
Accuracy   33±33
Mean-F1    89.2±6.7

The evaluation is also correct when the first key is wrong in all three documents. So this may just be an edge case where it goes completely wrong.