applicaai / kleister-charity


Evaluation - Differences between F1 scores #5

Open ivo-1 opened 1 year ago

ivo-1 commented 1 year ago

The evaluation script puts out 3 different F1 scores:

  1. Column F1 for the (UC) row
  2. F1 score (below all the keys)
  3. Mean F1 score

I have 3 questions:

  1. What does UC mean?
  2. How do those F1 scores compare?
  3. Which F1 score is reported in the accompanying paper?

Thanks again for your work!

tstanislawek commented 1 year ago

cc: @filipggg

ad 1) UC -> uncased (we are not checking the correctness of the casing; a small illustration below)
ad 2) @filipggg should know the answer
ad 3) F1 (UC) or Mean F1 (both should give the same numbers) -> but @filipggg please confirm
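
To make 1) concrete, a minimal sketch of what the uncased check could look like (a hypothetical helper, not the actual evaluation code):

```python
# Hypothetical illustration of cased vs. uncased (UC) matching,
# not the actual kleister-charity evaluation code.
def matches(predicted: str, expected: str, uncased: bool = False) -> bool:
    """Return True if the predicted value counts as correct."""
    if uncased:
        # (UC): ignore casing differences, e.g. "London" == "LONDON"
        return predicted.lower() == expected.lower()
    return predicted == expected

# Example: correct under (UC), wrong under the cased check
print(matches("LONDON", "London", uncased=True))   # True
print(matches("LONDON", "London", uncased=False))  # False
```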

ivo-1 commented 1 year ago

thanks! much appreciated and also curious to hear confirmation and more info :)

ivo-1 commented 1 year ago

So, when I run the evaluation script, I don't get the same numbers for the various F1 scores.

Example:

        F1                P                   R
(UC)     64.6±1.8        64.5±1.7        64.8±1.9
address  60.0±3.0        58.7±3.0        61.3±3.1
money    46.1±4.3        46.9±4.3        45.2±4.2
town     76.9±3.8        75.6±3.9        78.3±3.9
postcode 67.3±4.4        67.0±4.2        67.8±4.7
street   34.5±4.9        33.0±4.8        36.3±5.0
name     59.8±4.5        59.9±4.5        59.7±4.4
number   87.9±3.0        89.1±2.9        86.7±3.2
income   45.6±4.7        46.2±4.8        44.8±4.4
spending 47.1±4.6        47.8±4.5        46.5±4.7
date     95.7±1.8        95.9±1.8        95.5±1.8
F1       49.9±1.4
Accuracy 4.7±1.9
Mean-F1  64.4±1.8

So I get 64.6 for (UC) F1, 49.9 for F1, and 64.4 for Mean-F1. The fact that (UC) F1 is a little bit higher than Mean-F1 makes sense to me because, as you explained, UC means uncased. In fact, in all my evaluations, (UC) F1 >= Mean-F1. So far so good. But what is the F1 that is considerably(!) lower, at just 49.9, supposed to be?

I also have some other questions:

  1. Why are there confidence intervals (±), and what do they mean, considering there is always exactly one correct answer? I'm struggling to see how this makes sense.

  2. Is the Mean-F1 a micro- or macro-average? By my calculations (summing the per-key F1 scores for [town, postcode, street, name, number, income, spending, date], which I'm not sure are (UC) or not, and dividing by 8), the macro-average is 64.35, which would match the Mean-F1 score given in the evaluation (rounded). However, for the hand-crafted run provided at https://kleister.info/challenge/kleister-charity the math doesn't check out (see the snippet after this list):

SUM (66.1±3.2 | 0±0 | 0±0 | 59.6±4.4 | 0±0 | 0±0 | 0±0 | 0±0) / 8 keys = 15.7125

This doesn't match any of the given F1 scores (Mean-F1: 24.4, F1: 24.67, F1 (UC): 24.67).

So how are these F1 scores calculated?

  3. Which of the three F1 scores are you reporting in the paper?

I think everyone would really benefit if the evaluation could be explained holistically...
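
For concreteness, here is the arithmetic from point 2 as a short Python snippet (my own calculation, using only the numbers quoted above):

```python
# Reproducing my arithmetic from point 2 (macro-average over the 8 keys),
# using only the numbers quoted above.
per_key_f1 = {
    "town": 76.9, "postcode": 67.3, "street": 34.5, "name": 59.8,
    "number": 87.9, "income": 45.6, "spending": 47.1, "date": 95.7,
}
print(sum(per_key_f1.values()) / len(per_key_f1))  # 64.35, matches Mean-F1 (64.4) when rounded

# Same calculation for the hand-crafted run from kleister.info:
hand_crafted = [66.1, 0, 0, 59.6, 0, 0, 0, 0]
print(sum(hand_crafted) / len(hand_crafted))  # 15.7125, matches none of the reported scores
```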

filipggg commented 1 year ago

@ivo-1

  1. Confidence intervals come from bootstrap sampling, similar to how it is commonly used in machine translation (see e.g. https://aclanthology.org/W04-3250.pdf).

  2. F1 is a micro-average. Mean-F1 is a macro-average, but averaged across the documents, not across data point classes (a toy sketch follows below).

  3. AFAIR it was F1, but I'd need to double-check this.
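
To illustrate 2), a toy sketch of the two averages (my own illustration with made-up counts, not the actual evaluation code):

```python
# Toy sketch of micro-averaged F1 vs. Mean-F1 (per-document macro-average),
# based on my reading of the explanation above; NOT the real evaluation script.
# Each document maps a key to (true_positives, false_positives, false_negatives).
docs = [
    {"date": (1, 0, 0), "address": (0, 1, 1)},                      # document 1
    {"date": (1, 0, 0), "address": (1, 0, 0), "income": (1, 0, 0)}, # document 2
]

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Micro-average: pool all counts across documents and keys, then compute one F1.
tp = sum(c[0] for d in docs for c in d.values())
fp = sum(c[1] for d in docs for c in d.values())
fn = sum(c[2] for d in docs for c in d.values())
micro_f1 = f1(tp, fp, fn)

# Mean-F1: compute one F1 per document (pooling its keys), then average over documents.
mean_f1 = sum(
    f1(*(sum(c[i] for c in d.values()) for i in range(3))) for d in docs
) / len(docs)

print(micro_f1, mean_f1)  # 0.8 vs. 0.75: the two averages generally differ
```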

ivo-1 commented 1 year ago

@filipggg thank you for taking the time!

Re 1: So, in a nutshell, you:

a) draw 440 predictions with replacement from the 440 total predictions;
b) evaluate these 440 samples against the solution (0 for wrong, 1 for correct, per key);
c) calculate the sample mean and sample variance accordingly;
d) use Student's t-distribution to estimate the true mean with probability 0.95 and the respective confidence interval around it;
e) repeat steps a)-d) e.g. 1000 times to get 1000 different distributions, then drop the 25 distributions with the lowest true mean and the 25 distributions with the highest true mean, and calculate the average true mean with confidence 0.95 from the remaining 950 distributions.

Correct?
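
Expressed as code, this is roughly the (simplified, percentile-style) procedure I have in mind; my own sketch, not the evaluation script, and the 440/1000/0.95 numbers are just taken from my description above:

```python
# My own simplified, percentile-style sketch of the procedure described above;
# not the actual evaluation code.
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05):
    """Return (mean, lower, upper) for the bootstrapped mean of binary scores."""
    n = len(scores)
    resample_means = []
    for _ in range(n_resamples):
        # a) draw n predictions with replacement from the n total predictions
        resample = [random.choice(scores) for _ in range(n)]
        # b)/c) score the resample (0 = wrong, 1 = correct per key) via its mean
        resample_means.append(sum(resample) / n)
    resample_means.sort()
    # e) drop the lowest and highest 2.5% of resampled means (25 each for 1000)
    lower = resample_means[int(alpha / 2 * n_resamples)]
    upper = resample_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lower, upper

# Example: 440 binary per-prediction scores, roughly 65% correct
scores = [1] * 286 + [0] * 154
print(bootstrap_ci(scores))  # e.g. (0.65, ~0.61, ~0.69)
```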

Re 2: Yes, that makes sense to me now. Just to re-iterate for future readers:

  * F1 row, (UC) column (top-left corner): micro-averaged F1 score (case-insensitive)
  * F1 (below the table with all the keys): micro-averaged F1 score (case-sensitive)
  * Mean-F1: macro-averaged F1 score (over the documents, case-insensitive)

However, I noticed that the evaluation seems to have an issue, which I will describe in a new issue.

Re 3: Would be great if you could double-check.