Open · ivo-1 opened 1 year ago

The evaluation script puts out 3 different F1 scores: F1 for the (UC) row, F1, and Mean-F1. I have 3 questions about these scores.

Thanks again for your work!
cc: @filipggg
ad 1) UC -> uncased (we are not checking the correctness of the casing)
ad 2) @filipggg should know the answer
ad 3) F1 (UC) or Mean F1 (both should give the same numbers) -> but @filipggg please confirm
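To illustrate what uncased means here (just a sketch, not the actual code of the evaluation script):

```python
# Illustration only: "uncased" (UC) evaluation ignores letter case when
# comparing an expected value against a prediction.
expected, predicted = "Main Street", "MAIN STREET"
print(expected == predicted)                  # False -> counted as wrong in the cased F1
print(expected.lower() == predicted.lower())  # True  -> counted as correct in the (UC) F1
```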
thanks! much appreciated and also curious to hear confirmation and more info :)
So, when I run the evaluation script, I don't get the same numbers for the various F1 scores.
Example:
           F1        P         R
(UC)       64.6±1.8  64.5±1.7  64.8±1.9
address    60.0±3.0  58.7±3.0  61.3±3.1
money      46.1±4.3  46.9±4.3  45.2±4.2
town       76.9±3.8  75.6±3.9  78.3±3.9
postcode   67.3±4.4  67.0±4.2  67.8±4.7
street     34.5±4.9  33.0±4.8  36.3±5.0
name       59.8±4.5  59.9±4.5  59.7±4.4
number     87.9±3.0  89.1±2.9  86.7±3.2
income     45.6±4.7  46.2±4.8  44.8±4.4
spending   47.1±4.6  47.8±4.5  46.5±4.7
date       95.7±1.8  95.9±1.8  95.5±1.8

F1         49.9±1.4
Accuracy    4.7±1.9
Mean-F1    64.4±1.8
So I get 64.6 for (UC) F1, 49.9 for F1 and 64.4 for Mean-F1. The fact that (UC) F1 is a little bit higher than Mean-F1 makes sense to me because, as you explained, UC means uncased. In fact, in all my evaluations, (UC) F1 >= Mean-F1. So far so good. But what is the plain F1, which at just 49.9 is considerably(!) lower, supposed to be?
I also have some other questions:
Why are there confidence intervals (±), and what do they mean, considering there is always exactly one correct answer? I'm struggling to see how this makes sense.
Is the Mean-F1 a micro- or macro-average? By my calculation (summing the F1 scores of the keys [town, postcode, street, name, number, income, spending, date], which I'm not sure are (UC) or not, and dividing by 8), the macro-average is 64.35, which would align with the Mean-F1 score given in the evaluation (rounded). However, for the hand-crafted run provided at https://kleister.info/challenge/kleister-charity the math doesn't check out:
SUM (66.1±3.2 | 0±0 | 0±0 | 59.6±4.4 | 0±0 | 0±0 | 0±0 | 0±0) / 8 keys = 15.7125
This doesn't match any of the given F1 scores (Mean-F1: 24.4, F1: 24.67, F1 (UC): 24.67).
So how are these F1 scores calculated?
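For reference, here is the arithmetic above as a quick Python check. The key list is just the one from my own calculation; whether the script really averages over exactly these keys is an assumption on my part:

```python
# Macro-average over keys, using the per-key F1 scores from my run above.
per_key_f1 = {
    "town": 76.9, "postcode": 67.3, "street": 34.5, "name": 59.8,
    "number": 87.9, "income": 45.6, "spending": 47.1, "date": 95.7,
}
print(sum(per_key_f1.values()) / len(per_key_f1))  # ~64.35

# Same calculation for the hand-crafted run from kleister.info:
hand_crafted = [66.1, 0, 0, 59.6, 0, 0, 0, 0]
print(sum(hand_crafted) / len(hand_crafted))  # ~15.7125 -- doesn't match 24.4
```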
I think everyone would really benefit if the evaluation could be explained holistically...
@ivo-1
Confidence intervals come from bootstrap sampling, similarly to how it is commonly used in machine translation (see e.g. https://aclanthology.org/W04-3250.pdf).
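Roughly, the idea is the following (a minimal percentile-bootstrap sketch over per-document scores; the exact resampling and interval computation in the evaluation script may differ):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-document scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # resample the documents with replacement and recompute the mean score
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(scores) / len(scores), (lo, hi)

# e.g. 0/1 correctness per document for one key (made-up data):
scores = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
mean, (lo, hi) = bootstrap_ci(scores)
print(f"{mean:.2f} ({lo:.2f} - {hi:.2f})")
```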
F1 is a micro-average. Mean-F1 is a macro-average, but averaged across documents, not across data point classes.
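A toy illustration of the difference (the documents and counts below are made up, and this is not the actual evaluation code):

```python
docs = [
    {"tp": 3, "fp": 1, "fn": 2},  # document 1
    {"tp": 0, "fp": 2, "fn": 4},  # document 2
    {"tp": 5, "fp": 0, "fn": 1},  # document 3
]

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# micro-average: pool the counts over all documents, then compute a single F1
micro_f1 = f1(sum(d["tp"] for d in docs),
              sum(d["fp"] for d in docs),
              sum(d["fn"] for d in docs))

# Mean-F1: compute F1 per document, then average the per-document scores
mean_f1 = sum(f1(d["tp"], d["fp"], d["fn"]) for d in docs) / len(docs)

print(f"micro-F1: {micro_f1:.3f}, Mean-F1: {mean_f1:.3f}")
```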
AFAIR it was F1, but I'd need to double-check this.
@filipggg thank you for taking the time!
Re 1: So, in a nutshell, you:
a) draw 440 predictions with replacement from the 440 total predictions,
b) evaluate these 440 samples against the solution (0 for wrong, 1 for correct, per key),
c) calculate the sample mean and sample variance accordingly,
d) use Student's t-distribution to estimate the true mean with probability 0.95 and the respective confidence interval around it,
e) repeat steps a)-d) e.g. 1000 times to get 1000 different distributions, then drop the 25 distributions with the lowest true mean and the 25 with the highest true mean, and calculate the average true mean with confidence 0.95 from the remaining 950 distributions.
Correct?
Re 2: Yes, that makes sense to me now. Just to reiterate for future readers:
F1 column of the (UC) row (top left corner): micro-averaged F1 score (case-insensitive)
F1 (below the table with all the keys): micro-averaged F1 score (case-sensitive)
Mean-F1: macro-averaged F1 score (over the documents) (case-insensitive)
However, I noticed that the evaluation seems to have an issue, which I will describe in a new issue.
Re 3: Would be great if you could double-check.