Living-with-machines / TargetedSenseDisambiguation

Repository for the work on Targeted Sense Disambiguation

Metrics to report in the paper #145

Open fedenanni opened 3 years ago

fedenanni commented 3 years ago

Hi all, me again on this point. I have thought about it a bit and I'll try to explain here why I would report precision, recall and F1 for label 1 instead of the macro average.

Remember that we are in a binary classification scenario with very unbalanced labels, and we want to know which method is best at correctly predicting the 1 label (i.e. best at finding correct occurrences of a specific sense). Now, consider this setting, where you have gold labels and three approaches: a majority-class baseline (which always predicts 0), a random baseline, and "our approach", which sometimes predicts the 1 label correctly. We want to know whether we are better than the baselines at capturing the 1 cases.

gold = [0,0,0,0,1]
random = [1,0,1,0,1]
majority = [0,0,0,0,0]
our = [1,0,0,0,1]

If we compute precision and recall for each class, for each method, this is what we get (for example, the label_1 scores for random come from TP = 1, FP = 2, FN = 0, so precision = 1/3 ≈ 0.33 and recall = 1/1 = 1.0):

random
label_1 = [0.33, 1.0, 0.5]
label_0 = [1.0, 0.5, 0.667]
macro = [0.667, 0.75, 0.70]

Note! I computed the macro F1 score manually from the values of precision and recall. If you use scikit-learn out of the box you will get [0.667, 0.75, 0.583], where 0.583 is not the harmonic mean of p (0.667) and r (0.75) but the average of the F1 of label_1 (0.5) and label_0 (0.667), because of this issue. Reporting [0.667, 0.75, 0.583] would, I believe, make the reader (and especially the reviewer) very confused, so I added a patch in #144 at least to compute the macro F1 correctly.
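To make the difference concrete, here is a minimal sketch of the two computations on the same toy labels (scikit-learn's macro F1 averages the per-class F1 scores, while the #144 patch takes the harmonic mean of the macro-averaged precision and recall):

from sklearn.metrics import f1_score, precision_recall_fscore_support

gold = [0,0,0,0,1]
random = [1,0,1,0,1]

# scikit-learn's macro F1: unweighted mean of the per-class F1 scores
f1_per_class_avg = f1_score(gold, random, average='macro')  # 0.583

# macro F1 as the harmonic mean of macro precision and macro recall (as in the #144 patch)
p_macro, r_macro, _, _ = precision_recall_fscore_support(gold, random, average='macro')
f1_from_macro_p_r = 2 * p_macro * r_macro / (p_macro + r_macro)  # 0.706

print(round(f1_per_class_avg, 3), round(f1_from_macro_p_r, 3))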

You can try it yourself - you'll get the same behaviour for all the other methods:

from sklearn.metrics import precision_recall_fscore_support
gold = [0,0,0,0,1]
random = [1,0,1,0,1]
majority = [0,0,0,0,0]
our = [1,0,0,0,1]

method = random
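# average='binary' with pos_label scores a single class; average='macro' averages over both classes.
# [:3] keeps precision, recall and F1 and drops the support entry.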

print ("label_1", [round(x,3) for x in precision_recall_fscore_support(gold, method, average='binary',pos_label=1)[:3]])
print ("label_0", [round(x,3) for x in precision_recall_fscore_support(gold, method, average='binary',pos_label=0)[:3]])
print ("macro", [round(x,3) for x in precision_recall_fscore_support(gold, method, average='macro')[:3]])

majority
label_1 = [0.0, 0.0, 0.0]
label_0 = [0.8, 1.0, 0.889]
macro = [0.4, 0.5, 0.44]

our
label_1 = [0.5, 1.0, 0.667]
label_0 = [1.0, 0.75, 0.857]
macro = [0.75, 0.875, 0.80]

Now, remember that our goal is to assess which method is better at finding 1s. If we consider the macro average, it seems that random and our are not that distant, and overall this seems an easy task (if you just predict randomly you score around 0.70 across the different metrics):

random = [0.667, 0.75, 0.70]
majority = [0.4, 0.5, 0.44]
our = [0.75, 0.875, 0.80]

However, if you look at label 1:

random = [0.33, 1.0, 0.5]
majority = [0.0, 0.0, 0.0]
our = [0.5, 1.0, 0.667]

The story is rather different, and it is closer to reality (especially for precision). The task is actually hard, and if you guess randomly you will return lots of false positives. Majority is a useless approach for label 1: since the majority class is label 0, it never returns anything for label 1.
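To see why random guessing looks so much worse on label 1 once the class imbalance is realistic, here is a quick illustrative simulation (the 1,000 examples and the 5% prevalence are made-up numbers, not our data): a 50/50 random guesser ends up with a label-1 precision close to the label-1 prevalence.

import random as rnd
from sklearn.metrics import precision_score

rnd.seed(0)
# assumed toy setup: 1,000 examples, roughly 5% carry the target sense (label 1)
gold = [1 if rnd.random() < 0.05 else 0 for _ in range(1000)]
pred = [rnd.randint(0, 1) for _ in range(1000)]  # 50/50 random guesser

# label-1 precision hovers around the 5% prevalence, i.e. mostly false positives
print(round(precision_score(gold, pred, pos_label=1), 3))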

If we have to suggest to a historian the best method for finding occurrences of a specific sense of "machine" (which is the goal of our ACL paper), then based on these numbers we would tell them that with our approach 50% of the retrieved results will be correct (precision), with perfect recall, while if they go with random only 33% of the retrieved results will be correct. The macro-averaged performance is not informative for the final user, because the final user does not care about how we perform on label 0.

To conclude, I would report the results for label 1, because I think it is the most meaningful metric for the task (even if the numbers will all be a bit lower - they will more precisely represent the experimental setting and the goal of the paper). I can write this part of the paper and justify the choice.

@kasparvonbeelen @BarbaraMcG @mcollardanuy @kasra-hosseini @GiorgiatolfoBL let me know what you think, and especially whether you spot any error, as I might just be missing something. However, if you prefer to go with macro, no problem, but then we should perhaps adjust the argumentation in the paper a bit, so that the metric is more in line with the problem.

kasparvonbeelen commented 3 years ago

Hi @fedenanni, thanks for the careful explanation. I do agree with you after reading this, but we should add your explanation to the paper. I like it because it ties in very well with the type of application we aim to develop, one that helps historians explore a particular sense, and this makes the overall idea of targeted sense disambiguation clearer to the reviewer. If you could rework the above explanation into the paper, that'd be great! I guess we don't need to change the code a lot, just rerun the scripts for computing the tables.

fedenanni commented 3 years ago

Ciao @kasparvonbeelen! No problem, I'll write it out and check that the flow is consistent across the paper. I added a flag in the compute_results notebook that @mcollardanuy reviewed yesterday morning; it will print our results either for macro or for label 1, depending on what we select. So it's just a matter of rerunning that script and adding the final numbers back to the paper - Mariona was also making some interesting plots yesterday, so I think she could take care of this tomorrow while we polish the evaluation section. Thanks, and sorry again for having been a bit of a pain on this point :D
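For reference, a minimal sketch of what such a switch could look like (purely illustrative: the flag name and helper below are made up, not the actual compute_results code):

from sklearn.metrics import precision_recall_fscore_support

REPORT_LABEL_1 = True  # hypothetical flag: True -> label-1 scores, False -> macro scores

def report_scores(gold, pred, label_1=REPORT_LABEL_1):
    # illustrative helper, not the notebook's real function
    if label_1:
        p, r, f, _ = precision_recall_fscore_support(gold, pred, average='binary', pos_label=1)
    else:
        p, r, _, _ = precision_recall_fscore_support(gold, pred, average='macro')
        f = 2 * p * r / (p + r)  # macro F1 as in the #144 patch
    return round(p, 3), round(r, 3), round(f, 3)

On the toy example above, report_scores(gold, our) would give (0.5, 1.0, 0.667).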

kasparvonbeelen commented 3 years ago

@fedenanni haha, no worries! After reading your comments I totally understand your point and agree. I think it will make the paper stronger (even if the numbers are a bit lower ;-) )

mcollardanuy commented 3 years ago

Thanks @fedenanni, sounds good to me as well! I've updated the notebook where the results are computed and changed the numbers in the paper accordingly. Could you have a look, just in case? https://github.com/Living-with-machines/HistoricalDictionaryExpansion/blob/dev/create_results_tables.ipynb

mcollardanuy commented 3 years ago

Ah @kasparvonbeelen, one question: is BERT1900 trained on data up to 1900 or 1920? If the latter, should we change the name to BERT1920 so it's aligned with the experiment?

kasparvonbeelen commented 3 years ago

@mcollardanuy I am actually not sure whether there was a cut-off for training BERT; I think @kasra-hosseini knows. 1900 is a proxy for the "whole nineteenth-century book corpus" (which has a few books later than 1900, I suppose). For the experiments, I used 1760-1920 to refer to the "long nineteenth century", as it is a more historically motivated periodization. Hope this is clear?

kasra-hosseini commented 3 years ago

~The four LMs were fine-tuned on books published before 1850, between 1850 and 1875, between 1875 and 1890, and between 1890 and 1900~.

mcollardanuy commented 3 years ago

Hi @kasra-hosseini, I meant BLERT. Is 1900 the end date as well?

kasra-hosseini commented 3 years ago

Hi, no, there is no end date on BLERT. Sorry, let me correct myself. The two BERT models used here are:

BarbaraMcG commented 3 years ago


I'm coming to this late, but I agree with @fedenanni's explanation; it's convincing, especially for the use case (historical research) we have in mind.