@weiczhu, we are using only the positive labels ('B-', 'I-', excluding 'O') for calculating the F-score. The get_classes method of the metric object returned from the evaluation returns a list with all positive labels; we don't include 'O' in that list.
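For example, a minimal sketch assuming a Flair version around 0.4.x, where flair.training_utils.Metric exposes add_tp/add_fp/add_fn, get_classes() and micro_avg_f_score() (the counts below are made up purely for illustration):

```python
# Minimal sketch: how the metric object treats classes.
# Assumes flair.training_utils.Metric (circa Flair 0.4.x); counts are invented.
from flair.training_utils import Metric

metric = Metric("ner-eval")

# Record span-level outcomes per entity type; 'O' is never added,
# so it never shows up as a class.
for _ in range(8):
    metric.add_tp("PER")     # correctly extracted PER spans
metric.add_fp("PER")         # spurious PER span
metric.add_fn("PER")         # missed PER span
for _ in range(5):
    metric.add_tp("LOC")
metric.add_fn("LOC")

print(metric.get_classes())        # e.g. ['LOC', 'PER'] -- no 'O'
print(metric.micro_avg_f_score())  # micro-averaged F1 over positive classes only
```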
Hi, I am wondering if you follow the official CoNLL-03 evaluation, where the F1 is measured based on the full extraction of named entities?
Hi, yes, we follow the official evaluation standard. Our script produces results identical to the CoNLL-03 script on BIO-formatted tags and slightly lower results on BIOES-formatted tags (the original CoNLL-03 script does not always evaluate BIOES correctly).
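To make "full extraction" concrete, here is a rough, self-contained sketch of span-level scoring over BIO tags (an illustration, not the official conlleval script): a predicted entity only counts if its type, start and end all match the gold span.

```python
# Self-contained sketch of span-level (CoNLL-style) F1 over BIO tags.
# This is an illustration, not the official conlleval script.

def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != ent_type
        ):
            if ent_type is not None:
                spans.add((ent_type, start, i))
            start, ent_type = (i, tag[2:]) if tag.startswith(("B-", "I-")) else (None, None)
    return spans

def span_f1(gold_tags, pred_tags):
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)                           # exact match on type + boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]             # truncated PER span counts as wrong
print(span_f1(gold, pred))                          # 0.5
```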
I see! It means than the f1 that you reported on the paper is based on the full extraction of NE not based on the tags. But the metric class of your code is based on the tags ? rights?
Hi, no, in both cases the metric is the same. Both the code and the paper use the same metric as the CoNLL-03 script. For spans, we always compare the full extraction of the NE.
The metric class can do both spans and tags, depending on the task. For PoS we use tags, but for NER we use the full NE span.
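As a toy illustration of that distinction (my own example, not Flair's internal code): for PoS each token is an evaluation item, while for NER the item is the whole entity span, so a partially recovered entity scores nothing.

```python
# Illustration of "tags vs. spans" as evaluation items (not Flair's internals).

# PoS: each token is scored independently.
gold_pos = ["DET", "NOUN", "VERB"]
pred_pos = ["DET", "NOUN", "ADJ"]
token_accuracy = sum(g == p for g, p in zip(gold_pos, pred_pos)) / len(gold_pos)
print(token_accuracy)                # 0.666...

# NER: the unit is the full span; a partially recovered entity scores zero.
gold_spans = {("PER", 0, 2)}         # a two-token entity
pred_spans = {("PER", 0, 1)}         # only the first token was tagged
print(len(gold_spans & pred_spans))  # 0 true positives despite one correct token
```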
Hi, to follow up on this. Is it possible to somehow get the predictions out (i.e. in a y_pred list) so that we can calculate F1 using other libraries? Thanks!
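One way to do this, sketched under the assumption of an older Flair API (token.get_tag and predict without a label_name argument; newer versions use get_label and predict(..., label_name=...)), is to read the gold tags before predicting and then hand both sequences to a library such as seqeval:

```python
# Hedged sketch: extract gold and predicted BIO tags and score them with seqeval.
# Assumes an older Flair API; adapt get_tag/predict to your installed version.
from flair.models import SequenceTagger
from seqeval.metrics import f1_score

tagger = SequenceTagger.load("ner")

y_true, y_pred = [], []
for sentence in corpus.test:   # `corpus` is your labelled Corpus (assumed defined)
    gold = [token.get_tag("ner").value for token in sentence]  # read gold first
    tagger.predict(sentence)                                   # then predict
    pred = [token.get_tag("ner").value for token in sentence]
    y_true.append(gold)
    y_pred.append(pred)

print(f1_score(y_true, y_pred))  # span-level F1, comparable to the CoNLL script
```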
Hi, we are planning to do a comparison between Flair and other models for NER. As for the F1 score, there is a function get_classes; could you clarify which classes are used for calculating the F1 score? Do you calculate the micro average over all the labels ('B-', 'I-', including 'O') or only over the positive labels ('B-', 'I-', excluding 'O')? I ask because some BERT-based NER evaluations calculate the F1 score only over positive classes, and we want a fair comparison between Flair and those models. Thanks, looking forward to your reply. Travis
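To make the difference I mean concrete, here is a toy token-level example (using sklearn; Flair's NER metric itself is span-level, but the same question of whether 'O' is counted applies):

```python
# Toy token-level illustration of including vs. excluding 'O' in micro F1.
from sklearn.metrics import f1_score

y_true = ["O", "O", "O", "O", "O", "O", "B-PER", "I-PER", "B-LOC", "O"]
y_pred = ["O", "O", "O", "O", "O", "O", "B-PER", "O",     "B-LOC", "O"]

positive = ["B-PER", "I-PER", "B-LOC"]

# Micro F1 over all labels, 'O' included: dominated by the easy 'O' tokens.
print(f1_score(y_true, y_pred, average="micro",
               labels=sorted(set(y_true + y_pred))))   # 0.9

# Micro F1 over positive labels only.
print(f1_score(y_true, y_pred, average="micro", labels=positive))  # 0.8
```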