@weiczhu, we are using only the positive labels ('B-', 'I-', excluding 'O') for calculating the F-score. The get_classes method of the metric object returned from the evaluation returns a list with all positive labels; we don't include 'O' in that list.
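For example, a minimal sketch assuming a Flair version around 0.4.x, where flair.training_utils.Metric exposes add_tp/add_fp/add_fn, get_classes() and micro_avg_f_score() (the counts below are made up purely for illustration):

```python
# Minimal sketch: how the metric object treats classes.
# Assumes flair.training_utils.Metric (circa Flair 0.4.x); counts are invented.
from flair.training_utils import Metric

metric = Metric("ner-eval")

# Record span-level outcomes per entity type; 'O' is never added,
# so it never shows up as a class.
for _ in range(8):
    metric.add_tp("PER")     # correctly extracted PER spans
metric.add_fp("PER")         # spurious PER span
metric.add_fn("PER")         # missed PER span
for _ in range(5):
    metric.add_tp("LOC")
metric.add_fn("LOC")

print(metric.get_classes())        # e.g. ['LOC', 'PER'] -- no 'O'
print(metric.micro_avg_f_score())  # micro-averaged F1 over positive classes only
```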
Hi, I am wondering if you follow the official CoNLL-03 evaluation, where the F1 is measured based on the full extraction of named entities?
Hi, yes, we follow the official evaluation standard. Our script produces results identical to the CoNLL-03 script on BIO-formatted tags and slightly lower results on BIOES-formatted tags (the original CoNLL-03 script does not always evaluate BIOES correctly).
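To make "full extraction" concrete, here is a rough, self-contained sketch of span-level scoring over BIO tags (an illustration, not the official conlleval script): a predicted entity only counts if its type, start and end all match the gold span.

```python
# Self-contained sketch of span-level (CoNLL-style) F1 over BIO tags.
# This is an illustration, not the official conlleval script.

def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (type, start, end) spans."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (
            tag.startswith("I-") and tag[2:] != ent_type
        ):
            if ent_type is not None:
                spans.add((ent_type, start, i))
            start, ent_type = (i, tag[2:]) if tag.startswith(("B-", "I-")) else (None, None)
    return spans

def span_f1(gold_tags, pred_tags):
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)                           # exact match on type + boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O",     "O", "B-LOC"]             # truncated PER span counts as wrong
print(span_f1(gold, pred))                          # 0.5
```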
I see! It means than the f1 that you reported on the paper is based on the full extraction of NE not based on the tags. But the metric class of your code is based on the tags ? rights?
Hi, no, in both cases the metric is the same. Both the code and the paper use the same metric as the CoNLL-03 script. For spans, we always compare the full extraction of the NE.
The metric class can do both spans and tags, depending on the task. For PoS we use tags, but for NER we use the full NE span.
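As a toy illustration of that distinction (my own example, not Flair's internal code): for PoS each token is an evaluation item, while for NER the item is the whole entity span, so a partially recovered entity scores nothing.

```python
# Illustration of "tags vs. spans" as evaluation items (not Flair's internals).

# PoS: each token is scored independently.
gold_pos = ["DET", "NOUN", "VERB"]
pred_pos = ["DET", "NOUN", "ADJ"]
token_accuracy = sum(g == p for g, p in zip(gold_pos, pred_pos)) / len(gold_pos)
print(token_accuracy)                # 0.666...

# NER: the unit is the full span; a partially recovered entity scores zero.
gold_spans = {("PER", 0, 2)}         # a two-token entity
pred_spans = {("PER", 0, 1)}         # only the first token was tagged
print(len(gold_spans & pred_spans))  # 0 true positives despite one correct token
```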
Hi, to follow up on this. Is it possible to somehow get the predictions out (i.e. in a y_pred list) so that we can calculate F1 using other libraries? Thanks!
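One way to do this, sketched under the assumption of an older Flair API (token.get_tag and predict without a label_name argument; newer versions use get_label and predict(..., label_name=...)), is to read the gold tags before predicting and then hand both sequences to a library such as seqeval:

```python
# Hedged sketch: extract gold and predicted BIO tags and score them with seqeval.
# Assumes an older Flair API; adapt get_tag/predict to your installed version.
from flair.models import SequenceTagger
from seqeval.metrics import f1_score

tagger = SequenceTagger.load("ner")

y_true, y_pred = [], []
for sentence in corpus.test:   # `corpus` is your labelled Corpus (assumed defined)
    gold = [token.get_tag("ner").value for token in sentence]  # read gold first
    tagger.predict(sentence)                                   # then predict
    pred = [token.get_tag("ner").value for token in sentence]
    y_true.append(gold)
    y_pred.append(pred)

print(f1_score(y_true, y_pred))  # span-level F1, comparable to the CoNLL script
```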
Hi, we are planning to do a comparison between Flair and other models for NER. As for the F1 score, there is a function get_classes; could you clarify which classes are used for calculating the F1 score? Do you calculate the micro average over all the labels ('B-', 'I-', including 'O') or only over the positive labels ('B-', 'I-', excluding 'O')? I ask because some BERT-based NER evaluations calculate the F1 score only over positive classes, and we want a fair comparison between Flair and those models. Thanks, looking forward to your reply. Travis
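To make the difference I mean concrete, here is a toy token-level example (using sklearn; Flair's NER metric itself is span-level, but the same question of whether 'O' is counted applies):

```python
# Toy token-level illustration of including vs. excluding 'O' in micro F1.
from sklearn.metrics import f1_score

y_true = ["O", "O", "O", "O", "O", "O", "B-PER", "I-PER", "B-LOC", "O"]
y_pred = ["O", "O", "O", "O", "O", "O", "B-PER", "O",     "B-LOC", "O"]

positive = ["B-PER", "I-PER", "B-LOC"]

# Micro F1 over all labels, 'O' included: dominated by the easy 'O' tokens.
print(f1_score(y_true, y_pred, average="micro",
               labels=sorted(set(y_true + y_pred))))   # 0.9

# Micro F1 over positive labels only.
print(f1_score(y_true, y_pred, average="micro", labels=positive))  # 0.8
```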