flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Trouble reproducing your GermEval 2014 results #778

Closed · wblacoe closed this issue 5 years ago

wblacoe commented 5 years ago

Hi. I can't reproduce your F1 score of 84.65 for GermEval 2014. I used everything out of the box:

from flair import datasets
import flair.models

# load the GermEval 2014 corpus and the pre-trained German NER model
corpus = datasets.GERMEVAL()
tagger = flair.models.SequenceTagger.load('de-ner-germeval')
# evaluate on the test split
sentences = corpus.test.sentences
result, loss = tagger.evaluate(sentences)
print(result.detailed_results)

But my results are these:

MICRO_AVG: acc 0.6475 - f1-score 0.7861
MACRO_AVG: acc 0.3776 - f1-score 0.4910583333333334
LOC        tp: 1449 - fp: 240 - fn: 257 - tn: 1449 - precision: 0.8579 - recall: 0.8494 - accuracy: 0.7446 - f1-score: 0.8536
LOCderiv   tp: 486 - fp: 98 - fn: 75 - tn: 486 - precision: 0.8322 - recall: 0.8663 - accuracy: 0.7375 - f1-score: 0.8489
LOCpart    tp: 48 - fp: 31 - fn: 61 - tn: 48 - precision: 0.6076 - recall: 0.4404 - accuracy: 0.3429 - f1-score: 0.5107
ORG        tp: 795 - fp: 322 - fn: 355 - tn: 795 - precision: 0.7117 - recall: 0.6913 - accuracy: 0.5401 - f1-score: 0.7014
ORGderiv   tp: 0 - fp: 0 - fn: 8 - tn: 0 - precision: 0.0000 - recall: 0.0000 - accuracy: 0.0000 - f1-score: 0.0000
ORGpart    tp: 109 - fp: 105 - fn: 63 - tn: 109 - precision: 0.5093 - recall: 0.6337 - accuracy: 0.3935 - f1-score: 0.5647
OTH        tp: 372 - fp: 144 - fn: 325 - tn: 372 - precision: 0.7209 - recall: 0.5337 - accuracy: 0.4423 - f1-score: 0.6133
OTHderiv   tp: 10 - fp: 5 - fn: 29 - tn: 10 - precision: 0.6667 - recall: 0.2564 - accuracy: 0.2273 - f1-score: 0.3704
OTHpart    tp: 2 - fp: 4 - fn: 40 - tn: 2 - precision: 0.3333 - recall: 0.0476 - accuracy: 0.0435 - f1-score: 0.0833
PER        tp: 1487 - fp: 222 - fn: 152 - tn: 1487 - precision: 0.8701 - recall: 0.9073 - accuracy: 0.7990 - f1-score: 0.8883
PERderiv   tp: 2 - fp: 1 - fn: 9 - tn: 2 - precision: 0.6667 - recall: 0.1818 - accuracy: 0.1667 - f1-score: 0.2857
PERpart    tp: 5 - fp: 9 - fn: 39 - tn: 5 - precision: 0.3571 - recall: 0.1136 - accuracy: 0.0943 - f1-score: 0.1724

Did you use a different F1 measure than micro-averaged? Or a different set of classes? Thanks
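
As a sanity check, the reported micro-averaged F1 can be recomputed by pooling the per-class counts from the table above (a quick sketch, not Flair's evaluation code):

# (tp, fp, fn) per class, copied from the table above
counts = {
    'LOC': (1449, 240, 257), 'LOCderiv': (486, 98, 75), 'LOCpart': (48, 31, 61),
    'ORG': (795, 322, 355), 'ORGderiv': (0, 0, 8), 'ORGpart': (109, 105, 63),
    'OTH': (372, 144, 325), 'OTHderiv': (10, 5, 29), 'OTHpart': (2, 4, 40),
    'PER': (1487, 222, 152), 'PERderiv': (2, 1, 9), 'PERpart': (5, 9, 39),
}
tp = sum(t for t, _, _ in counts.values())  # 4765
fp = sum(f for _, f, _ in counts.values())  # 1181
fn = sum(n for _, _, n in counts.values())  # 1413
p, r = tp / (tp + fp), tp / (tp + fn)
print(2 * p * r / (p + r))                  # ~0.7861, matching MICRO_AVG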

alanakbik commented 5 years ago

Thanks for reporting this - I was able to reproduce your results. I'll take a closer look at what's happening here and report back!

alanakbik commented 5 years ago

Hello @wblacoe, the model was indeed not working correctly, but when I trained a new one with the current version I got good results. I think the GermEval model was very old, trained many Flair versions ago, so it's possible that changes since then have affected its accuracy. I've been planning for a while to retrain all models for the new version and will do this at the latest for the next major release.

In the meantime, I've just pushed a PR that updates the GermEval model. The new model reaches an F1 score of 84.85 using the standard learning parameters. I did not do any experimentation or try any of the new embeddings, so better results are likely possible.
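
For reference, a standard training run looks roughly like the sketch below. The exact embeddings and parameters behind the updated model aren't stated here, so the choices (German word embeddings plus Flair LM embeddings, default trainer settings) are assumptions:

from flair.datasets import GERMEVAL
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus and tag dictionary for the NER task
corpus = GERMEVAL()
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# assumed embedding setup: German word embeddings plus Flair LM embeddings
embeddings = StackedEmbeddings([
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward'),
])

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner')

# standard learning parameters (defaults: SGD, lr=0.1, mini-batch size 32)
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/de-ner-germeval', max_epochs=150)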

If you want to try the model, you can either update your Flair version to the current master, or download the model from here and load it into the SequenceTagger. Let me know if this works!
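
For loading from a file: on current Flair versions, SequenceTagger.load also accepts a file path (older versions used load_from_file); the path below is a placeholder for wherever you saved the download:

from flair.models import SequenceTagger

# load the updated model from a local file (hypothetical path)
tagger = SequenceTagger.load('models/de-ner-germeval-updated.pt')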

wblacoe commented 5 years ago

Thanks a bunch for checking this, @alanakbik! Yes, it works for me now too:

MICRO_AVG: acc 0.7368 - f1-score 0.8485
MACRO_AVG: acc 0.4965 - f1-score 0.6249833333333333
LOC        tp: 1554 - fp: 192 - fn: 152 - tn: 1554 - precision: 0.8900 - recall: 0.9109 - accuracy: 0.8188 - f1-score: 0.9003
LOCderiv   tp: 525 - fp: 65 - fn: 36 - tn: 525 - precision: 0.8898 - recall: 0.9358 - accuracy: 0.8387 - f1-score: 0.9122
LOCpart    tp: 60 - fp: 6 - fn: 49 - tn: 60 - precision: 0.9091 - recall: 0.5505 - accuracy: 0.5217 - f1-score: 0.6857
ORG        tp: 884 - fp: 196 - fn: 266 - tn: 884 - precision: 0.8185 - recall: 0.7687 - accuracy: 0.6568 - f1-score: 0.7928
ORGderiv   tp: 2 - fp: 1 - fn: 6 - tn: 2 - precision: 0.6667 - recall: 0.2500 - accuracy: 0.2222 - f1-score: 0.3636
ORGpart    tp: 107 - fp: 44 - fn: 65 - tn: 107 - precision: 0.7086 - recall: 0.6221 - accuracy: 0.4954 - f1-score: 0.6625
OTH        tp: 444 - fp: 134 - fn: 253 - tn: 444 - precision: 0.7682 - recall: 0.6370 - accuracy: 0.5343 - f1-score: 0.6965
OTHderiv   tp: 23 - fp: 13 - fn: 16 - tn: 23 - precision: 0.6389 - recall: 0.5897 - accuracy: 0.4423 - f1-score: 0.6133
OTHpart    tp: 11 - fp: 4 - fn: 31 - tn: 11 - precision: 0.7333 - recall: 0.2619 - accuracy: 0.2391 - f1-score: 0.3860
PER        tp: 1528 - fp: 140 - fn: 111 - tn: 1528 - precision: 0.9161 - recall: 0.9323 - accuracy: 0.8589 - f1-score: 0.9241
PERderiv   tp: 3 - fp: 4 - fn: 8 - tn: 3 - precision: 0.4286 - recall: 0.2727 - accuracy: 0.2000 - f1-score: 0.3333
PERpart    tp: 7 - fp: 10 - fn: 37 - tn: 7 - precision: 0.4118 - recall: 0.1591 - accuracy: 0.1296 - f1-score: 0.2295
pascalhuszar commented 4 years ago

Is the metric evaluation script "strict" for the GermEval14 set? For instance, if LOCderiv is predicted as LOC, will this be considered a (truly) false prediction? @alanakbik

alanakbik commented 4 years ago

Yes, it does exact matching, so it takes the strict interpretation and counts this as false.
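
A minimal sketch of what this strict span matching means for the counts (an illustration, not Flair's actual evaluation code):

# a predicted span counts as a true positive only if offsets AND tag match exactly
gold = {(12, 13, 'LOCderiv')}        # gold span tagged LOCderiv
pred = {(12, 13, 'LOC')}             # same span, predicted as LOC

tp = gold & pred                     # empty: the tags differ
fp = pred - gold                     # the LOC prediction is a false positive
fn = gold - pred                     # the missed LOCderiv is a false negative
print(len(tp), len(fp), len(fn))     # -> 0 1 1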