Hello @cgraber that is strange - when I run the above script I get the following:
MICRO_AVG: acc 0.6996 - f1-score 0.8233
MACRO_AVG: acc 0.6069 - f1-score 0.736575
LOC tp: 851 - fp: 103 - fn: 200 - tn: 851 - precision: 0.8920 - recall: 0.8097 - accuracy: 0.7374 - f1-score: 0.8489
MISC tp: 84 - fp: 45 - fn: 122 - tn: 84 - precision: 0.6512 - recall: 0.4078 - accuracy: 0.3347 - f1-score: 0.5015
ORG tp: 373 - fp: 127 - fn: 211 - tn: 373 - precision: 0.7460 - recall: 0.6387 - accuracy: 0.5246 - f1-score: 0.6882
PER tp: 1091 - fp: 103 - fn: 119 - tn: 1091 - precision: 0.9137 - recall: 0.9017 - accuracy: 0.8309 - f1-score: 0.9077
Could you clarify what version of Flair you are using? 0.5 does not exist yet.
@alanakbik Thank you very much for your many contributions!
I ran into the same problem as @cgraber, but I was using Flair v0.4.1.
I suspect our dataset files are not identical, since we each prepared them independently.
The following is the log I got when I trained a model:
2019-02-27 00:17:55,394 MICRO_AVG: acc 0.6973 - f1-score 0.8217
2019-02-27 00:17:55,395 MACRO_AVG: acc 0.6749 - f1-score 0.7992
2019-02-27 00:17:55,395 LOC tp: 854 - fp: 172 - fn: 181 - tn: 854 - precision: 0.8324 - recall: 0.8251 - accuracy: 0.7075 - f1-score: 0.8287
2019-02-27 00:17:55,395 MISC tp: 455 - fp: 123 - fn: 215 - tn: 455 - precision: 0.7872 - recall: 0.6791 - accuracy: 0.5738 - f1-score: 0.7292
2019-02-27 00:17:55,395 ORG tp: 491 - fp: 122 - fn: 282 - tn: 491 - precision: 0.8010 - recall: 0.6352 - accuracy: 0.5486 - f1-score: 0.7085
2019-02-27 00:17:55,395 PER tp: 1103 - fp: 73 - fn: 92 - tn: 1103 - precision: 0.9379 - recall: 0.9230 - accuracy: 0.8699 - f1-score: 0.9304
I've noticed that the sum of tp and fn (i.e. the number of gold mentions) for each category differs between our runs; for example, LOC is 854 + 181 = 1035 in my log versus 851 + 200 = 1051 above.
@alanakbik I realized that I was running on the 0.5 branch; I switched over to the v0.4.1 release, and I got the same results:
2019-05-08 11:01:16,675 MICRO_AVG: acc 0.5442 - f1-score 0.7048
2019-05-08 11:01:16,675 MACRO_AVG: acc 0.4845 - f1-score 0.5906
2019-05-08 11:01:16,675 LOC tp: 788 - fp: 166 - fn: 247 - tn: 788 - precision: 0.8260 - recall: 0.7614 - accuracy: 0.6561 - f1-score: 0.7924
2019-05-08 11:01:16,675 MISC tp: 31 - fp: 98 - fn: 639 - tn: 31 - precision: 0.2403 - recall: 0.0463 - accuracy: 0.0404 - f1-score: 0.0776
2019-05-08 11:01:16,675 ORG tp: 375 - fp: 125 - fn: 398 - tn: 375 - precision: 0.7500 - recall: 0.4851 - accuracy: 0.4176 - f1-score: 0.5891
2019-05-08 11:01:16,675 PER tp: 1079 - fp: 115 - fn: 116 - tn: 1079 - precision: 0.9037 - recall: 0.9029 - accuracy: 0.8237 - f1-score: 0.9033
It looks like I have the same number of mentions as @yahshibu. I'm going to try running this again in another environment and see if I can at least match your score (my numbers seem suspiciously low).
Ok! I didn't realize there was an 0.5 branch - it looks like it's stale (right @tabergma?), so perhaps we should delete it to avoid confusion.
For clarification:
I haven't resolved this problem. The log I pasted yesterday was from training my own model, not from the pre-trained model.
My point is that there might be something wrong with @alanakbik's dataset. The statistics of CoNLL-03 are shown in Table 2 of the original paper, and @cgraber's (and my own) mention counts agree with the official numbers reported there.
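As a sanity check, the gold mention counts can be computed directly from the corpus and compared against Table 2 of the paper. Here is a minimal sketch (assuming the 0.4.1 API, i.e. NLPTaskDataFetcher and Span.tag, and the same base_path as in the training script below):
from collections import Counter
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
# load the locally prepared CoNLL-03 German data (base_path is an assumption)
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03_GERMAN, base_path='resources/tasks')
# count gold mentions (spans) per label in the test split
mention_counts = Counter()
for sentence in corpus.test:
    for span in sentence.get_spans('ner'):
        mention_counts[span.tag] += 1
# these totals should match tp + fn per category and the official statistics
for label, count in sorted(mention_counts.items()):
    print(label, count)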
@alanakbik Yes, we can delete that branch. We created it just after the last release, but we have never used it so far.
@yahshibu thanks for pointing this out - I'll have to take a closer look at the dataset. What embedding configuration did you train your model with?
@tabergma ok deleted it!
Thank you very much. I used WordEmbeddings and FlairEmbeddings (not PooledFlairEmbeddings).
The code I used is as follows:
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from typing import List
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03_GERMAN, base_path='resources/tasks')
tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)
from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner', max_epochs=150)
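After training finishes, the resulting model can be loaded back for a quick spot check (a minimal sketch; final-model.pt is the file name ModelTrainer writes by default, and the example sentence is arbitrary):
from flair.data import Sentence
from flair.models import SequenceTagger
# load the model written by the training run above
tagger = SequenceTagger.load_from_file('resources/taggers/example-ner/final-model.pt')
# tag an example German sentence and print the predicted spans
sentence = Sentence('George Washington ging nach Washington .')
tagger.predict(sentence)
print(sentence.to_tagged_string())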
Hello all, thread #1102 explains the differing evaluation numbers on CoNLL-03 German.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, thanks for making your code available! I'm trying to reproduce the results you report on the CoNLL03 German NER task using the 0.5 release. When I run the following script, though, I only obtain a test F1 score of 70.48:
Additionally, I have run the script specified in the "reproducing results" page here to train a model from scratch, and I'm getting at best 83% F1 on the test data. Is there something I'm missing or doing wrong?
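(The script itself is not included in this thread; purely as an illustration, a check against the released pre-trained German model might look like the following sketch, assuming the 'de-ner' model identifier.)
from flair.data import Sentence
from flair.models import SequenceTagger
# download and load the released German NER model
tagger = SequenceTagger.load('de-ner')
# tag an example sentence and inspect the predicted entities
sentence = Sentence('Angela Merkel besuchte Berlin .')
tagger.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)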