flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Having trouble reproducing results on German CoNLL03 NER task #713

Closed cgraber closed 4 years ago

cgraber commented 5 years ago

Hi, thanks for making your code available! I'm trying to reproduce the results you report on the CoNLL03 German NER task using the 0.5 release, but when I run the following script I only obtain a test F1 score of 70.48:

from pathlib import Path
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, CharLMEmbeddings
from flair.training_utils import EvaluationMetric
from typing import List

import numpy as np
import torch
import random
import argparse

# 1. get the corpus
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03_GERMAN, base_path='resources/tasks')

# 2. what tag do we want to predict?
tag_type = 'ner'

# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary.idx2item)

from flair.models import SequenceTagger

# 4. load the pre-trained German NER model
tagger = SequenceTagger.load('de-ner')

from flair.trainers import ModelTrainer

# 5. evaluate the pre-trained model on the test split
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.final_test(Path('tmp'), True, EvaluationMetric.MICRO_F1_SCORE, 32)

Additionally, I have run the script specified in the "reproducing results" page here to train a model from scratch, and I'm getting at best 83% F1 on the test data. Is there something I'm missing or doing wrong?

alanakbik commented 5 years ago

Hello @cgraber, that is strange - when I run the above script I get the following:

MICRO_AVG: acc 0.6996 - f1-score 0.8233
MACRO_AVG: acc 0.6069 - f1-score 0.736575
LOC        tp: 851 - fp: 103 - fn: 200 - tn: 851 - precision: 0.8920 - recall: 0.8097 - accuracy: 0.7374 - f1-score: 0.8489
MISC       tp: 84 - fp: 45 - fn: 122 - tn: 84 - precision: 0.6512 - recall: 0.4078 - accuracy: 0.3347 - f1-score: 0.5015
ORG        tp: 373 - fp: 127 - fn: 211 - tn: 373 - precision: 0.7460 - recall: 0.6387 - accuracy: 0.5246 - f1-score: 0.6882
PER        tp: 1091 - fp: 103 - fn: 119 - tn: 1091 - precision: 0.9137 - recall: 0.9017 - accuracy: 0.8309 - f1-score: 0.9077

Could you clarify what version of Flair you are using? 0.5 does not exist yet.
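
For reference, a quick way to check which Flair release is actually installed (a generic check for a pip-installed package, not something specific to this thread):

import pkg_resources
print(pkg_resources.get_distribution('flair').version)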

yahshibu commented 5 years ago

@alanakbik Thank you very much for your many contributions!

I ran into the same problem as @cgraber, but I used v0.4.1 of Flair.

I'm guessing our dataset files might not be identical, since we each prepared them individually.

The following is the log I got when I trained a model:

2019-02-27 00:17:55,394 MICRO_AVG: acc 0.6973 - f1-score 0.8217
2019-02-27 00:17:55,395 MACRO_AVG: acc 0.6749 - f1-score 0.7992
2019-02-27 00:17:55,395 LOC        tp: 854 - fp: 172 - fn: 181 - tn: 854 - precision: 0.8324 - recall: 0.8251 - accuracy: 0.7075 - f1-score: 0.8287
2019-02-27 00:17:55,395 MISC       tp: 455 - fp: 123 - fn: 215 - tn: 455 - precision: 0.7872 - recall: 0.6791 - accuracy: 0.5738 - f1-score: 0.7292
2019-02-27 00:17:55,395 ORG        tp: 491 - fp: 122 - fn: 282 - tn: 491 - precision: 0.8010 - recall: 0.6352 - accuracy: 0.5486 - f1-score: 0.7085
2019-02-27 00:17:55,395 PER        tp: 1103 - fp: 73 - fn: 92 - tn: 1103 - precision: 0.9379 - recall: 0.9230 - accuracy: 0.8699 - f1-score: 0.9304

I've noticed that the sum of tp and fn (i.e., the number of gold mentions) for each category differs between our logs.
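
A quick way to compare dataset versions is to count the gold mentions per entity type in the test split (tp + fn in the tables above should equal these counts). This is only a sketch, assuming the corpus is loaded as in the scripts above and that get_spans('ner') returns the gold spans of the freshly loaded data:

from collections import Counter
from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask

# load the corpus exactly as in the scripts above
corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03_GERMAN, base_path='resources/tasks')

# tp + fn per entity type should match these gold mention counts
gold_counts = Counter(span.tag for sentence in corpus.test for span in sentence.get_spans('ner'))
print(gold_counts)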

cgraber commented 5 years ago

@alanakbik I realized that I was running on the 0.5 branch; I switched over to the v0.4.1 release and got the same results as before:

2019-05-08 11:01:16,675 MICRO_AVG: acc 0.5442 - f1-score 0.7048
2019-05-08 11:01:16,675 MACRO_AVG: acc 0.4845 - f1-score 0.5906
2019-05-08 11:01:16,675 LOC        tp: 788 - fp: 166 - fn: 247 - tn: 788 - precision: 0.8260 - recall: 0.7614 - accuracy: 0.6561 - f1-score: 0.7924
2019-05-08 11:01:16,675 MISC       tp: 31 - fp: 98 - fn: 639 - tn: 31 - precision: 0.2403 - recall: 0.0463 - accuracy: 0.0404 - f1-score: 0.0776
2019-05-08 11:01:16,675 ORG        tp: 375 - fp: 125 - fn: 398 - tn: 375 - precision: 0.7500 - recall: 0.4851 - accuracy: 0.4176 - f1-score: 0.5891
2019-05-08 11:01:16,675 PER        tp: 1079 - fp: 115 - fn: 116 - tn: 1079 - precision: 0.9037 - recall: 0.9029 - accuracy: 0.8237 - f1-score: 0.9033

It looks like I have the same number of mentions as @yahshibu. I'm going to try running this again in another environment to see if I can at least match your score (my numbers seem suspiciously low).

alanakbik commented 5 years ago

Ok! I didn't realize there was a 0.5 branch - it looks like it's stale (right @tabergma?), so perhaps we should delete it to avoid confusion.

yahshibu commented 5 years ago

To clarify: I haven't resolved this problem. The log I pasted yesterday is from training my own model, not from the pre-trained model.

My point is that there might be something wrong with @alanakbik's dataset. The statistics of CoNLL-03 are shown in Table 2 of the original paper, and @cgraber's (and my) mention counts agree with the official numbers reported there.

tabergma commented 5 years ago

@alanakbik Yes, we can delete that branch. I created it just after the last release, but we haven't used it so far.

alanakbik commented 5 years ago

@yahshibu thanks for pointing this out - I'll have to take a closer look at the dataset. What embedding configuration did you train your model with?

@tabergma ok deleted it!

yahshibu commented 5 years ago

Thank you very much. I used WordEmbeddings and FlairEmbeddings (not PooledFlairEmbeddings).

The code I used is as follows:

from flair.data import TaggedCorpus
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from typing import List

corpus: TaggedCorpus = NLPTaskDataFetcher.load_corpus(NLPTask.CONLL_03_GERMAN, base_path='resources/tasks')

tag_type = 'ner'

tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# stack classic German word embeddings with contextual string embeddings
embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('de'),
    FlairEmbeddings('german-forward'),
    FlairEmbeddings('german-backward'),
]
embeddings: StackedEmbeddings = StackedEmbeddings(embeddings=embedding_types)

from flair.models import SequenceTagger
tagger: SequenceTagger = SequenceTagger(hidden_size=256,
                                        embeddings=embeddings,
                                        tag_dictionary=tag_dictionary,
                                        tag_type=tag_type)

from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-ner', max_epochs=150)
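
For comparison, a variant that swaps in the PooledFlairEmbeddings mentioned above would only differ in the embedding stack. This is just a sketch (assuming PooledFlairEmbeddings is available in the installed Flair version), not the configuration I actually trained with:

from typing import List
from flair.embeddings import TokenEmbeddings, WordEmbeddings, PooledFlairEmbeddings

embedding_types: List[TokenEmbeddings] = [
    WordEmbeddings('de'),
    # pooled variant of the contextual string embeddings
    PooledFlairEmbeddings('german-forward'),
    PooledFlairEmbeddings('german-backward'),
]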

alanakbik commented 5 years ago

Hello all, thread #1102 provides the answer to the different evaluation numbers on CoNLL-03 German.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.