flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Custom NER labels - how to format email in training? #2308

Closed codebynao closed 3 years ago

codebynao commented 3 years ago

I am trying to train a model in French with some custom NER labels; however, I can't manage to get emails detected properly.

My first dataset looked like:

Mon O
adresse O
mail O
est O
naomi@gmail.com B-EMAIL
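
For reference, in this two-column CoNLL-style format each line is one pre-tokenized token plus its BIO tag, so naomi@gmail.com is a single training token. Roughly how such a file gets loaded (same data folder and column mapping as in my training script below):

from flair.datasets import ColumnCorpus

# column 0 holds the token text, column 1 the NER tag
corpus = ColumnCorpus('/root/ner/data', {0: 'text', 1: 'ner'})

# inspect the first training sentence and its tags
print(corpus.train[0].to_tagged_string('ner'))
# e.g. Mon adresse mail est naomi@gmail.com <B-EMAIL>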

I tested my model with:

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('./model/best-model.pt')

def tag_sentence(tagger: SequenceTagger, text: str):
    sentence = Sentence(text)
    tagger.predict(sentence)
    print(sentence.to_tagged_string())

text = "Mon adresse mail est naomi@gmail.com"
tag_sentence(tagger, text)
# Mon adresse mail est naomi @ gmail.com <B-EMAIL>

Only gmail.com is detected as B-EMAIL.

I also noticed that the email was split (naomi@gmail.com => naomi @ gmail.com), so on another try I changed my dataset format to the following to see if it would make a difference:

Mon O
adresse O
mail O
est O
naomi B-EMAIL
@ I-EMAIL
gmail.com I-EMAIL

Both training formats resulted in only gmail.com being labelled.
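
Since only the last piece of the split address ends up tagged, the mismatch seems to come from how the text is tokenized at prediction time. A quick check (a minimal sketch reusing the Sentence class from the test above):

from flair.data import Sentence

# the default tokenizer splits the address into several tokens,
# which no longer match the single whitespace-separated training token
sentence = Sentence("Mon adresse mail est naomi@gmail.com")
print([token.text for token in sentence])
# per the tagged output above: ['Mon', 'adresse', 'mail', 'est', 'naomi', '@', 'gmail.com']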

My training file:

from flair.data import MultiCorpus
from flair.datasets import ColumnCorpus, WIKINER_FRENCH
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings

def initialize_training(max_epochs, patience):
    # define the columns of the corpus
    columns = {0: 'text', 1: 'ner'}

    # init a corpus using column format, data folder and the names of the train, dev and test files
    corpus = ColumnCorpus('/root/ner/data', columns)

    multi_corpus = MultiCorpus([WIKINER_FRENCH().downsample(0.01), corpus])

    # tag to predict
    tag_type = 'ner'

    tag_dictionary = multi_corpus.make_tag_dictionary(tag_type=tag_type)
    print(tag_dictionary)

    # initialize each embedding we use
    embedding_types = [

        # classic French word embeddings (FastText)
        WordEmbeddings('fr'),

        # contextual string embeddings, forward
        FlairEmbeddings('fr-forward'),

        # contextual string embeddings, backward
        FlairEmbeddings('fr-backward'),
    ]

    # embedding stack combines the word and Flair embeddings
    embeddings = StackedEmbeddings(embeddings=embedding_types)

    tagger = SequenceTagger(hidden_size=256,
                            embeddings=embeddings,
                            tag_dictionary=tag_dictionary,
                            tag_type=tag_type)

    print('---------- Start training')
    trainer = ModelTrainer(tagger, multi_corpus)
    trainer.train('/root/ner/output',
                  mini_batch_size=32,
                  patience=patience,
                  max_epochs=max_epochs)
    print('------------ Training finished')
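
The script above only defines the function; training is kicked off with a call along these lines (the epoch and patience values here are just placeholders, not necessarily the ones from my actual run):

if __name__ == '__main__':
    # placeholder hyperparameters for illustration
    initialize_training(max_epochs=100, patience=3)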

I am really new to ML; this is actually my first time trying to create a custom model. I don't really know what I should try next. Is the problem coming from my dataset format, my training setup, or somewhere else?

Any help or guidance will be highly appreciated!

codebynao commented 3 years ago

Okay, the solution was very simple...

I just had to specify that I don't want to use the tokeniser:

sentence = Sentence(text, use_tokenizer=False)

Now it works as expected:

{
  "text": "Mon adresse mail est naomi@gmail.com",
  "labels": [],
  "entities": [
    {
      "text": "naomi@gmail.com",
      "start_pos": 21,
      "end_pos": 36,
      "labels": [
        {
          "_value": "EMAIL",
          "_score": 0.9972683191299438
        }
      ]
    }
  ]
}
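
Putting it together, the full re-test looks roughly like this (the JSON above is in the shape Sentence.to_dict(tag_type='ner') produces):

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('./model/best-model.pt')

# use_tokenizer=False splits the text on whitespace only, so
# "naomi@gmail.com" stays a single token, matching the training data
sentence = Sentence("Mon adresse mail est naomi@gmail.com", use_tokenizer=False)
tagger.predict(sentence)
print(sentence.to_dict(tag_type='ner'))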