explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Predefined entities not detected after adding custom entities #1382

Closed alekaizer closed 6 years ago

alekaizer commented 7 years ago

I followed the tutorial on adding entity types on the spaCy website. After adding custom entities, when I run nlp on some test data, the predefined entities (ORG, DATE, etc.) are no longer detected, and the text they cover is labelled with the new entity types instead.

Here is the code:

from __future__ import unicode_literals, print_function

import random
from pathlib import Path

import spacy
from cucco import Cucco
from spacy.gold import GoldParse

normalizr = Cucco()
normalizations = ['remove_extra_white_spaces', 'replace_punctuation',
              'replace_symbols', 'remove_accent_marks']

def train_ner(nlp, train_data, output_dir):
    # Add new words to vocab
    for raw_text, _ in train_data:
        doc = nlp.make_doc(raw_text)
        for word in doc:
            _ = nlp.vocab[word.orth]
    random.seed(0)
    nlp.entity.model.learn_rate = 0.001
    for itn in range(10000):
        random.shuffle(train_data)
        loss = 0.
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            nlp.tagger(doc)
            loss += nlp.entity.update(doc, gold, drop=0.9)
        if loss == 0:
            break
    nlp.end_training()
    if output_dir:
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.save_to_directory(output_dir)

test_data = [
    "I need to go to Facebook, and Google tomorrow",
    "there is something on the 23/10/2018",
    "Georges Bush is the 43th president of United States of America"
]

def main(model_name="en", output_directory=None):
    print("Loading initial model", model_name)
    nlp = spacy.load(model_name)
    if output_directory is not None:
        output_directory = Path(output_directory)

    train_data = [('hey', [(0, 3, 'GREETINGS')]),
              ('how are you', [(0, 11, 'GREETINGS')]),
              ('how you doing', [(0, 13, 'GREETINGS')]),
              ('howdy', [(0, 5, 'GREETINGS')]),
              ('wuddup', [(0, 6, 'GREETINGS')]),
              ('Hi, how are you', [(0, 15, 'GREETINGS')]),
              ('Hey, how are you doing', [(0, 22, 'GREETINGS')]),
              ('howdy', [(0, 5, 'GREETINGS')]),
              ('hello, good morning', [(0, 19, 'GREETINGS')]),
              ('howdy', [(0, 5, 'GREETINGS')]),
              ("what's up", [(0, 9, 'GREETINGS')]),
              ('hey there', [(0, 9, 'GREETINGS')]),
              ('hello', [(0, 5, 'GREETINGS')]),
              ('hi', [(0, 2, 'GREETINGS')]),
              ('good morning', [(0, 12, 'GREETINGS')]),
              ('good evening', [(0, 12, 'GREETINGS')]),
              ('dear sir', [(0, 8, 'GREETINGS')]),
              ('yes', [(0, 3, 'AFFIRM')]),
              ('yep', [(0, 3, 'AFFIRM')]),
              ('yeah', [(0, 4, 'AFFIRM')]),
              ('indeed', [(0, 6, 'AFFIRM')]),
              ("that's right", [(0, 12, 'AFFIRM')]),
              ('ok', [(0, 2, 'AFFIRM')]),
              ('great', [(0, 5, 'AFFIRM')]),
              ('right, thank you', [(0, 16, 'AFFIRM')]),
              ('thank you', [(0, 9, 'AFFIRM')]),
              ('thank you for your help', [(0, 23, 'AFFIRM')]),
              ('thanks for your help', [(0, 20, 'AFFIRM')]),
              ('thanx for your help', [(0, 19, 'AFFIRM')]),
              ('correct', [(0, 7, 'AFFIRM')]),
              ('great choice', [(0, 12, 'AFFIRM')]),
              ('sounds really good', [(0, 18, 'AFFIRM')]),
              ('bye', [(0, 3, 'GOODBYE')]),
              ('GOODBYE', [(0, 7, 'GOODBYE')]),
              ('good bye', [(0, 8, 'GOODBYE')]),
              ('stop', [(0, 4, 'GOODBYE')]),
              ('end', [(0, 3, 'GOODBYE')]),
              ('farewell', [(0, 8, 'GOODBYE')]),
              ('Bye bye', [(0, 7, 'GOODBYE')]),
              ('have a good one', [(0, 15, 'GOODBYE')]),
              ('arse', [(0, 4, 'CURSE')]),
              ('ass', [(0, 3, 'CURSE')]),
              ('asshole', [(0, 7, 'CURSE')]),
              ('you fucking asshole', [(0, 19, 'CURSE')]),
              ('bastard', [(0, 7, 'CURSE')]),
              ('you fucking bastard', [(0, 19, 'CURSE')]),
              ('bitch', [(0, 5, 'CURSE')]),
              ('hey bitch', [(0, 9, 'CURSE')]),
              ('you dumb bitch', [(0, 14, 'CURSE')]),
              ('crap', [(0, 4, 'CURSE')]),
              ('oh crap', [(0, 7, 'CURSE')]),
              ('cockfucker', [(0, 10, 'CURSE')]),
              ('cocksucker', [(0, 10, 'CURSE')]),
              ('cunt', [(0, 4, 'CURSE')]), ('damn', [(0, 4, 'CURSE')]),
              ('dammit', [(0, 6, 'CURSE')]), ('fuck', [(0, 4, 'CURSE')]),
              ('fucking', [(0, 7, 'CURSE')]),
              ('goddamn', [(0, 7, 'CURSE')]),
              ('god dammit', [(0, 10, 'CURSE')]),
              ('godsdamn', [(0, 8, 'CURSE')]),
              ('hell no', [(0, 7, 'AFFIRM')]),
              ('hell yeah', [(0, 9, 'AFFIRM')]),
              ('holy shit', [(0, 9, 'CURSE')]),
              ('retard', [(0, 6, 'CURSE')]), ('shitty', [(0, 6, 'CURSE')]),
              ('motherfucker', [(0, 12, 'CURSE')]),
              ('stupid motherfucker', [(0, 19, 'CURSE')]),
              ('retarded motherfucker', [(0, 21, 'CURSE')]),
              ('where is my motherfucking money', [(12, 25, 'CURSE'), (0, 12, 'WHQ')]),
              ('son of a bitch', [(0, 14, 'CURSE')]),
              ('shit', [(0, 4, 'CURSE')]), ('bullshit', [(0, 8, 'CURSE')]),
              ("Where are you located", [(0, 22, 'LOC-Q'), (0, 9, 'WHQ')]),
              ("What is your address", [(0, 21, 'LOC-Q'), (0, 7, 'WHQ')]),
              ("what is your location", [(0, 22, 'LOC-Q'), (0, 7, 'WHQ')]),
              ("where is your offices", [(0, 22, 'LOC-Q'), (0, 8, 'WHQ')]),
              ("who are you", [(0, 3, 'PERSON')]),
              ("who is Morgabn Freeman", [(0, 3, 'PERSON')]),
              ("who the fuck are you", [(0, 3, 'PERSON'), (8, 12, 'CURSE')]),
              ("Where can I find your headquarter",
               [(0, 15, 'LOC-Q'), (0, 11, 'WHQ')])
              ]

    nlp.entity.add_label('GOODBYE')
    nlp.entity.add_label('CURSE')
    nlp.entity.add_label('AFFIRM')
    nlp.entity.add_label('GREETINGS')
    nlp.entity.add_label('LOC-Q')
    nlp.entity.add_label('WHQ')
    nlp.entity.add_label('PERSON')
    train_ner(nlp, train_data, output_directory)

    # Test that the entity is recognized
    for text in test_data:
        doc = nlp(normalizr.normalize(text, normalizations))
        print("Ents in '{0}':".format(text))
        for ent in doc.ents:
            print(ent.label_, ent.text)
        if output_directory:
            print("Loading from", output_directory)
            nlp2 = spacy.load('en', path=output_directory)
            nlp2.entity.add_label('ANIMAL')
            doc2 = nlp2('Do you like horses?')
            for ent in doc2.ents:
                print(ent.label_, ent.text)
        print()

if __name__ == '__main__':
    import plac
    plac.call(main)

And here is the output:

Loading initial model en
Ents in 'I need to go to Facebook, and Google tomorrow':
CURSE I need to

Ents in 'there is something on the 23/10/2018':
WHQ there is
CURSE 23102018

Ents in 'Georges Bush is the 43th president of United States of America':
GOODBYE Georges
CURSE 43th

Process finished with exit code 0

Aren't the new entities supposed to be added alongside the existing ones, rather than replacing them?
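
A side note on the training data above: a few annotations end past the end of their text (for example, "Where are you located" has 21 characters but is annotated up to offset 22), and some examples carry two overlapping spans. A one-tag-per-token scheme cannot represent both of two overlapping spans, so these are worth checking before training. The sketch below is a plain-Python sanity check; the function name find_span_problems is made up for illustration and is not part of spaCy.

```python
def find_span_problems(train_data):
    """Return (text, message) pairs for suspicious entity annotations.

    train_data uses the same shape as in the post: a list of
    (text, [(start, end, label), ...]) pairs.
    """
    problems = []
    for text, spans in train_data:
        for start, end, label in spans:
            # Character offsets must satisfy 0 <= start < end <= len(text).
            if not (0 <= start < end <= len(text)):
                problems.append((text, "%s span (%d, %d) is outside the text (length %d)"
                                 % (label, start, end, len(text))))
        # Sort by start offset and flag any span that begins before
        # the previous one has ended.
        ordered = sorted(spans)
        for (s1, e1, l1), (s2, e2, l2) in zip(ordered, ordered[1:]):
            if s2 < e1:
                problems.append((text, "%s and %s spans overlap" % (l1, l2)))
    return problems


sample = [
    ("hey", [(0, 3, "GREETINGS")]),
    ("Where are you located", [(0, 22, "LOC-Q"), (0, 9, "WHQ")]),
]
for text, message in find_span_problems(sample):
    print(repr(text), "->", message)
```

Running this over the full train_data list before calling train_ner would list every example that needs its offsets corrected or its labels split into separate examples.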

honnibal commented 7 years ago

See here for what might be going on: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting . The next version of the docs references this more explicitly.

Try running fewer iterations, too: you're probably overfitting on those examples.
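
For readers who want a concrete picture of the pseudo-rehearsal idea from the linked post, here is a minimal sketch: the model's own predictions on unlabeled text, taken before any fine-tuning, are mixed in with the new examples so that updates keep reinforcing the old labels. The names build_rehearsal_data and annotate are hypothetical, and the stand-in annotator below only exists to make the sketch runnable without a model.

```python
import random


def build_rehearsal_data(annotate, raw_texts, new_examples, rehearsal_ratio=3, seed=0):
    """Mix new examples with the existing model's own predictions.

    annotate: callable text -> [(start, end, label), ...]. It stands in for
        the model *before* fine-tuning (hypothetical helper, not a spaCy API).
    raw_texts: unlabeled texts similar to what the model will see later.
    new_examples: (text, spans) pairs for the new labels.
    rehearsal_ratio: rehearsal examples to add per new example.
    """
    rng = random.Random(seed)
    wanted = min(len(raw_texts), rehearsal_ratio * len(new_examples))
    # Label a random subset of the raw texts with the original model.
    rehearsal = [(text, annotate(text)) for text in rng.sample(raw_texts, wanted)]
    mixed = list(new_examples) + rehearsal
    rng.shuffle(mixed)
    return mixed


def fake_annotate(text):
    # Stand-in for the original model: tags a leading "Google" as ORG.
    return [(0, 6, "ORG")] if text.startswith("Google") else []


raw_texts = ["Google is hiring", "I went home", "Google opened an office", "See you soon"]
new_examples = [("hey", [(0, 3, "GREETINGS")]), ("bye", [(0, 3, "GOODBYE")])]
mixed = build_rehearsal_data(fake_annotate, raw_texts, new_examples, rehearsal_ratio=1)
for text, spans in mixed:
    print(text, spans)
```

With a real pipeline, annotate could plausibly be built from doc.ents (each entity exposes start_char, end_char and label_), as long as the predictions are collected before the first update; that wiring is an assumption and is not shown in the thread.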

alekaizer commented 7 years ago

Thanks @honnibal, varying the number of iterations gives different results:

Without adding the new entities, the output is correct:

Ents in 'I need to go to Facebook, and Google tomorrow':
GPE Facebook
ORG Google
DATE tomorrow

Ents in 'there is something on the 23/10/2018':
DATE 23102018

Ents in 'Georges Bush is the 43th president of United States of America':
PERSON Georges Bush
GPE United States of America

Is there a way to work out the right number of iterations? It also seems that adding the new entities breaks some things; for example, GPE detection is incomplete once the entities are added.

honnibal commented 6 years ago

Please see the v2 training docs. Training is much improved in v2, and we've tried to give a lot more guidance about how to make good use of it: https://spacy.io/usage/training
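
On the earlier question about picking the number of iterations: one common general technique, not something this thread or the spaCy docs prescribe here, is early stopping on held-out data. The sketch below assumes two hypothetical callables, train_one_epoch and evaluate, supplied by the caller; neither is a spaCy function. For this use case the held-out set should probably include examples of the original entity types too, so that forgetting shows up in the score.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=3):
    """Stop once the held-out score has not improved for `patience` epochs.

    train_one_epoch: callable that runs one pass over the training data.
    evaluate: callable returning a held-out score (higher is better).
    Returns (best_epoch, best_score); epochs are counted from 1.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = evaluate()
        if score > best_score:
            # In real training, this is where the model would be saved.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score


# Toy run: the held-out score peaks at epoch 3 and then degrades.
scores = iter([0.5, 0.7, 0.8, 0.79, 0.78, 0.77, 0.9])
result = train_with_early_stopping(lambda: None, lambda: next(scores))
print(result)  # (3, 0.8)
```

The loop stops at epoch 6, three epochs after the best score, so the later 0.9 is never reached; a larger patience trades extra training time for a lower chance of stopping too soon.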
