explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Extremely High Losses when Adding New Entity Label to NER #3789

Closed: zakraicik closed this issue 5 years ago

zakraicik commented 5 years ago

This issue is similar to #2783, but the resolution there isn't clear to me.

I have roughly 30k documents that I am using to update the large English spaCy model and, at the same time, to add a new entity label to the NER component. Each document has been fully annotated with the existing entities identified by the large English model as well as the new entity label I am trying to add. The documents are very long, and any individual document contains many entities (~25).

The entity I am adding is birthday, so as part of dataset creation I have to override the label for some entities that spaCy tags as dates but that are actually birthdays.
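
To make the override concrete, here is a minimal sketch with hypothetical character offsets:

# spaCy's model tags the span as DATE, but the manual annotation
# says it is a birthday, so the manual label replaces the model's
spacy_offset = (112, 126, 'DATE')       # e.g. "July 28th 1991"
manual_offset = (112, 126, 'BIRTHDAY')  # same characters, new label
final_offset = manual_offset            # the manual annotation wins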

After training, the performance metrics seem okay (ents_r = 86). Entities tagged on the validation set make sense and the results look reasonable.

However, the losses during training are very high (in the millions). What is causing this? I am only adding a single label, and it does not appear very often. I attached some code snippets below.
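
For reference, the snippets below assume roughly this setup; names like LabelType (a dict mapping each annotated string to its entity label), Model, OutputPath, and evaluate are defined elsewhere in my script:

import re
import random
from pathlib import Path
from time import time

import spacy
from spacy.util import minibatch, compounding
from tqdm import tqdm

nlp = spacy.load('en_core_web_lg')  # the large English model being updated
LABEL = 'BIRTHDAY'                  # the new entity label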

Annotating existing entities using spaCy

def SpacyEnts(df):
    """Collect (start_char, end_char, label) offsets for entities tagged by the pretrained model."""
    SpacyOffsets = []

    # Only NER is needed here, so disable the other pipes for speed
    DisabledPipes = ['parser', 'tagger']
    with nlp.disable_pipes(*DisabledPipes):
        for doc in tqdm(nlp.pipe(df['clean_txt'].tolist())):
            Entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
            SpacyOffsets.append(Entities)

    return SpacyOffsets

Annotating entities tagged manually

def TrainingDataEnts(df, labels):
    """Collect offsets for the manually annotated entities, keeping only spans that align with token boundaries."""
    TrainingOffsets = []

    z = 0  # count of annotations dropped because they don't align with tokens
    for i, text in tqdm(enumerate(df['clean_txt'].tolist())):

        RelevantLabels = labels[df.iloc[i]['id']]
        doc = nlp(text)

        Matches = []
        for label in RelevantLabels:
            regex = re.compile(r'\b' + re.escape(label) + r'\b')
            for match in re.finditer(regex, text):
                # char_span is used as an alignment check: it returns None
                # if the offsets don't fall on token boundaries
                span = doc.char_span(match.start(), match.end(), label=LabelType[label])
                if span:
                    Matches.append((match.start(), match.end(), LabelType[label]))
                else:
                    z += 1

        TrainingOffsets.append(Matches)

    print("%s labels removed because they are not valid spaCy spans" % z)

    return TrainingOffsets
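
Worth noting: doc.char_span returns None whenever the character offsets don't line up with token boundaries, which is what the z counter above is tracking. A minimal illustration (made-up text):

doc = nlp('Born on July 28th 1991.')
print(doc.char_span(8, 22, label='BIRTHDAY'))  # aligned -> Span for 'July 28th 1991'
print(doc.char_span(9, 22, label='BIRTHDAY'))  # starts mid-token -> None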

Combining character offsets

def CombineEntities(SpacyOffsets, TrainingOffsets, df):
    """Merge the model-tagged and manually tagged offsets per document, collapsing overlapping spans."""
    FinalOffsets = {}

    for i in tqdm(range(len(SpacyOffsets))):

        RemainingOffsets = []

        CombinedOffsets = SpacyOffsets[i] + TrainingOffsets[i]
        SortedByLowerBound = sorted(CombinedOffsets, key=lambda tup: tup[0])

        # Standard interval merge: one pass over the spans sorted by start
        for Higher in SortedByLowerBound:
            if not RemainingOffsets:
                RemainingOffsets.append(Higher)
            else:
                Lower = RemainingOffsets[-1]
                if Higher[0] <= Lower[1]:
                    # Overlapping spans collapse into one; the later span's
                    # label wins, so manual labels override the model's
                    UpperBound = max(Lower[1], Higher[1])
                    RemainingOffsets[-1] = (Lower[0], UpperBound, Higher[2])
                else:
                    RemainingOffsets.append(Higher)

        FinalOffsets[df.iloc[i]['id']] = RemainingOffsets

    return FinalOffsets
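
As a sanity check on the merge, a tiny hypothetical example where a model-tagged DATE overlaps a manual BIRTHDAY:

import pandas as pd

df_demo = pd.DataFrame({'id': ['doc1'], 'clean_txt': ['...']})
SpacyOffsets = [[(112, 126, 'DATE')]]
TrainingOffsets = [[(112, 126, 'BIRTHDAY')]]
print(CombineEntities(SpacyOffsets, TrainingOffsets, df_demo))
# {'doc1': [(112, 126, 'BIRTHDAY')]} -- the overlapping spans collapse
# into one, and the later (manual) label wins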

Create training data

TRAIN_DATA = [
    (df.iloc[i]['clean_txt'], {'entities': FinalOffsets[df.iloc[i]['id']]})
    for i in tqdm(range(len(df)))
]
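
Each entry ends up in spaCy's (text, annotations) offset format, e.g. (hypothetical text and offsets):

('My birthday is July 28th 1991.',
 {'entities': [(15, 29, 'BIRTHDAY')]})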

Train Model

def main(model=None, new_model_name=None, output_dir=None, n_iter=30):

    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer

    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 5.0, 1.000005)
        # batch up the examples using spaCy's minibatch
        for itn in tqdm(range(n_iter)):
            print('Training Model: Iteration '+ str(itn))
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in tqdm(batches):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)

            scores = evaluate(nlp, TRAIN_DATA)  # evaluate() is a scorer helper defined elsewhere
            print("Scores", scores)
            print("Losses", losses)

    # test the trained model
    test_text = "My name is Zak Raici, I was born july 28th 1991"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

StartTime = time()

main(Model, Model + 'w_pii_v1', OutputPath, 40)

EndTime = time()

print('Model took %s minutes to train' % int((EndTime - StartTime)/60))

I attached a screenshot of the performance metrics I am seeing after the first iteration. They improve in later iterations, eventually reaching an ents_r of 86, but the loss remains huge. How is this possible?

[Screenshot: performance metrics after the first training iteration]
honnibal commented 5 years ago

Well, the model starts out dead certain that the entities you're tagging as BIRTHDAY are DATE entities. You have to unwind that knowledge to retype them, and so the losses are very high. I think this is the explanation anyway -- maybe something else is going on.

Have you considered just adding a second tagging model that bootstraps off the first one and retypes the DATE entities? It might make the problem easier to learn.
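
For instance, the second model could start out as a simple component added after the existing NER that retypes DATE entities from context. A rough rule-based sketch of the idea (the trigger words are hypothetical, and a learned classifier over the DATE spans would replace the keyword check):

from spacy.tokens import Span

def retype_birthdays(doc):
    # Retype DATE entities as BIRTHDAY when nearby context suggests one.
    # The trigger words are hypothetical placeholders; a trained classifier
    # over each DATE span would replace this keyword check.
    TRIGGERS = {'born', 'birthday', 'birth'}
    new_ents = []
    for ent in doc.ents:
        window = doc[max(ent.start - 5, 0):ent.start]  # a few tokens of left context
        if ent.label_ == 'DATE' and any(t.lower_ in TRIGGERS for t in window):
            new_ents.append(Span(doc, ent.start, ent.end, label='BIRTHDAY'))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp.add_pipe(retype_birthdays, after='ner')  # spaCy v2 pipeline API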

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.