explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

NER training loss is not decreasing #3739

Closed abinpaul1 closed 5 years ago

abinpaul1 commented 5 years ago

I also posted this question in stack overflow. https://stackoverflow.com/questions/56082191/losses-in-ner-training-loop-not-decreasing-in-spacy

I am trying to train a new entity type 'HE INST' to recognize colleges; that is the only new label. I have a long document as raw text. I ran NER on it, saved the entities to TRAIN_DATA, and then added the new entity labels to TRAIN_DATA (replacing the existing entities wherever there was overlap).

The training loss stays constant (~4000 across all 15 texts, and ~300 for a single text). Why does this happen, and how do I train the model properly? I have around 18 texts with 40 annotated new entities. Even after all iterations, the model still doesn't predict the output correctly.

I haven't changed the example script much: I just added en_core_web_lg, the new label, and my TRAIN_DATA.

I am trying to tag institutes in resume (CV) data:

This is one of the texts in my TRAIN_DATA (sorry for the long text); I have around 18 such texts concatenated to form TRAIN_DATA:

TRAIN DATA EXAMPLE.txt
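For reference, spaCy v2 expects each training example as a (text, annotations) tuple, where the entities are (start, end, label) character offsets into the text. A minimal sketch, using the test sentence from the script below (the offsets here are illustrative, not taken from the attached file):

```python
# spaCy v2 training-data format: (text, {"entities": [(start, end, label)]})
# The offsets are character indices into the text; a quick sanity check is
# to slice the text and confirm it matches the intended entity span.
TRAIN_DATA = [
    (
        "B.Tech from Believers Church Caarmel Engineering College CGPA 8.9",
        {"entities": [(12, 56, "HE INST")]},
    ),
]

text, annotations = TRAIN_DATA[0]
start, end, label = annotations["entities"][0]
assert text[start:end] == "Believers Church Caarmel Engineering College"
```

Misaligned offsets are a common cause of the NER losses not moving, so a slice-and-compare check like the one above is worth running over every annotation.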

Also, if I decide to train a new custom NER (for resume entities: Institute, Programming Language, Skill) with a blank 'en' model, won't the parser, tagger, and vocab of that model be really bad? How can I mitigate this?

Your Environment

The script I use to do the training:

from __future__ import unicode_literals, print_function

import ast
import plac
import random
from pathlib import Path
import spacy
import en_core_web_lg
from spacy.util import minibatch, compounding

# new entity label
LABEL = "HE INST"

# TRAIN_DATA is stored as a Python literal in train_data.txt;
# ast.literal_eval is safer than eval for parsing literal data
with open('train_data.txt', 'r') as i_file:
    t_data = i_file.read()
TRAIN_DATA = ast.literal_eval(t_data)

@plac.annotations(
    model=("en_core_web_lg", "option", "m", str),
    new_model_name=("NLP_INST", "option", "nm", str),
    #output_dir=("/home/drbinu/Downloads/NLP_INST", "option", "o", Path),
    n_iter=("30", "option", "n", int),
)
def main(model=None, new_model_name="animal", n_iter=300):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    output_dir="/home/drbinu/Downloads/NLP_INST"
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # Add entity recognizer to model if it's not in the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy

    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner)
    # otherwise, get it, so we can add labels to it
    else:
        ner = nlp.get_pipe("ner")

    ner.add_label(LABEL)  # add new entity label to entity recognizer
    # Adding extraneous labels shouldn't mess anything up

    ner.add_label("VEGETABLE")
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
    move_names = list(ner.move_names)
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    with nlp.disable_pipes(*other_pipes):  # only train NER
        sizes = compounding(1.0, 4.0, 1.001)
        # batch up the examples using spaCy's minibatch
        count = 0
        for itn in range(n_iter):
            count = count + 1
            random.shuffle(TRAIN_DATA)
            batches = minibatch(TRAIN_DATA, size=sizes)
            losses = {}
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
            print(count, " : ", "Losses", losses)

    # test the trained model
    test_text = "B.Tech from Believers Church Caarmel Engineering College CGPA 8.9"
    doc = nlp(test_text)
    print("Entities in '%s'" % test_text)
    for ent in doc.ents:
        print(ent.label_, ent.text)

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.meta["name"] = new_model_name  # rename model
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        # Check the classes have loaded back consistently
        assert nlp2.get_pipe("ner").move_names == move_names
        doc2 = nlp2(test_text)
        for ent in doc2.ents:
            print(ent.label_, ent.text)

if __name__ == "__main__":
    plac.call(main)
honnibal commented 5 years ago

When you're calling the script, are you providing an existing model, or are you training from a blank model?

If you're training from a blank model, then I think the problem would be you're trying to learn labels that aren't in the model. If you're starting from a pretrained model, then I'm less sure what could be wrong. It could be that it's struggling to update from so few texts, maybe try setting the batch size to 1? I'm not sure though, it should learn something.
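The batch-size-1 suggestion amounts to feeding one example per `nlp.update` call instead of using the compounding schedule. A pure-Python sketch of the effect (`minibatches` is a stand-in for `spacy.util.minibatch`, and `train_data` for the script's TRAIN_DATA):

```python
def minibatches(items, size=1):
    """Yield successive batches of `size` items, like spacy.util.minibatch
    does when given a fixed integer size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# illustrative (text, annotations) pairs
train_data = [("text one", {}), ("text two", {}), ("text three", {})]

batches = list(minibatches(train_data, size=1))
# each batch now holds exactly one (text, annotations) pair,
# so every nlp.update call sees a single example
```

In the posted script, this corresponds to replacing the compounding schedule with `batches = minibatch(TRAIN_DATA, size=1)`.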

abinpaul1 commented 5 years ago

OK, so I increased the number of examples to 70, added all the labels, and trained a new model. Now the loss varies between 40 and 70.

If I train a new model, it won't have all the other features of the original model, like the part-of-speech tagger, right? Should I use the new model exclusively to tag these particular entities?

ines commented 5 years ago

If I train a new model, it won't have all the other features of the original model, like the part-of-speech tagger, right? Should I use the new model exclusively to tag these particular entities?

If you start off with a blank model, then it won't. But you can also use a pre-trained model that already has a "tagger" and "parser", and only add a new "ner" component.
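That approach can be sketched as follows. This is a minimal sketch using the spaCy v2-era API the thread is based on (with a v3-style fallback in comments); `add_new_entity_label` is a hypothetical helper name, and `en_core_web_lg` has to be downloaded separately:

```python
import spacy

def add_new_entity_label(nlp, label):
    """Ensure the pipeline has an NER component that knows `label`,
    without touching the other pipes (tagger, parser, ...)."""
    if "ner" not in nlp.pipe_names:
        try:
            # spaCy v3-style API: add a built-in component by name
            ner = nlp.add_pipe("ner")
        except ValueError:
            # spaCy v2-style API: create the component, then add it
            ner = nlp.create_pipe("ner")
            nlp.add_pipe(ner)
    else:
        ner = nlp.get_pipe("ner")
    ner.add_label(label)
    return nlp

# Start from a pretrained pipeline so the tagger and parser are kept:
#   nlp = spacy.load("en_core_web_lg")
#   nlp = add_new_entity_label(nlp, "HE INST")
# Then disable the other pipes during training, as in the script above.
```

Starting from the pretrained model keeps the existing components and vocabulary, which is what the original poster was worried about losing with a blank 'en' model.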

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.