Custom NER model returning empty list on test examples

PythonCoderJd commented 5 years ago

Hello all, I am attempting to train a custom NER model that correctly labels a date range. The date range will not always be in the same place in the sentence. I am training the model on about 20 unique sentences and testing it on another 5 unseen sentences. I believe my examples are similar in structure to the horses example, i.e. the new label can appear anywhere in the sentence. The code I am using is the excellent example code provided in the spaCy docs. My training loss decreases with each iteration and settles a little below 3.0. Yet when I test on my 5 unseen test cases, the model returns an empty list. Could the community advise, or point to a previous issue where a custom NER model doesn't recognize the newly trained entity and returns an empty list? Thanks!

The code I used is below. It is pretty much verbatim from the following page of the spaCy docs:

https://spacy.io/usage/training#example-new-entity-type

Your Environment

Operating System:
Python Version Used: 3.7.3
spaCy Version Used: 2.1.3
Environment Information:

```python
import random
from pathlib import Path

import spacy
from spacy.util import minibatch, compounding

# NOTE: model, output_dir, n_iter, TRAIN_DATA and test_text are not shown in
# this snippet; they are set up as in the spaCy docs example linked above.

LABEL = "DateRange"

# Set up the pipeline and entity recognizer, and train the new entity.
random.seed(7)
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

# Add entity recognizer to model if it's not in the pipeline
# nlp.create_pipe works for built-ins that are registered with spaCy
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
# otherwise, get it, so we can add labels to it
else:
    ner = nlp.get_pipe("ner")

ner.add_label(LABEL)  # add new entity label to entity recognizer
# Adding extraneous labels shouldn't mess anything up
ner.add_label("DateRange")

if model is None:
    optimizer = nlp.begin_training()
else:
    optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):  # only train NER
    sizes = compounding(1.0, 4.0, 2.0)
    # batch up the examples using spaCy's minibatch
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        batches = minibatch(TRAIN_DATA, size=sizes)
        losses = {}
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

# test the trained model
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
    print(ent.label_, ent.text)

# save model to output directory
if output_dir is not None:
    output_dir = Path(output_dir)
    if not output_dir.exists():
        output_dir.mkdir()
    nlp.to_disk(output_dir)
    print("Saved model to", output_dir)

    # test the saved model
    print("Loading from", output_dir)
    nlp2 = spacy.load(output_dir)
```

ines commented 5 years ago

Hi! The thing is, if you're training with such a small number of examples, it's really difficult to see conclusive results. It happens to work okay in the horses example, but you usually want to have at least a few hundred examples, if not a thousand or more.

It also depends on how easy your examples are to learn – and if you're starting off with an existing model, whether they clash with something that's already present in the model. If you're trying to add an entity type to an existing model and you've labelled date ranges, those will likely clash with the pre-trained types DATE or ORDINAL. The pre-trained model was trained on ~2 million words and all its weights are based on the presence of the existing types. If you now try to teach it that those tokens that it previously predicted as DATE are suddenly part of DATE_RANGE, you'll need a lot of examples. And you might still get confusing results, because you're constantly fighting side-effects of the existing weights.

If this is the case, you might want to consider training from scratch – and in any case, you definitely want to be using more examples. Alternatively, maybe there's a way to take advantage of the existing categories in the pre-trained model, improve them on your data and write rules to capture the full phrases you're interested in. You can find more details on this approach here: https://spacy.io/usage/rule-based-matching#models-rules
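As a rough illustration of the "model plus rules" idea, here is a minimal sketch that merges two adjacent DATE entities into one range when only a connector such as "to" or "-" sits between them. It works on plain `(start_char, end_char, label)` tuples, which could be built from `doc.ents` using `ent.start_char`, `ent.end_char` and `ent.label_`. The function name `merge_date_ranges` and the connector list are hypothetical, and a pre-trained model may well tag a phrase differently than this assumes (for instance, as a single DATE span).

```python
import re

# Hypothetical list of connectors assumed to join two dates into a range.
CONNECTOR_RE = re.compile(r"^\s*(?:-|to|through|until)\s*$", re.IGNORECASE)


def merge_date_ranges(text, ents):
    """Merge adjacent DATE spans that are separated only by a connector.

    ents: list of (start_char, end_char, label) tuples sorted by start_char.
    Returns a list of (start_char, end_char) tuples, one per date range.
    """
    ranges = []
    i = 0
    while i < len(ents) - 1:
        start1, end1, label1 = ents[i]
        start2, end2, label2 = ents[i + 1]
        gap = text[end1:start2]  # text between the two entities
        if label1 == "DATE" and label2 == "DATE" and CONNECTOR_RE.match(gap):
            ranges.append((start1, end2))
            i += 2  # both spans are consumed by this range
        else:
            i += 1
    return ranges
```

A single DATE with no partner is left alone, so this only adds the range on top of whatever the model already predicts.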

abinpaul1 commented 5 years ago

I am no expert, but if your date ranges all follow a similar format, you can try matching them with regular expressions. It would also help to post a few examples of your training data here so people can give you better ideas.
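For example, here is a minimal sketch of the regular expression route. The date format is an assumption, since no training examples were posted: it supposes two numeric dates like `01/05/2019` joined by "-", "to" or "through". The names `DATE`, `DATE_RANGE_RE` and `find_date_ranges` are made up for illustration.

```python
import re

# Assumed format: numeric dates such as 1/5/2019 or 01/05/2019.
DATE = r"\d{1,2}/\d{1,2}/\d{4}"

# Two dates joined by "-", "to" or "through" (assumed connectors).
DATE_RANGE_RE = re.compile(
    r"\b(?P<start>" + DATE + r")\s*(?:-|to|through)\s*(?P<end>" + DATE + r")\b"
)


def find_date_ranges(text):
    """Return (start_char, end_char, matched_text) for each date range found."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_RANGE_RE.finditer(text)]
```

The character offsets are returned alongside the text so the matches could later be turned into entity annotations if needed.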

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

explosion / spaCy

Custom NER model returning empty list on test examples #3717