explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Custom NER model returning empty list on test examples #3717

Closed PythonCoderJd closed 5 years ago

PythonCoderJd commented 5 years ago

Hello all, I am attempting to train a custom NER model that correctly labels a date range. The date range will not always be in the same place in the sentence. I am training the model on about 20 unique sentences and testing it on another 5 unseen sentences. I believe my examples are similar in structure to the horses example, i.e. the new label can appear anywhere in the sentence. The code I am using is the excellent example code provided in the spaCy docs. My training error decreases with each iteration and settles a little below 3.0. Yet when I test on my 5 unseen test cases, the model returns an empty list. Could the community advise, or point to a previous issue where a custom NER model doesn't recognize the newly trained entity and returns an empty list? Thanks!

The code I used is taken pretty much verbatim from the following page of the spaCy docs:

https://spacy.io/usage/training#example-new-entity-type


ines commented 5 years ago

Hi! The thing is, if you're training with such a small number of examples, it's really difficult to see conclusive results. It happens to work okay in the horses example, but you usually want at least a few hundred examples, if not a thousand or more.

It also depends on how easy your examples are to learn – and if you're starting off with an existing model, whether they clash with something that's already present in the model. If you're trying to add an entity type to an existing model and you've labelled date ranges, those will likely clash with the pre-trained types DATE or ORDINAL. The pre-trained model was trained on ~2 million words and all its weights are based on the presence of the existing types. If you now try to teach it that those tokens that it previously predicted as DATE are suddenly part of DATE_RANGE, you'll need a lot of examples. And you might still get confusing results, because you're constantly fighting side-effects of the existing weights.

If this is the case, you might want to consider training from scratch – and in any case, you definitely want to be using more examples. Alternatively, maybe there's a way to take advantage of the existing categories in the pre-trained model, improve them on your data and write rules to capture the full phrases you're interested in. You can find more details on this approach here: https://spacy.io/usage/rule-based-matching#models-rules
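To make the "existing categories plus rules" idea concrete, here is a minimal sketch of the rule step on its own. It is plain Python and does not call spaCy: it assumes you have already extracted token-level entity spans as `(start, end, label)` tuples (in spaCy these could come from `ent.start`, `ent.end` and `ent.label_` for each `ent` in `doc.ents`). The `DATE_RANGE` label, the `CONNECTORS` set and the `merge_date_ranges` helper are hypothetical names chosen for illustration, not part of any library.

```python
# Hypothetical connector words that may join two dates; adjust to your data.
CONNECTORS = {"-", "to", "through", "until"}


def merge_date_ranges(tokens, ents):
    """Merge DATE + connector + DATE into a single DATE_RANGE span.

    tokens: list of token strings.
    ents:   list of (start, end, label) token spans, end exclusive,
            sorted by start and non-overlapping.
    """
    merged = []
    i = 0
    while i < len(ents):
        start, end, label = ents[i]
        if label == "DATE" and i + 1 < len(ents):
            next_start, next_end, next_label = ents[i + 1]
            gap = tokens[end:next_start]  # tokens between the two entities
            if (
                next_label == "DATE"
                and len(gap) == 1
                and gap[0].lower() in CONNECTORS
            ):
                merged.append((start, next_end, "DATE_RANGE"))
                i += 2  # both DATE spans are consumed by the merge
                continue
        merged.append((start, end, label))
        i += 1
    return merged


tokens = ["Open", "from", "May", "1", "to", "June", "3", "in", "Berlin"]
ents = [(2, 4, "DATE"), (5, 7, "DATE"), (8, 9, "GPE")]
print(merge_date_ranges(tokens, ents))
# [(2, 7, 'DATE_RANGE'), (8, 9, 'GPE')]
```

The statistical model only has to get the individual dates right, which it was already trained to do, and the rule decides what counts as a range. Whether the pre-trained model splits a range into two `DATE` spans or tags the whole phrase as one depends on the model and the text, so it is worth inspecting its predictions on your own sentences first.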

abinpaul1 commented 5 years ago

I am no expert, but if your date ranges follow a similar format, you could try regular expressions to match them easily. If you post a few examples of your training data here, people will be able to give you better ideas.
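A minimal sketch of the regex route, using only Python's standard `re` module. The date format is an assumption, since no training examples were posted in this thread: numeric dates like `01/05/2019` joined by `-`, `to` or `through`. The `find_date_ranges` helper is a hypothetical name.

```python
import re

# Assumed format (not taken from the thread): numeric dates such as
# 01/05/2019, joined by "-", "to" or "through".
DATE = r"\d{1,2}/\d{1,2}/\d{4}"
DATE_RANGE = re.compile(rf"\b({DATE})\s*(?:-|to|through)\s*({DATE})\b")


def find_date_ranges(text):
    """Return (start_char, end_char, matched_text) for each date range."""
    return [(m.start(), m.end(), m.group(0)) for m in DATE_RANGE.finditer(text)]


examples = [
    "The lease runs 01/05/2019 - 03/31/2019 at the current rate.",
    "From 1/5/2019 to 3/31/2019 the office is closed.",
    "Nothing to match here.",
]
for text in examples:
    print(find_date_ranges(text))
```

The character offsets it returns are in the same `(start, end)` form that the entity annotations in the docs example use, so a pattern like this could also help bootstrap more training examples if you later go back to the statistical approach.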

lock[bot] commented 5 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.