explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

multiword entity #2421

Closed petacube closed 6 years ago

petacube commented 6 years ago


I am trying to do named entity recognition, but I am getting strange results with spaCy. For instance, nlp("United States") treats this as two entities even though it's one. How do I make it return one label, in this case "LOC", without explicitly training it? I am using en_core_web_lg, and I see that "United States" is in nlp.vocab.

The second issue I have is about misspellings. If somebody writes "United state", how do I correctly match it with spaCy?

ines commented 6 years ago

For instance, nlp("United States") treats this as two entities even though it's one.

Hmmm, I just tested it with the en_core_web_lg model, and the entity is recognised correctly as one:

import spacy

nlp = spacy.load("en_core_web_lg")

doc = nlp("United States")
for ent in doc.ents:
    print(ent.text, ent.label_)

# United States GPE

"United States" will be two tokens, though, because it consists of two words. But that's expected and fine – the entity itself will still be a Span consisting of two tokens.

Also, keep in mind that spaCy's models are trained on web and newspaper text so they perform best on "real" text. If you just feed it a single entity or a few words, you might see significantly worse results, because there's no context the model can take into account.

How do I make it return one label, in this case "LOC", without explicitly training it?

You can always write to the doc.ents property and add your own entities or replace existing ones (see here for details). If you want to be able to generalise, you do need to update the model, though. In spaCy v2.x this is a lot easier, because you can update an existing model with only a few examples (and won't have to retrain the whole model from scratch).
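
For example, here's a minimal sketch of overriding doc.ents manually (the example sentence, token indices and the choice of LOC are just for illustration, and nlp is assumed to be the loaded en_core_web_lg pipeline):

from spacy.tokens import Span

doc = nlp("I flew to the united states last week")
# Tokens 4 and 5 are "united" and "states", so the Span covers both
new_ent = Span(doc, 4, 6, label=doc.vocab.strings["LOC"])
# Overwrite whatever the model predicted with our own entity
doc.ents = [new_ent]
print([(ent.text, ent.label_) for ent in doc.ents])
# [('united states', 'LOC')]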

One small note about the entity type LOC: The NER annotation scheme of the English models is more fine-grained here and distinguishes between GPE (geopolitical entity, i.e. everything with a government) and LOC (everything else, like an area or "Silicon Valley"). So by that definition, "United States" would be a GPE. It might therefore be easier to add a custom pipeline component that rewrites all GPE labels to LOC instead of retraining the model.
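
A rough sketch of what such a component could look like in spaCy v2.x (the function name gpe_to_loc and the example sentence are just illustrative):

from spacy.tokens import Span

def gpe_to_loc(doc):
    # Rebuild the entity list, relabelling every GPE entity as LOC
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "GPE":
            new_ents.append(Span(doc, ent.start, ent.end, label=doc.vocab.strings["LOC"]))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Run the component right after the entity recognizer
nlp.add_pipe(gpe_to_loc, after="ner")

doc = nlp("She lives in the United States")
print([(ent.text, ent.label_) for ent in doc.ents])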

The second issue I have is about misspellings. If somebody writes "United state", how do I correctly match it with spaCy?

This depends on the misspellings – but if you update the existing models with more examples of misspellings, this could potentially work very well, because both context and entity text are very similar in the correct and the misspelled case.
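
As a very rough sketch of what updating the pretrained NER with a few misspelled examples could look like in spaCy v2.x (the training examples and character offsets are invented for illustration, and depending on your exact v2.x version you may need nlp.resume_training() or the entity recognizer's create_optimizer() to get an optimizer for an existing model):

import random

# Invented examples with the misspelled entity annotated as character offsets
TRAIN_DATA = [
    ("They visited the United state last year", {"entities": [(17, 29, "GPE")]}),
    ("She flew back to the United state", {"entities": [(21, 33, "GPE")]}),
]

# Only update the named entity recognizer, leave the other components alone
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()  # or nlp.get_pipe("ner").create_optimizer() on older v2.x
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print(losses)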

If you're looking for a rule-based approach, here's an example of using regular expressions to find misspellings and spelling variations in your text.
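
As a rough illustration of that kind of rule-based approach (the regular expression and example sentence are made up, and matches that overlap an existing entity are simply skipped):

import re

# A toy pattern that tolerates a few spelling variations of "United States"
pattern = re.compile(r"[Uu]nited\s[Ss]tates?")

doc = nlp("they live in the united state")
new_ents = list(doc.ents)
for match in pattern.finditer(doc.text):
    # char_span returns None if the match doesn't align with token boundaries
    span = doc.char_span(match.start(), match.end(), label=doc.vocab.strings["GPE"])
    # Skip anything that overlaps an entity the model already found
    if span is not None and not any(span.start < e.end and e.start < span.end for e in doc.ents):
        new_ents.append(span)
doc.ents = new_ents
print([(ent.text, ent.label_) for ent in doc.ents])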

petacube commented 6 years ago

Great, thanks for the feedback. Do you think it's feasible to identify the same entity by similarity, e.g. matching doc("America") to doc("United States")?

petacube commented 6 years ago

ipdb> nlp("America").similarity(nlp("United States"))
0.60359869568994184

ipdb> nlp("US").similarity(nlp("United States"))
0.42252059274368647

ipdb> nlp("USA").similarity(nlp("United States"))
0.56863199232855832

ipdb> nlp("USA").similarity(nlp("China"))
0.58009641748563023

ines commented 6 years ago

In this case, probably not. As you can see in the example, the semantic similarity likely won't give you results that are fine-grained enough to really tell that two entities are referring to the same country. If you're only working with country names, you could also try starting off with a simpler, rule-based approach first. After all, there are only so many countries.
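
As a sketch of such a rule-based approach, a PhraseMatcher with a small hand-written alias table could map different surface forms to one canonical country name (the alias table and example sentence are purely illustrative, and this uses the spaCy v2.x PhraseMatcher.add signature):

from spacy.matcher import PhraseMatcher

# A tiny, hand-written alias table (illustrative, not exhaustive)
COUNTRY_ALIASES = {
    "United States": ["United States", "United States of America", "USA", "US", "America"],
    "China": ["China", "People's Republic of China", "PRC"],
}

matcher = PhraseMatcher(nlp.vocab)
for canonical, aliases in COUNTRY_ALIASES.items():
    # spaCy v2.x signature: add(key, on_match, *docs)
    matcher.add(canonical, None, *[nlp.make_doc(alias) for alias in aliases])

doc = nlp("He moved from the USA to China.")
for match_id, start, end in matcher(doc):
    canonical = nlp.vocab.strings[match_id]
    print(doc[start:end].text, "->", canonical)

# USA -> United States
# China -> China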

For a more advanced solution, you might also find this project interesting, which uses spaCy under the hood: https://github.com/openeventdata/mordecai

Full text geoparsing as a Python library. Extract the place names from a piece of text, resolve them to the correct place, and return their coordinates and structured geographic information.

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.