explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Expanding named entities fails to detect entity type #12489

Status: Closed (closed by andyjessen 1 year ago)

andyjessen commented 1 year ago

The "expand_person_entities" example correctly expands a "PERSON" entity to include a preceding title. However, when the text also mentions a person who has no title, that "PERSON" entity is missing from the pipeline output.

Current

import spacy
from spacy.language import Language
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
@Language.component("expand_person_entities")
def expand_person_entities(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            if prev_token.text in ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."):
                new_ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
                new_ents.append(new_ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# Add the component after the named entity recognizer
nlp.add_pipe("expand_person_entities", after="ner")

doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])

Returns

[('Dr. Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]

Modified text to include an additional person

doc = nlp("Dr. Alex Smith and John Smith chaired the first board meeting of Acme Corp Inc.")

Returns

[('Dr. Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]

Expected

[('Dr. Alex Smith', 'PERSON'), ('John Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]

Which page or section is this issue related to?

https://spacy.io/usage/rule-based-matching

shadeMe commented 1 year ago

Nice catch - thanks for the report! We'll update the example to correctly include PERSON entities that do not have a title.
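
One possible shape for that fix is sketched below. It is an illustration of the control flow only, not the updated docs example: plain Python lists stand in for the spaCy Doc and its entity spans so the snippet runs without a model, and the TITLES name and the function signature are assumptions made for the sketch. The key change is that a PERSON entity with no preceding title falls through to the same branch that keeps every other entity unchanged.

```python
# Stand-in for the component's logic: `tokens` plays the role of the Doc,
# and each entity is a (start, end, label) tuple of token indices
# instead of a spaCy Span. These stand-ins are assumptions for illustration.
TITLES = {"Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms."}


def expand_person_entities(tokens, ents):
    new_ents = []
    for start, end, label in ents:
        # Expand only when a PERSON entity is directly preceded by a title.
        if label == "PERSON" and start != 0 and tokens[start - 1] in TITLES:
            new_ents.append((start - 1, end, label))
        else:
            # Everything else is kept as-is, including PERSON entities
            # without a title (the case the original example dropped).
            new_ents.append((start, end, label))
    return new_ents


tokens = ["Dr.", "Alex", "Smith", "and", "John", "Smith", "chaired", "the",
          "first", "board", "meeting", "of", "Acme", "Corp", "Inc."]
ents = [(1, 3, "PERSON"), (4, 6, "PERSON"), (8, 9, "ORDINAL"), (12, 15, "ORG")]

result = expand_person_entities(tokens, ents)
print([(" ".join(tokens[s:e]), label) for s, e, label in result])
# [('Dr. Alex Smith', 'PERSON'), ('John Smith', 'PERSON'), ('first', 'ORDINAL'), ('Acme Corp Inc.', 'ORG')]
```

In the actual component, the equivalent change would be to append the original ent whenever the preceding token is not a title, while still building Span(doc, ent.start - 1, ent.end, label=ent.label) when it is.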
