explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License
30.18k stars 4.4k forks

spaCy Matcher does not match if certain keywords are next to matched tokens. #12747

Closed newlandj closed 1 year ago

newlandj commented 1 year ago

I've found that the spaCy matcher doesn't match when certain keywords are next to a matched token. This video can explain it best, but here's a text-based description. If words such as "none" or "any" appear directly after a matched token, the en_core_web_lg and en_core_web_sm models seem not to match. If that keyword is separated from the matched token by at least one other token, then things do match. I would expect that the presence of these words would not affect the matching process.

How to reproduce the behaviour

See the video above.

Your Environment

kadarakos commented 1 year ago

Hey newlandj,

To be able to precisely understand the cases you are mentioning, may I ask you to please provide a code example? Thank you!

newlandj commented 1 year ago

The code experience matches the video experience. Here's some code I plugged into a Jupyter notebook to test that I'm getting the same result:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

# Add patterns to the matcher

patterns = [
    [
        {"LEMMA": {"IN": ["multi", "multiple"]}},
        {"ORTH": "-", "OP": "?"},
        {"LEMMA": "family", "OP": "?"},
        {"LEMMA": {"IN": ["residence", "residential", "housing", "home"]}},
    ],
    [
        {
            "LOWER": {
                "IN": [
                    "multifamily",
                    "condominiums",
                    "condos",
                    "residences",
                    "lihtc",
                    "duplex",
                    "lofts",
                    "apartments",
                    "apts",
                    "dwelling",
                ]
            }
        }
    ],
    [
        {"LEMMA": {"IN": ["multi", "multiple"]}},
        {"ORTH": "-", "OP": "?"},
        {"LEMMA": "family"},
    ],
    [{"LOWER": "subsidized"}, {"LOWER": {"IN": ["residence", "housing"]}}],
    [
        {"LEMMA": {"IN": ["residential", "apartment", "apts", "condo"]}},
        {"LEMMA": {"IN": ["building", "complex"]}},
    ],
    [
        {"LIKE_NUM": True},
        {"LOWER": "unit", "OP": "?"},
        {"LOWER": "of", "OP": "?"},
        {"LEMMA": {"IN": ["residential", "apartment", "apts"]}},
    ],
]

matcher.add("TEST", patterns)

# Process text.
# This will match: "condo building weird none"
# These will not: "condo building none"  "condo building any"

text = "condo building none"
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp(text)

# Use the matcher to find matches
matches = matcher(doc)

# Iterate over the matches and retrieve the matched tokens
for match_id, start, end in matches:
    matched_tokens = doc[start:end]
    print(matched_tokens.text)

danieldk commented 1 year ago

This is not a bug in the matcher, but ambiguity in the input. In condo building none, building is interpreted as a verb, leading to lemmatization as build:

>>> [(t.pos_, t.lemma_) for t in doc]
[('NOUN', 'condo'), ('VERB', 'build'), ('NOUN', 'none')]

This is a somewhat implausible reading, but the accuracy of models generally decreases when processing out-of-domain/genre text, like these telegram-style descriptions. At any rate, since the matcher rule expects the lemma building, it fails to match build here.

In condo building weird none, building is interpreted as a noun and lemmatized as building:

>>> [(t.pos_, t.lemma_) for t in doc]
[('NOUN', 'condo'), ('NOUN', 'building'), ('ADJ', 'weird'), ('NOUN', 'none')]
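One way to sidestep this tagger dependence entirely is to match on the surface form (`LOWER`) instead of the lemma, enumerating the inflected forms yourself. A minimal sketch (the pattern below is illustrative, not the reporter's full pattern set; a blank pipeline is used so no trained components are involved):

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline has only a tokenizer, so matching cannot be
# affected by POS tagging or lemmatization.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Enumerate surface forms instead of relying on {"LEMMA": "building"}.
pattern = [
    {"LOWER": {"IN": ["residential", "apartment", "apts", "condo", "condos"]}},
    {"LOWER": {"IN": ["building", "buildings", "complex", "complexes"]}},
]
matcher.add("RESIDENTIAL", [pattern])

doc = nlp("condo building none")
spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(spans)  # ['condo building']
```

The trade-off is that you lose the generalization the lemmatizer provides, so each inflected form you care about must be listed explicitly.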

newlandj commented 1 year ago

Fascinating, thank you @danieldk! Is there anything I should consider doing in my setup to control how things lemmatize? For example, I don't think I have any verbs that I'm trying to match against, but I certainly have nouns, adjectives, and maybe other parts of speech.

danieldk commented 1 year ago

You can override the lemmatizer rules with exceptions. If you are working with domain-specific data, you could consider adding fixes for these systematic errors to the exception table. There is an example in this answer on the discussion board:

https://github.com/explosion/spaCy/discussions/9632#discussioncomment-1595509
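As a related option, spaCy's built-in `attribute_ruler` component can pin the lemma of specific tokens via Matcher patterns. This is a generic sketch, not the exception-table recipe from the linked discussion, and the pattern and lemma here are illustrative:

```python
import spacy

# Blank pipeline for illustration. In a trained pipeline, add a second
# attribute_ruler *after* the lemmatizer (e.g. with after="lemmatizer"),
# otherwise the lemmatizer will overwrite the pinned lemma.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Force the lemma "building" for the token "building", regardless of
# whether the tagger reads it as a verb or a noun.
ruler.add(patterns=[[{"LOWER": "building"}]], attrs={"LEMMA": "building"})

doc = nlp("condo building none")
print(doc[1].lemma_)  # building
```

This keeps the rest of the lemmatizer's behavior intact and only overrides the tokens your patterns name explicitly.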

newlandj commented 1 year ago

Sounds good, thanks @danieldk. I'll close out this bug report.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.