Closed — newlandj closed this issue 1 year ago
Hey newlandj,
To be able to precisely understand the cases you are mentioning, may I ask you to please provide a code example? Thank you!
The behaviour in code matches what the video shows. Here's the code I ran in a Jupyter notebook to confirm I get the same result:
```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

# Add patterns to the matcher
patterns = [
    [
        {"LEMMA": {"IN": ["multi", "multiple"]}},
        {"ORTH": "-", "OP": "?"},
        {"LEMMA": "family", "OP": "?"},
        {"LEMMA": {"IN": ["residence", "residential", "housing", "home"]}},
    ],
    [
        {
            "LOWER": {
                "IN": [
                    "multifamily",
                    "condominiums",
                    "condos",
                    "residences",
                    "lihtc",
                    "duplex",
                    "lofts",
                    "apartments",
                    "apts",
                    "dwelling",
                ]
            }
        }
    ],
    [
        {"LEMMA": {"IN": ["multi", "multiple"]}},
        {"ORTH": "-", "OP": "?"},
        {"LEMMA": "family"},
    ],
    [{"LOWER": "subsidized"}, {"LOWER": {"IN": ["residence", "housing"]}}],
    [
        {"LEMMA": {"IN": ["residential", "apartment", "apts", "condo"]}},
        {"LEMMA": {"IN": ["building", "complex"]}},
    ],
    [
        {"LIKE_NUM": True},
        {"LOWER": "unit", "OP": "?"},
        {"LOWER": "of", "OP": "?"},
        {"LEMMA": {"IN": ["residential", "apartment", "apts"]}},
    ],
]
matcher.add("TEST", patterns)

# Process text.
# This will match: "condo building weird none"
# These will not: "condo building none", "condo building any"
text = "condo building none"
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp(text)

# Use the matcher to find matches
matches = matcher(doc)

# Iterate over the matches and print the matched tokens
for match_id, start, end in matches:
    matched_tokens = doc[start:end]
    print(matched_tokens.text)
```
This is not a bug in the matcher, but ambiguity in the input. In *condo building none*, *building* is interpreted as a verb, leading to the lemma *build*:

```python
>>> [(t.pos_, t.lemma_) for t in doc]
[('NOUN', 'condo'), ('VERB', 'build'), ('NOUN', 'none')]
```

This is a somewhat implausible reading, but the accuracy of models generally decreases when processing out-of-domain/genre text, like these telegram-style descriptions. At any rate, since the matcher rule matches against `building`, it fails to match `build` here.
In *condo building weird none*, *building* is interpreted as a noun, leading to the lemma *building*:

```python
>>> [(t.pos_, t.lemma_) for t in doc]
[('NOUN', 'condo'), ('NOUN', 'building'), ('ADJ', 'weird'), ('NOUN', 'none')]
```
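Since the mismatch comes from POS-dependent lemmatization, a pattern keyed on `LOWER` rather than `LEMMA` sidesteps it entirely. A minimal sketch (illustrative, not from the thread — the pattern name and word list are made up), which works even on a tokenizer-only pipeline because `LOWER` is a surface attribute:

```python
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline: no tagger or lemmatizer is needed,
# so the match cannot be affected by how "building" is tagged.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("CONDO_BLDG", [
    [{"LOWER": "condo"}, {"LOWER": {"IN": ["building", "buildings"]}}],
])

doc = nlp("condo building none")
matches = matcher(doc)
print([doc[start:end].text for _, start, end in matches])  # ['condo building']
```

The trade-off is that you must list the surface forms yourself (e.g. both "building" and "buildings") instead of relying on the lemmatizer to normalize them.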
Fascinating, thank you @danieldk ! Is there anything I should consider doing in my setup to control how things lemmatize? For example, I don't think I have any verbs that I'm trying to match against, but I certainly have nouns, adjectives, and maybe other parts of speech.
You can override the lemmatizer rules with exceptions. If you are working with domain-specific data, you could consider adding these systematic errors to the exception table. There is an example in this answer on the discussion board:
https://github.com/explosion/spaCy/discussions/9632#discussioncomment-1595509
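In the spirit of that answer, here is a minimal sketch of adding such an exception. The `lemma_exc` table name and its layout (lowercase universal POS → {form: [lemma, ...]}) are spaCy's rule-lemmatizer conventions; the override shown is illustrative. With a loaded pipeline you would fetch the live table via `nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")` instead of building one from scratch:

```python
from spacy.lookups import Lookups

# Standalone Lookups for illustration; in a real pipeline, mutate the
# table already attached to the lemmatizer component.
lookups = Lookups()
lookups.add_table("lemma_exc", {"verb": {}})
exc = lookups.get_table("lemma_exc")

# Force "building" to keep its surface form even when tagged VERB,
# so a matcher pattern on {"LEMMA": "building"} still fires.
exc["verb"]["building"] = ["building"]

print(exc["verb"]["building"])  # ['building']
```

Exceptions are consulted before the suffix rules, so this kind of entry is a targeted way to pin down systematic errors on domain-specific vocabulary.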
Sounds good, thanks @danieldk . I'll close out this bug report.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I've found that the spaCy matcher doesn't match when certain keywords are next to a matched token. The video explains it best, but here's a text-based description. If words such as "none" or "any" come directly after a matched token, the en_core_web_lg and _sm models seem not to match. If that keyword is separated from the matched token by at least one other token, then things do match. I would expect that the presence of these words would not affect the matching process.
How to reproduce the behaviour
See the video above.
Your Environment