explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Regex doesn't work if less than 3 characters? #13264

Closed by SHxKM 9 months ago

SHxKM commented 9 months ago

How to reproduce the behaviour

Taken and adjusted right from the docs:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)
pattern = [{"TEXT": {"regex": r"4K"}}]
matcher.add("TV_RESOLUTION", [pattern])
doc = nlp("Sony 55 Inch 4K Ultra HD TV X90K Series:BRAVIA XR LED Smart Google TV, Dolby Vision HDR, Exclusive Features for PS 5 XR55X90K-2022 w/HT-A5000 5.1.2ch Dolby Atmos Sound Bar Surround Home Theater")
res = matcher(doc)

# res = []

However, if I add a D after 4K in both the pattern and the text, a match is found. Is there a minimum length restriction?

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)
pattern = [{"TEXT": {"regex": r"4KD"}}]
matcher.add("TV_RESOLUTION", [pattern])
doc = nlp("Sony 55 Inch 4KD Ultra HD TV X90K Series:BRAVIA XR LED Smart Google TV, Dolby Vision HDR, Exclusive Features for PS 5 XR55X90K-2022 w/HT-A5000 5.1.2ch Dolby Atmos Sound Bar Surround Home Theater")
res = matcher(doc)

# res = [[(11960903833032025891, 3, 4)]]

SHxKM commented 9 months ago

This doesn't seem to be exclusive to regex. The following token pattern doesn't match either:

{
  "LOWER": "4k"
}
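A quick way to check why neither pattern matches is to print the tokens the Matcher actually sees. This is only a minimal sketch on a shortened version of the product title (shortening it is my assumption, the rest of the string shouldn't matter). As far as I can tell, the default English tokenizer rules split off a unit-like suffix after a digit, so `4K` would come out as two tokens, `4` and `K`, while `4KD` would stay whole. Worth verifying on your own spaCy version.

```python
import spacy

# Blank English pipeline: tokenizer only, same as in the report above.
nlp = spacy.blank("en")
doc = nlp("Sony 55 Inch 4K Ultra HD TV")

# Collect and print each token with its index to see how "4K" is split.
tokens = [token.text for token in doc]
for i, text in enumerate(tokens):
    print(i, repr(text))
```

If `4K` does not appear as a single entry in the printed list, a one-token pattern such as `{"TEXT": {"regex": r"4K"}}` or `{"LOWER": "4k"}` has nothing to match against.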

SHxKM commented 9 months ago

This Stack Overflow answer is what I was after. The cause is tokenization, not pattern length:

In Spacy 2.3.2, 1 1/2-inch is tokenized as ('1', 'NUM'), ('1/2-inch', 'NUM'), so there will be no match with your current patterns if you do not introduce a new, specific pattern.
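For the `4K` case, two possible workarounds follow from that answer. This is a sketch, not a confirmed fix from the maintainers. It assumes the tokenizer splits `4K` into `4` and `K` (which would explain the empty result), and it uses a shortened version of the product title. Option 1 writes the pattern as two consecutive tokens. Option 2 runs the regex on the raw text and maps the character offsets back to tokens with `Doc.char_span`, which does not depend on how the tokenizer split the string as long as the match lines up with token boundaries.

```python
import re

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Sony 55 Inch 4K Ultra HD TV")

# Option 1: describe "4K" as two consecutive tokens.
# Assumes the tokenizer splits "4K" into "4" and "K".
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("TV_RESOLUTION", [[{"TEXT": "4"}, {"LOWER": "k"}]])
token_matches = [doc[start:end].text for _, start, end in matcher(doc)]

# Option 2: match on the raw text, then map character offsets to a Span.
# Doc.char_span returns None if the offsets don't align with token boundaries.
char_matches = []
for m in re.finditer(r"4K", doc.text):
    span = doc.char_span(m.start(), m.end())
    if span is not None:
        char_matches.append(span.text)

print(token_matches)
print(char_matches)
```

Option 1 keeps everything inside the Matcher, so it combines with other token attributes. Option 2 is closer to the original intent of "find this string anywhere", at the cost of leaving the Matcher API.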

github-actions[bot] commented 8 months ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.