Closed Sebadst closed 8 years ago
Good question. That won't work because the match functionality here [1] doesn't support a regex pattern. An example of how regexes work in spacy is here [2].
@honnibal this is something I'd love to integrate as well, at the matcher level. Ideas on easy ways to do this would be super welcome.
[1] https://github.com/spacy-io/spaCy/blob/a862edc0e66139caf56c6e90ebba6d3f364e35cc/spacy/matcher.pyx#L71
[2] https://github.com/spacy-io/spaCy/blob/cc8bf62208384eedd212f547b70fbf3f3d59eea4/spacy/tokenizer.pyx#L263
It really comes down to adding unlimited quantifiers to the Matcher patterns, I think. Limited quantifiers and option sets can be compiled into the current automaton.
What if we had something like this:
    cdef struct AttrValue:
        attr_id_t attr
        attr_t value
        int quantifier

    cdef struct TokenPattern:
        const AttrValue* spec
        int nr
        int quantifier

    cdef struct Pattern:
        const TokenPattern* spec
        int nr
        int quantifier
Currently, every token consumes one pattern element, so we always know where we are. This will change, so we'll have to have another layer.
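To make the "we no longer know where we are" point concrete, here is a rough pure-Python sketch of matching with per-token quantifiers. It is not spaCy's implementation: the `match_at` function, the `(predicate, quantifier)` tuple layout, and the `"1"` marker for "exactly one" are all assumptions made for illustration. The point it shows is that once a pattern element may consume zero or many tokens, the matcher has to track the token position and the pattern position separately.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout: each pattern element is (predicate, quantifier), where
# quantifier is "1" (exactly one), "?" (zero or one), "+" (one or more)
# or "*" (zero or more).
TokenPattern = Tuple[Callable[[str], bool], str]


def match_at(tokens: Sequence[str], start: int,
             pattern: Sequence[TokenPattern]) -> List[int]:
    """Return every end index where `pattern` can finish when begun at `start`."""
    ends = set()

    def step(tok_i: int, pat_i: int) -> None:
        if pat_i == len(pattern):
            ends.add(tok_i)  # whole pattern consumed: record the end position
            return
        pred, quant = pattern[pat_i]
        if quant in ("?", "*"):
            # Optional element: move on without consuming a token.
            step(tok_i, pat_i + 1)
        if tok_i < len(tokens) and pred(tokens[tok_i]):
            # Consume one token and advance to the next pattern element.
            step(tok_i + 1, pat_i + 1)
            if quant in ("+", "*"):
                # Unlimited quantifier: consume the token but stay on this element.
                step(tok_i + 1, pat_i)

    step(start, 0)
    return sorted(ends)


pattern = [(lambda t: t == "foo", "+"), (lambda t: t == "bar", "1")]
print(match_at(["foo", "foo", "bar"], 0, pattern))  # [3]
```

Without quantifiers the token index and the pattern index always advance together, which is why a single position is enough in that case.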
@honnibal could you give a little more context on how the quantifiers and matchers currently work? I'm not sure what you meant by another layer here.
Me neither. Another issue I have, not directly related to this one: when I add patterns to the matcher that can collide (for example both 'New York' and 'York') and my text contains 'New York', it's always the 'York' entity that is detected, not the longest one as I would like.
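One possible workaround for the overlap problem is to post-filter the matches and keep the longest one wherever spans collide. The sketch below is plain Python and assumes matches have already been collected as `(label, start, end)` tuples with an exclusive `end`; the `prefer_longest` helper is a hypothetical name, not part of the Matcher.

```python
from typing import List, Tuple

# Assumed match layout: (label, start token index, end token index, exclusive).
Match = Tuple[str, int, int]


def prefer_longest(matches: List[Match]) -> List[Match]:
    """Keep the longest match wherever matches overlap."""
    # Visit longer spans first; ties go to the span that starts earlier.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept: List[Match] = []
    taken = set()  # token indices already claimed by a kept match
    for label, start, end in ordered:
        if any(i in taken for i in range(start, end)):
            continue  # overlaps a longer match that was already kept
        kept.append((label, start, end))
        taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# "I moved to New York": tokens 3-4 are "New York", token 4 alone is "York".
print(prefer_longest([("York", 4, 5), ("New York", 3, 5)]))
# [('New York', 3, 5)]
```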
@viksit: I figured out a way to do this, and have done much of the initial work now. Needs testing though, and likely there are a lot of edge cases I'm not covering. See here: https://github.com/spacy-io/spaCy/commit/58e83fe34bbec777db588f19723668c2ec431f71
The tests should show some of the usage. Basically I've added four quantifiers:
It would probably be good to move the ! operator to a separate label, so that it can be used in combination with the quantifiers. It's really an 'invert match', not a proper quantifier.
Possibly we should replace the whole thing with a more principled automaton, so that we can have brackets etc. I think the power-level here is almost right though. I think quantifiers on single tokens will help a lot, particularly since you could match atomic pieces, merge them, and run another matcher. You can also define a match on the start and end tokens, and then use an acceptor function over the intermediate tokens.
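The last suggestion (match on the start and end tokens, then run an acceptor over whatever lies between them) could look roughly like this. It is a plain-Python sketch, not Matcher code: `match_between`, its parameters, and the `max_gap` limit are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def match_between(tokens: Sequence[str],
                  is_start: Callable[[str], bool],
                  is_end: Callable[[str], bool],
                  accept: Callable[[Sequence[str]], bool],
                  max_gap: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) spans whose intermediate tokens pass `accept`."""
    spans = []
    for i, tok in enumerate(tokens):
        if not is_start(tok):
            continue
        # Try every end token with at most `max_gap` tokens in between.
        for j in range(i + 1, min(len(tokens), i + 2 + max_gap)):
            if is_end(tokens[j]) and accept(tokens[i + 1:j]):
                spans.append((i, j + 1))  # end index is exclusive
    return spans


tokens = "this is a really wild cat".split()
spans = match_between(
    tokens,
    is_start=lambda t: t == "this",
    is_end=lambda t: t == "cat",
    accept=lambda middle: all(t.isalpha() for t in middle),
)
print(spans)  # [(0, 6)]
```

The acceptor sees the whole intermediate slice at once, so it can express conditions that a per-token quantifier cannot, such as "at most one adjective".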
@honnibal I like that idea a lot - it's quite elegant! Are you planning to integrate this soon?
Experimental support for this is in 1.0. It needs more testing so it's not advertised in the docs yet, but it's there to play with.
@honnibal Where do those quantifiers go in the pattern syntax? I can't find it in the docs.
I tried this but it was a wild guess:
    adjn_matcher.add(
        'adjn pl', pl_pat_name, {},
        [[{'dep': 'amod'}, {'dep': 'compound', 'quantifier': '*'}, {'tag': 'NNS'}]])
ditto to @adam-ra 's comment
[Edit] Oh Ok:
matcher.add("foofoobar", [{LOWER: 'foo', 'OP' : '+'}, {LOWER: 'bar'}])
Yeah?
Yes, that's correct.
How do I handle these kinds of cases using matcher? E.g.
"AI-123" -> ["AI", "-",
I tried
    matcher.add_pattern(entity_key,
        [
            {spacy.attrs.LOWER: "ai"},
            {spacy.attrs.LOWER: '-', 'OP': '?'},
            {spacy.attrs.LIKE_NUM: True}
        ],
        label=label)
Is this feature working? I'm looking for wildcard logic, for example matching "this is a cat", "this is a wild cat" and "this is a really wild cat". I tried it with IS_ALPHA and OP, but it doesn't find a match:
matcher.add_pattern("cat", [{LOWER: "this", DEP:"nsubj"}, {LEMMA: "be"}, {IS_ALPHA: True,'OP': '*' },{DEP:"attr"}])
There's an open issue about the use of the '*' operator in patterns where there's an ambiguity.
I tried to work around it partially by using the + operator (1 or more), but that doesn't work either:
    text = u"cats not our dogs"
    matcher.add_pattern("my regex", [{LOWER: "cats"}, {IS_ALPHA: True, 'OP': '+'}, {LOWER: "dogs"}])
Hello,
I wanted to add accepted patterns for some entities like DATE using regular expressions. I followed the issue in https://github.com/spacy-io/spaCy/issues/475 to add simple words and everything works well, but how can I do it with regex? I tried this for example, but it does not work:
    rel_day = r"\d+[/-]\d+[/-]\d+ \d+:\d+:\d+.\d+"
    reg3 = re.compile(rel_day, re.IGNORECASE)
nlp.matcher.add("DATE_CUSTOM_ID","DATE",{},[[{ORTH:reg3}]])
I have the following error:
    File "spacy/matcher.pyx", line 192, in spacy.matcher.Matcher.add (spacy/matcher.cpp:6633)
      self.patterns.push_back(init_pattern(self.mem, spec, etype))
    File "spacy/matcher.pyx", line 80, in spacy.matcher.init_pattern (spacy/matcher.cpp:3776)
      pattern[i].spec[j].value = value
    TypeError: an integer is required
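The traceback shows the pattern value being assigned to an integer field, which is why a compiled regex object is rejected there. One way around it, sketched below under stated assumptions, is to run the regex over the raw text and then map the character offsets of each hit back to token indices. This is plain Python rather than spaCy code: `tokenize_with_offsets` is a whitespace stand-in for a real tokenizer, `regex_token_spans` is a hypothetical helper, and the literal `.` before the fractional seconds is escaped here on the assumption that it is meant as a decimal point.

```python
import re
from typing import List, Pattern, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    # Whitespace stand-in for a real tokenizer: (token, start_char, end_char).
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def regex_token_spans(text: str, pattern: Pattern[str]) -> List[Tuple[int, int]]:
    """Map each regex hit in `text` to a (start, end) token span, end exclusive."""
    tokens = tokenize_with_offsets(text)
    spans = []
    for m in pattern.finditer(text):
        # Tokens whose character range overlaps the regex match.
        covered = [i for i, (_, s, e) in enumerate(tokens)
                   if s < m.end() and e > m.start()]
        if covered:
            spans.append((covered[0], covered[-1] + 1))
    return spans


DATE_RE = re.compile(r"\d+[/-]\d+[/-]\d+ \d+:\d+:\d+\.\d+")
text = "Logged at 2016-08-01 12:30:45.123 by the server"
print(regex_token_spans(text, DATE_RE))  # [(2, 4)]
```

The resulting token spans can then be labelled as DATE entities by whatever mechanism the pipeline already uses for the simple word patterns.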