explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Named Entity Recognition : get accepted patterns using regex #486

Closed Sebadst closed 8 years ago

Sebadst commented 8 years ago

Hello,

I wanted to add accepted patterns for some entities like DATE using regular expressions. I followed the issue in https://github.com/spacy-io/spaCy/issues/475 to add simple words and everything works well, but how can I do it with regex? I tried this for example, but it does not work:

rel_day = r"\d+[/-]\d+[/-]\d+ \d+:\d+:\d+.\d+"
reg3 = re.compile(rel_day, re.IGNORECASE)

nlp.matcher.add("DATE_CUSTOM_ID","DATE",{},[[{ORTH:reg3}]])

I have the following error:

  File "spacy/matcher.pyx", line 192, in spacy.matcher.Matcher.add (spacy/matcher.cpp:6633)
    self.patterns.push_back(init_pattern(self.mem, spec, etype))
  File "spacy/matcher.pyx", line 80, in spacy.matcher.init_pattern (spacy/matcher.cpp:3776)
    pattern[i].spec[j].value = value
TypeError: an integer is required

viksit commented 8 years ago

Good question. That won't work because the match functionality here [1] doesn't support a regex pattern. An example of how regexes work in spacy is here [2].

@honnibal this is something I'd love to integrate as well, at the matcher level. Ideas on easy ways to do this would be super welcome.

[1] https://github.com/spacy-io/spaCy/blob/a862edc0e66139caf56c6e90ebba6d3f364e35cc/spacy/matcher.pyx#L71
[2] https://github.com/spacy-io/spaCy/blob/cc8bf62208384eedd212f547b70fbf3f3d59eea4/spacy/tokenizer.pyx#L263
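
Because the Matcher compares integer attribute values token by token, a compiled regex cannot be passed as an ORTH value. One possible workaround, sketched here with the standard library only, is to run the regex over the raw text and keep the character offsets of each hit. The helper name `find_date_spans` and the sample sentence are made up for illustration, and the dot before the fractional seconds is escaped, which the pattern in the question does not do. Mapping the offsets back onto spaCy tokens is left out.

```python
import re

# Pattern adapted from the question above. Assumption: the "." before the
# fractional seconds is meant literally, so it is escaped here.
DATE_RE = re.compile(r"\d+[/-]\d+[/-]\d+ \d+:\d+:\d+\.\d+")


def find_date_spans(text):
    """Return (start_char, end_char, label) for every regex hit in text.

    Hypothetical helper, not part of spaCy.
    """
    return [(m.start(), m.end(), "DATE") for m in DATE_RE.finditer(text)]


text = "Logged at 2016-08-12 10:15:30.250 by the server."
spans = find_date_spans(text)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

The character offsets could then be aligned with token boundaries in a separate step.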

honnibal commented 8 years ago

It really comes down to adding unlimited quantifiers to the Matcher patterns, I think. Limited quantifiers and option sets can be compiled into the current automaton.

What if we had something like this:


cdef struct AttrValue:
    attr_id_t attr
    attr_t value
    int quantifier

cdef struct TokenPattern:
    const AttrValue* spec
    int nr
    int quantifier

cdef struct Pattern:
    const TokenPattern* spec
    int nr
    int quantifier

Currently, every token consumes one pattern element, so we always know where we are. This will change, so we'll have to have another layer.

viksit commented 8 years ago

@honnibal could you give a little more context on how the quantifiers and matchers currently work? I'm not sure what you meant by another layer here.

Sebadst commented 8 years ago

Me neither. Another issue I have, not completely related to this one: when I add patterns to the matcher that can collide (for example, both 'New York' and 'York') and my text contains 'New York', it is always the 'York' entity that is detected, not the longest one as I would like.
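
One way to prefer the longest match, independent of how the matcher itself resolves collisions, is to post-filter the matches. The sketch below assumes matches are available as `(label, start, end)` tuples over token indices with an exclusive end. The function name `keep_longest` and the labels are hypothetical. I believe later spaCy releases ship a comparable helper, `spacy.util.filter_spans`, but this version uses plain Python only.

```python
def keep_longest(matches):
    """Drop every match that overlaps a longer one.

    matches: list of (label, start, end) token-index tuples, end exclusive.
    Hypothetical helper, not part of spaCy.
    """
    # Visit the longest matches first; ties go to the earliest start.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, taken = [], set()
    for label, start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((label, start, end))
            taken.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda m: m[1])


# Tokens: ["I", "love", "New", "York"]
matches = [("YORK", 3, 4), ("NEW_YORK", 2, 4)]
result = keep_longest(matches)
print(result)
```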

honnibal commented 8 years ago

@viksit: I figured out a way to do this, and have done much of the initial work now. Needs testing though, and likely there are a lot of edge cases I'm not covering. See here: https://github.com/spacy-io/spaCy/commit/58e83fe34bbec777db588f19723668c2ec431f71

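The tests should show some of the usage. Basically I've added four quantifiers:

! : match exactly 0 times (negation)
? : match 0 or 1 times
* : match 0 or more times
+ : match 1 or more times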

It would probably be good to move the ! operator to a separate label, so that it can be used in combination with the quantifiers --- it's really an 'invert match', not a proper quantifier.

Possibly we should replace the whole thing with a more principled automaton, so that we can have brackets etc. I think the power-level here is almost right though. I think quantifiers on single tokens will help a lot, particularly since you could match atomic pieces, merge them, and run another matcher. You can also define a match on the start and end tokens, and then use an acceptor function over the intermediate tokens.
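
To make the idea of per-token quantifiers concrete, here is a toy matcher in plain Python. It is only a sketch and does not reflect the Cython implementation in the linked commit. The names `expand` and `match_here` are made up. The operator semantics are my assumptions based on this thread: `!` is treated as an invert match that consumes one token, `?` as optional, `*` as zero or more, and `+` as one or more.

```python
def expand(pattern):
    """Rewrite every '+' element as one mandatory element followed by '*'."""
    out = []
    for pred, op in pattern:
        if op == "+":
            out += [(pred, "1"), (pred, "*")]
        else:
            out.append((pred, op))
    return out


def match_here(tokens, pattern, t=0, p=0):
    """Match pattern[p:] against tokens[t:].

    Returns the end token index of the first match found (greedy choices are
    tried first, with backtracking), or None. Expects an expanded pattern.
    """
    if p == len(pattern):
        return t
    pred, op = pattern[p]
    in_range = t < len(tokens)
    hit = in_range and pred(tokens[t])
    options = []  # (next token index, next pattern index), greedy first
    if op == "1" and hit:
        options.append((t + 1, p + 1))
    elif op == "!" and in_range and not hit:
        # Assumed semantics: consume one token that does NOT match.
        options.append((t + 1, p + 1))
    elif op == "?":
        if hit:
            options.append((t + 1, p + 1))
        options.append((t, p + 1))  # skip the optional element
    elif op == "*":
        if hit:
            options.append((t + 1, p))  # stay on this element and repeat it
        options.append((t, p + 1))  # stop repeating
    for next_t, next_p in options:
        end = match_here(tokens, pattern, next_t, next_p)
        if end is not None:
            return end
    return None


def is_word(w):
    return lambda tok: tok.lower() == w


pattern = expand([(is_word("foo"), "+"), (is_word("bar"), "1")])
end = match_here(["foo", "foo", "bar"], pattern)
print(end)
```

Because `*` and `+` can consume a variable number of tokens, the position in the token stream no longer determines the position in the pattern, which is why the sketch tracks both indices.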

viksit commented 8 years ago

@honnibal I like that idea a lot - it's quite elegant! Are you planning to integrate this soon?

honnibal commented 8 years ago

Experimental support for this is in 1.0. It needs more testing so it's not advertised in the docs yet, but it's there to play with.

adam-ra commented 7 years ago

@honnibal Where do those quantifiers go in the pattern syntax? I can't find it in the docs.

I tried this but it was a wild guess:

adjn_matcher.add(
    'adjn pl', pl_pat_name, {},
    [[{'dep': 'amod'}, {'dep': 'compound', 'quantifier': '*'}, {'tag': 'NNS'}]])

ghost commented 7 years ago

Ditto to @adam-ra's comment.

[Edit] Oh Ok:

matcher.add("foofoobar", [{LOWER: 'foo', 'OP' : '+'}, {LOWER: 'bar'}])

Yeah?

honnibal commented 7 years ago

Yes, that's correct.

bhoomit commented 7 years ago

How do I handle these kinds of cases using matcher? E.g.

"AI-123" -> ["AI", "-", ]

I tried

matcher.add_pattern(entity_key,
                    [
                        {spacy.attrs.LOWER: "ai"},
                        {spacy.attrs.LOWER: '-', 'OP': '?'},
                        {spacy.attrs.LIKE_NUM: True}
                    ],
                    label=label)

eyal13579 commented 7 years ago

Is this feature working? I'm looking for wildcard logic, for example matching "this is a cat", "this is a wild cat" and "this is a really wild cat". I tried it with IS_ALPHA and OP, but it doesn't find a match:

matcher.add_pattern("cat", [{LOWER: "this", DEP:"nsubj"}, {LEMMA: "be"}, {IS_ALPHA: True,'OP': '*' },{DEP:"attr"}])

honnibal commented 7 years ago

There's an open issue about the use of the '*' operator in patterns where there's an ambiguity.

eyal13579 commented 7 years ago

I tried to work around it partially by using the + operator (1 or more), but that doesn't work either...

text = u"cats not our dogs"

matcher.add_pattern("my regex", [{LOWER: "cats"}, {IS_ALPHA: True, 'OP': '+'}, {LOWER: "dogs"}])
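
One way an ambiguity like this can arise is sketched below. I am not claiming this is what spaCy does internally; it is only a toy model in plain Python with a made-up function name, `greedy_match`. In "cats not our dogs" the token "dogs" satisfies both the IS_ALPHA wildcard and the final literal. A matcher that lets `+` grab every token it can, and never gives one back, has nothing left for the last element.

```python
def greedy_match(tokens, pattern):
    """Match from token 0 without backtracking.

    A '+' element consumes every consecutive token that satisfies it.
    Returns the end token index, or None if the pattern fails.
    Toy model only; not spaCy's implementation.
    """
    t = 0
    for pred, op in pattern:
        if op == "+":
            start = t
            while t < len(tokens) and pred(tokens[t]):
                t += 1
            if t == start:  # '+' needs at least one token
                return None
        else:
            if t >= len(tokens) or not pred(tokens[t]):
                return None
            t += 1
    return t


def is_word(w):
    return lambda tok: tok.lower() == w


ambiguous = [(is_word("cats"), "1"), (str.isalpha, "+"), (is_word("dogs"), "1")]
unambiguous = [(is_word("cats"), "1"), (str.isalpha, "+"), (str.isdigit, "1")]

# "dogs" is swallowed by the wildcard, so the last element has nothing to match.
print(greedy_match("cats not our dogs".split(), ambiguous))
# "7" is not alphabetic, so the wildcard stops in time and the match succeeds.
print(greedy_match("cats not our 7".split(), unambiguous))
```

Resolving the first case requires either backtracking or tracking several candidate states at once.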

lock[bot] commented 6 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.