chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

pos_regex_matches vs matches different behavior #290

Closed · loricelli closed this issue 3 years ago

loricelli commented 4 years ago

I'm trying to find verbs in a sentence with Python for an NLP problem, but the new matches function returns every match, not only the longest one (which pos_regex_matches does).

pattern = r'<VERB>*<ADV>*<VERB>+<PART>*'
verb_pattern = [
    {"POS": "VERB", "OP": "*"},
    {"POS": "ADV", "OP": "*"},
    {"POS": "VERB", "OP": "+"},
    {"POS": "PART", "OP": "*"},
]

t_list_1 = textacy.extract.pos_regex_matches(text, pattern)
t_list_2 = textacy.extract.matches(text, verb_pattern)

As you can see, the two patterns are equivalent; the one passed to matches is just written in the new format. The old pos_regex_matches returns, for example, was celebrating, while the new matches returns both was and was celebrating. Is it intended to work this way?
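For readers less familiar with the two formats: the old API takes a regex-like string over POS tags, while the new one takes a list of spaCy Matcher token patterns. Here's a minimal, self-contained sketch of the comparison, assuming textacy 0.9.x (where both functions coexist) and spaCy's small English model:

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")  # assumption: the small English model
doc = nlp("She was celebrating her birthday.")

# Old format: a regex-like string over coarse POS tags.
pattern = r'<VERB>*<ADV>*<VERB>+<PART>*'
# New format: spaCy Matcher token patterns with OP quantifiers.
verb_pattern = [
    {"POS": "VERB", "OP": "*"},
    {"POS": "ADV", "OP": "*"},
    {"POS": "VERB", "OP": "+"},
    {"POS": "PART", "OP": "*"},
]

# Per the report above: the first yields only the longest span,
# while the second also yields every contained sub-span.
print(list(textacy.extract.pos_regex_matches(doc, pattern)))
print(list(textacy.extract.matches(doc, verb_pattern)))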

bdewilde commented 4 years ago

Hi @loricelli, which version of spaCy are you using? I know they had several issues with this in previous versions, but as far as I know it has since been resolved.
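For reference, the installed versions can be checked directly:

import spacy
import textacy

print("spaCy:", spacy.__version__)
print("textacy:", textacy.__version__)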

loricelli commented 4 years ago

I'm currently working on Google Colab with spaCy 2.1.0 and textacy 0.9.1. Now that you mention it, I tried the same script on my laptop with spaCy 2.2.3 and textacy 0.9.1, and the results were the same. Here's a bit of code I wrote as a test.

d = "You no longer need to be at the station on sundays."
nlp_t = nlp(d)
t_list_0 = textacy.extract.pos_regex_matches(nlp_t, pattern)
t_list_1 = textacy.extract.matches(nlp_t, verb_pattern)

The only element in t_list_0 (pos_regex_matches) is: no longer need to

The elements in t_list_1 (the new matches function) are:

no longer need, longer need, need, no longer need to, longer need to, need to

Am I missing something? For now I'm using the deprecated one despite the warnings.

bdewilde commented 4 years ago

Hey @loricelli, thanks for your patience. I agree with you that the behavior of spaCy's Matcher here is... not quite what we want. However, it looks like this is the "expected" behavior: https://github.com/explosion/spaCy/issues/4627
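To confirm this comes from spaCy itself rather than textacy's wrapper, here's a minimal sketch that runs the same pattern through spaCy's Matcher directly. It assumes the small English model and the spaCy v2 Matcher.add() signature, consistent with the versions in this thread:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumption: the small English model
doc = nlp("You no longer need to be at the station on sundays.")

verb_pattern = [
    {"POS": "VERB", "OP": "*"},
    {"POS": "ADV", "OP": "*"},
    {"POS": "VERB", "OP": "+"},
    {"POS": "PART", "OP": "*"},
]

matcher = Matcher(nlp.vocab)
matcher.add("verb_phrase", None, verb_pattern)  # v2 signature: add(key, on_match, *patterns)

# The Matcher reports every (start, end) combination that satisfies the
# quantifiers, i.e. all sub-matches, not just the longest one.
for match_id, start, end in matcher(doc):
    print(doc[start:end])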

I could probably add a function arg like greedy=True or something that would only take the longest matches and discard all contained matches. Would that work for you?

bdewilde commented 4 years ago

Hi again, here's a straightforward solution that I could add to the function such that it only yields matches that aren't contained entirely within a longer match:

>>> matches_ = list(textacy.extract.matches(doc, verb_pattern))
>>> matches_
[no longer need, longer need, need, no longer need to, longer need to, need to]
>>> seen_se = set()
>>> filtered_matches = []
>>> for match in sorted(matches_, key=len, reverse=True):
...     s, e = match.start, match.end
...     if any(s >= ms and e <= me for ms, me in seen_se):
...         continue
...     else:
...         seen_se.add((s, e))
...         filtered_matches.append(match)
>>> sorted(filtered_matches, key=lambda m: m.start)
[no longer need to]

Is this the behavior you want?
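If it helps in the meantime, here's the same logic packaged as a reusable generator; the helper name longest_matches is mine, not part of textacy's API:

import textacy.extract

def longest_matches(matches):
    # Yield only matches not contained entirely within a longer (or equal) match.
    seen_se = set()
    for match in sorted(matches, key=len, reverse=True):
        s, e = match.start, match.end
        if any(s >= ms and e <= me for ms, me in seen_se):
            continue
        seen_se.add((s, e))
        yield match

filtered = sorted(
    longest_matches(textacy.extract.matches(doc, verb_pattern)),
    key=lambda m: m.start,
)

Note that spaCy's own spacy.util.filter_spans is related but stricter: it drops any overlapping spans (keeping the longest), not just fully contained ones, so it isn't a drop-in replacement for this filter.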