chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

What are the `extract.matches` patterns analogous to `constants.POS_REGEX_PATTERNS`? #318

Closed. joefromct closed this issue 3 years ago

joefromct commented 3 years ago

what's wrong?

textacy.constants has POS_REGEX_PATTERNS:

textacy.constants.POS_REGEX_PATTERNS['en']['NP'] 

'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'

What is the equivalent pattern with the newer textacy.extract.matches pattern format?

Should it be in the constants file?

relevant page or section

https://github.com/chartbeat-labs/textacy/blob/cdedd2351bf2a56e8773ec162e08c3188809d486/src/textacy/constants.py#L86

bdewilde commented 3 years ago

Hi @joefromct, thanks for reporting. This constant was originally used by the pos_regex_matches() function, but that function has been deprecated for a while and (on the develop branch) removed entirely. So I should remove the constant as well.

I don't think there's any way to exactly replicate this functionality using spaCy's Matcher, but you can get pretty close:

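import spacy
from spacy.matcher import Matcher

# Assumes an English pipeline is installed; any pipeline with a POS tagger works.
nlp = spacy.load("en_core_web_sm")

# Approximates the 'NP' regex: optional determiner, numbers, adjectives, then nouns.
pattern = [
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'NUM', 'OP': '*'},
    {'POS': 'ADJ', 'OP': '*'},
    {'POS': {'IN': ['NOUN', 'PROPN']}, 'OP': '+'},
]
matcher = Matcher(nlp.vocab)
matcher.add("np", [pattern])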
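For anyone who wants to try a pattern like the one above, here is a minimal sketch of running it over a document. It assumes spaCy v3 and uses a blank English pipeline with hand-assigned POS tags, so no model download is needed. The sentence and its tags are made up for illustration. Matcher operators apply to single tokens rather than groups, so the `(<ADJ> <PUNCT>? <CONJ>?)*` group and the trailing `<PART>?` from the original regex are not reproduced here. Because the Matcher returns every overlapping candidate, the sketch uses `spacy.util.filter_spans` to keep only the longest non-overlapping spans.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.util import filter_spans

# Blank pipeline: tokenizer and vocab only, no trained components.
nlp = spacy.blank("en")

# Illustrative sentence with POS tags assigned by hand (normally a tagger sets these).
words = ["The", "two", "big", "red", "dogs", "chased", "Alice"]
pos = ["DET", "NUM", "ADJ", "ADJ", "NOUN", "VERB", "PROPN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Token-level approximation of the 'NP' POS regex.
pattern = [
    {"POS": "DET", "OP": "?"},
    {"POS": "NUM", "OP": "*"},
    {"POS": "ADJ", "OP": "*"},
    {"POS": {"IN": ["NOUN", "PROPN"]}, "OP": "+"},
]
matcher = Matcher(nlp.vocab)
matcher.add("np", [pattern])

# matcher(doc) yields (match_id, start, end) for every candidate, including sub-spans.
candidates = [doc[start:end] for _, start, end in matcher(doc)]

# Keep the longest non-overlapping spans, returned in document order.
noun_phrases = [span.text for span in filter_spans(candidates)]
print(noun_phrases)
```

With a trained pipeline, `doc = nlp(text)` would replace the hand-built `Doc`, and the rest stays the same.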