gandersen101 / spaczz

Fuzzy matching and more functionality for spaCy.
MIT License

RegexMatcher: Match Captures? #64

Closed karrtikiyer closed 1 year ago

karrtikiyer commented 3 years ago

I am able to get this regex working with the code below.

import spacy
from spaczz.matcher import RegexMatcher

nlp = spacy.blank("en")
text = "Hello how are you? Proficiency in ETL tools like Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview"
doc = nlp(text)

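matcher = RegexMatcher(nlp.vocab)
# Register one case-insensitive pattern with two capture groups under the "SKILL" label.
matcher.add(
    "SKILL",
    [
        r"""(?i)proficiency in ([\w\s]+) tools like (.*$)"""
    ],
)
matches = matcher(doc)

# Each match carries the label, the start/end token positions, and the fuzzy counts.
for match_id, start, end, counts in matches:
    print(match_id, doc[start:end], counts)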

This prints the matched sentence, as expected. However, I am unsure whether there is a way to access the capture groups ([\w\s]+) and (.*$). Once I have the matched sentence, I would like to access the captured values: "ETL" and "Informatica, Talend, Alteryx and Visualization tools like PowerBi, Tableau and Qlikview". Any suggestions or advice would be appreciated.

gandersen101 commented 3 years ago

Hi @karrtikiyer. Spaczz's regex matching essentially just extends the spaCy docs' own recommendations for applying regex to a full text. You can see the same ideas implemented in spaczz's RegexSearcher.match() method. As of now, spaczz only captures the start and end token positions of each regex match, along with the fuzzy counts. However, it would be relatively simple to capture additional regex information and return that as well.

I have to think about whether there is a consistent way I can work regex capture data into the spaczz API. I am planning on doing a feature overhaul/upgrade on spaczz in the near future so I will keep this request in mind.
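Until something like this exists in spaczz, the capture groups can be read by running the same pattern directly over the text with Python's standard re module. The following is a minimal sketch of that workaround and is not part of the spaczz API; the helper name find_captures is hypothetical. It returns character offsets, not token positions. With spaCy, doc.char_span(start, end) can turn those offsets into a Span, but it returns None when the offsets do not line up with token boundaries.

```python
import re

text = (
    "Hello how are you? Proficiency in ETL tools like Informatica, Talend, "
    "Alteryx and Visualization tools like PowerBi, Tableau and Qlikview"
)
pattern = re.compile(r"(?i)proficiency in ([\w\s]+) tools like (.*$)")


def find_captures(pattern, text):
    """Hypothetical helper: collect the full match and every capture group."""
    results = []
    for m in pattern.finditer(text):
        results.append(
            {
                "text": m.group(0),   # the whole matched string
                "span": m.span(),     # (start, end) character offsets of the match
                # One (value, (start, end)) pair per capture group, in order.
                "groups": [
                    (m.group(i), m.span(i))
                    for i in range(1, pattern.groups + 1)
                ],
            }
        )
    return results


captures = find_captures(pattern, text)
for capture in captures:
    print(capture["text"])
    for value, (start, end) in capture["groups"]:
        print(repr(value), start, end)
```

Because the offsets index into the original string, text[start:end] reproduces each captured value, which makes it straightforward to align the captures with a spaCy Doc built from the same text.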

karrtikiyer commented 3 years ago

Thanks a lot @gandersen101