chenkovsky / cyac

High performance Trie and Ahocorasick automata (AC automata) Keyword Match & Replace Tool for python. Correct case insensitive implementation!
MIT License

Overlap matches #2

Closed: OblackatO closed this issue 4 years ago

OblackatO commented 4 years ago

I just want to point out that this library does not report overlapping matches: when one keyword occurs inside another matched keyword, only the longer one is returned. Consider the following:

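import cyac

# Two keywords; the second one is a substring of the first.
ac = cyac.AC.build([u"hello@gmail.comhi", u"gmail.com"])
var1 = "gmailhello@gmail.comhiaa"
# Each match is a (keyword id, start, end) triple.
for kw_id, start, end in ac.match(var1):
    print(var1[start:end])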

Outputs: hello@gmail.comhi

It should have output both hello@gmail.comhi and gmail.com

This might also be one of the reasons why this lib is faster than pyahocorasick.
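For reference, here is a minimal pure-Python sketch of the behaviour I would expect: an Aho-Corasick matcher that also reports keywords nested inside longer matches. This is not cyac's implementation and not the patch discussed in this thread; the names build_automaton and match_all are made up for illustration. The only assumption carried over from the snippet above is that a match is a (keyword id, start, end) triple with an exclusive end.

```python
from collections import deque


def build_automaton(keywords):
    """Build goto/fail/output tables for a textbook Aho-Corasick automaton."""
    goto = [{}]   # goto[node][char] -> child node index
    fail = [0]    # fail[node] -> node for the longest proper suffix that is a prefix
    out = [[]]    # out[node] -> ids of all keywords ending at this node

    # Phase 1: insert every keyword into the trie.
    for kw_id, word in enumerate(keywords):
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(kw_id)

    # Phase 2: breadth-first pass to set failure links.
    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit the outputs of the suffix node. This is the step that
            # makes nested keywords (sub-keywords) show up as matches.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def match_all(text, keywords):
    """Return every (keyword id, start, end) match, overlapping ones included."""
    goto, fail, out = build_automaton(keywords)
    node = 0
    results = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for kw_id in out[node]:
            results.append((kw_id, i + 1 - len(keywords[kw_id]), i + 1))
    return results


keywords = ["hello@gmail.comhi", "gmail.com"]
text = "gmailhello@gmail.comhiaa"
for kw_id, start, end in match_all(text, keywords):
    print(text[start:end])
```

Matches are emitted in the order their end positions are reached, so the shorter nested keyword is printed before the longer one that contains it. Skipping the output-inheritance step gives the "longest keyword only" behaviour, which saves a little work per match.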

nppoly commented 4 years ago

@OblackatO Thank you for your feedback! My original intention was to implement a library for keyword extraction, so I didn't need sub-keywords (keywords contained inside a longer match). Since it's important for you, I have committed a patch in https://github.com/nppoly/cyac/pull/3. You can review it; if there are no problems, I will merge it.

OblackatO commented 4 years ago

This issue has been solved in #3, thank you @nppoly. Closing this issue.