WojciechMula / pyahocorasick

Python module (C extension and plain Python) implementing the Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
952 stars, 125 forks

How to remove matchings that could not align word boundary? #170

Open zwd2080 opened 2 years ago

zwd2080 commented 2 years ago

The second match, (5, 'her'), and the last one, (14, 'she'), do not align with word boundaries. How can I remove them? Or could matching be restricted to whole words?

import ahocorasick

A = ahocorasick.Automaton()
for key in 'he her hers she'.split():
    A.add_word(key, key)  # store the key itself as the value
A.make_automaton()
needle = "he here her shes"
list(A.iter_long(needle))
# [(1, 'he'), (5, 'her'), (10, 'her'), (14, 'she')]

pombredanne commented 1 year ago

Are you saying that you only want whole words matched? If so, you should not add plain character strings as keys, but rather sequences of words converted to numbers. Otherwise the automaton is built on characters and matches characters: it does not know anything about words.
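
A minimal sketch of that idea, assuming whitespace tokenization and a simple dictionary-based vocabulary. The names `build_vocab`, `encode`, and `UNKNOWN` are hypothetical helpers for illustration and are not part of pyahocorasick. The same vocabulary is used to encode both the keys and the haystack, and haystack words that appear in no key fall back to a reserved id:

```python
UNKNOWN = 0  # reserved id for haystack words that are not part of any key (assumption)


def build_vocab(phrases):
    """Assign a distinct positive integer id to every unique word in the keys."""
    vocab = {}
    for phrase in phrases:
        for word in phrase.split():
            # setdefault leaves the id of an already-seen word unchanged
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(text, vocab):
    """Convert whitespace-separated text into a tuple of word ids."""
    return tuple(vocab.get(word, UNKNOWN) for word in text.split())


keys = ["he", "her", "hers", "she"]
vocab = build_vocab(keys)

# Map each encoded key back to the original text so it can be stored as the value.
encoded_keys = {encode(key, vocab): key for key in keys}
encoded_haystack = encode("he here her shes", vocab)
```

As far as the pyahocorasick documentation describes it, tuples like these could be used with an automaton created as `ahocorasick.Automaton(ahocorasick.STORE_ANY, ahocorasick.KEY_SEQUENCE)`, which takes tuples of integers as keys. Reported end indices would then be word positions in the encoded haystack rather than character offsets.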

donatoaz commented 1 year ago

Hi @pombredanne just to make sure I understand: the idea is that each unique word in the needles would map to a distinct int and we'd add these ints as keys and the words as the values?

Do you have a recommendation for this mapping? I ask because the haystack will also need to be mapped, using the same mapping, before iterating over it.

Thanks!

explrA commented 1 year ago

@pombredanne

Can we get more info on this, please? I want exact (whole) word matches and I am not able to understand how to approach it. Any insights would be greatly appreciated.

Thanks
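
For reference, a different and simpler route than word-to-number encoding is to keep the character-based automaton and filter its results afterwards. The sketch below is not something the maintainers proposed in this thread. It assumes each match is an `(end_index, value)` pair whose value is the matched key itself, as in the original question, and it treats any non-alphanumeric character as a word boundary. `keep_whole_words` is a hypothetical helper, not a pyahocorasick function:

```python
def keep_whole_words(haystack, matches):
    """Keep only matches delimited by non-word characters or the string ends.

    Assumes each match is (end_index, value) and value is the matched key,
    so the start offset can be recovered from len(value).
    """
    kept = []
    for end, value in matches:
        start = end - len(value) + 1
        before_ok = start == 0 or not haystack[start - 1].isalnum()
        after_ok = end + 1 == len(haystack) or not haystack[end + 1].isalnum()
        if before_ok and after_ok:
            kept.append((end, value))
    return kept


needle = "he here her shes"
# Matches as quoted in the original question (output of A.iter_long(needle)).
matches = [(1, "he"), (5, "her"), (10, "her"), (14, "she")]
whole = keep_whole_words(needle, matches)
# whole == [(1, 'he'), (10, 'her')]
```

This only works when the stored value has the same length as the key. If other values are stored, the key length would need to be kept alongside them.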