Closed ChristosChristofidis closed 6 years ago
The Aho-Corasick algorithm allows you to find all occurrences of words from a list. So I think it may help in your problem.
However, I'm pretty sure it won't solve problem of ambiguous strings. What about: "thereare..." -> "there are..."/"the rear e..."?
You will get positions of all words from your list, but later you need to select one of many valid options. Although locating words is done in linear time, I guess the whole algorithm will have worse complexity.
Thank you for your answer.
That kind of ambiguity will not be a major problem in my case I think . Can it also match substrings instead of just prefixes? Say I search for 'hi' in a list of strings. Can I get back the word "this" or "matching"? Right now If I search 'hi' I will get something like "hi","hill" ,"hip","hixxx" ect
Is something like that possible?
Automaton searches for all words you add to it, in order to recognize "this" and "matching" you must add these words. Here is a sample program:
import ahocorasick
# build automaton
A = ahocorasick.Automaton()
for index, word in enumerate('hi this matching ching'.split()):
A.add_word(word, (index, word))
A.make_automaton()
# search words
for item in A.iter('this matching'):
print(item)
It prints:
(2, (0, 'hi'))
(3, (1, 'this'))
(10, (0, 'hi'))
(12, (2, 'matching'))
(12, (3, 'ching'))
As you see all words were found, even overlapped.
On the same aspect, any built-in option to avoid matching search words appearing as sub-words?
@matanster There's no such option. AFAIK automaton is not able to distinguish a proper word from its sub-words.
Could aho corasick (this module in particular) be used for such task ? I am asking as I have not previously used said algorithm and I didn't find an example showcasing this use.
For Example:
"hellotherehowareyou" --> "hello there how are you" Now with dynamic programing you could do this in O(N^2) . . . can we do better ? aho corasick seems like the way to go?