WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
929 stars 122 forks source link

Maybe a bug? #56

Closed ecolss closed 6 years ago

ecolss commented 7 years ago

Build an AC automaton on top of tens of thousands of keywords, and then I did the following search:

In [28]: m.get('poke')
Out[28]: (4, 'poke')

In [29]: m.get('go')
Out[29]: (2, 'go')

In [30]: m.find_all('pokego', print)
0 (1, 'p')
1 (2, 'po')
1 (1, 'o')
2 (3, 'pok')
2 (2, 'ok')
2 (1, 'k')
3 (4, 'poke')
3 (3, 'oke')
3 (2, 'ke')
3 (1, 'e')

I expect find_all to return match for 'go', but it didn't.

WojciechMula commented 7 years ago

Thank you for the report. Which version of the library do you use? Windows/Linux? Python 2/Python 3?

ecolss commented 7 years ago

Tested on Mac and Linux, Py3.6/2.7.

I found that if the keyword set contains only a few words, it works as expected. But if contains half a million words, it failed to return 'go'.

WojciechMula commented 7 years ago

@ecolss I tried to reproduce the bug (https://github.com/WojciechMula/pyahocorasick/blob/master/unresolved_bugs/bug_56.py), but it works correctly for both py3 and py2.

I loaded all words from an English dictionary and added to the automaton the words and their subwords. Here is output from the script I linked:

Adding words...
742172 words added
Building an automaton...
Test
(0, 'p')
(1, 'po')
(1, 'o')
(2, 'pok')
(2, 'ok')
(2, 'k')
(3, 'poke')
(3, 'oke')
(3, 'ke')
(3, 'e')
(4, 'keg')
(4, 'eg')
(4, 'g')
(5, 'ego')
(5, 'go')
(5, 'o')

Could you please provide the list of words you use? Or a script which produce it?

ecolss commented 7 years ago
In [91]: ac.clear()

In [92]: ac.add_word('poke', 'poke')
Out[92]: True

In [93]: ac.add_word('go', 'go')
Out[93]: True

In [94]: ac.add_word('pokegois', 'pokegois')
Out[94]: True

In [95]: ac.make_automaton()

In [97]: ac.find_all('pokego', fn)
(3, 'poke')
WojciechMula commented 7 years ago

@ecolss Thank you! Now I'm tackling the problem.

woakesd commented 7 years ago

Changes work well here on Windows Python 3.5 64 bit.