WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

'iter_long' misses the substring match when the overlapped pattern doesn't match #133

Closed twang18 closed 1 year ago

twang18 commented 3 years ago

Hi, I did a bulk of matches with 'iter_long', and most of the time it worked well. However when the overlapped pattern doesn't match, the inclusive substring won't be able to match.

source_text = 'abc'
A = ahocorasick.Automaton()
for k in ['b', 'c', 'abd']:
    A.add_word(k, k)
A.make_automaton()

list(A.iter_long(source_text))
 [(2, 'c')]

which I think should match [(1, 'b')] as well.

Similarly for Chinese characters,

source_text = '国家知识产权'
A = ahocorasick.Automaton()
for k in ['知识产权', '国家知识产权局']:
    A.add_word(k, k)
A.make_automaton()
list(A.iter_long(source_text))
[]

In this case, [(5, '知识产权')] is supposed to be matched.

Is this a bug of 'iter_long' or just it was designed to behave like this? Thanks!

pombredanne commented 2 years ago

@luzm3 ping ... #138 was closed in favor of this

junshipeng commented 2 years ago

I have the same problem, how to solve it

JIANZM commented 2 years ago

I found the same problem.

robinchm commented 2 years ago

@WojciechMula Hi, any chance of fixing this bug? This is quite severe because now the algorithm output is simply incorrect. I can give another example to reliably trigger this behavior for pyahocorasick 1.4.4 (also tested in 2.0.0b1 the bug persists):

import ahocorasick

a = ahocorasick.Automaton()
a.add_word("bc", "bc")
a.add_word("abcdefg", "abcdefg")
a.make_automaton()

text = "abcxybc"
for i, s in a.iter_long(text):
    print(i, s)

This snippet should give two lines of output, but instead it only outputs the second "bc".

pombredanne commented 2 years ago

@robinchm I am not sure we are talking about the same feature... iter_long will return the longest of two matches starting at the same position.... It never returns partial matches: this is not a feature of the library at all. This works fine for me:

>>> import ahocorasick
>>> a = ahocorasick.Automaton()
>>> a.add_word("bc", "bc")
True
>>> a.add_word("abcdefg", "abcdefg")
True
>>> a.add_word("abcx", "abcx")
True
>>> a.make_automaton()
>>> text = "abcxybc"
>>> for i, s in a.iter_long(text):
...     print(i, s)
... 
3 abcx
6 bc
pombredanne commented 2 years ago

@robinchm my mistake... this is indeed an issue:

>>> import ahocorasick
>>> a = ahocorasick.Automaton()
>>> a.add_word("bc", "bc")
True
>>> a.add_word("abcdefg", "abcdefg")
True
>>> a.add_word("abc x", "abc x")
True
>>> a.make_automaton()
>>> text = "abcxybc"
>>> for i, s in a.iter_long(text):
...     print(i, s)
... 
6 bc # <==incorrect is missing 2 bc
>>> a = ahocorasick.Automaton()
>>> a.add_word("bc", "bc")
True
>>> a.add_word("abcdefg", "abcdefg")
True
>>> a.add_word("abc x", "abc x")
True
>>> a.make_automaton()
>>> text = "abcxybc"
>>> for i, s in a.iter(text):
...     print(i, s)
... 
2 bc
6 bc # <==correct
robinchm commented 2 years ago

@pombredanne I've put up a simpler example to reproduce this issue.

automaton = ahocorasick.Automaton()
automaton.add_word("b", "b")
automaton.add_word("abc", "abc")
automaton.make_automaton()

text = "abb"
for i, s in automaton.iter_long(text):
    print(i, s)

The correct output should be 1, b; 2, b, while the actual output is only 2, b. I am looking into the source code, but struggling as I have little experience in python-C binding.

robinchm commented 2 years ago

https://github.com/WojciechMula/pyahocorasick/blob/0bd7daa76b7ef7c7146eda86aa3e23b7341c0871/src/AutomatonSearchIterLong.c#L118-L125

I think it's at this if block, when the next node is not eow, it's simply skipped. Instead it should output the result pointed by the "dictionary suffix" of the next node if that exists. To achieve this there needs to be an else block that returns immediately when matching failed, using the "dictionary suffix" of the next node (which is next->fail I guess?).

I attempted a fix #173 , @pombredanne mind to take a look? May very well be buggy since I don't fully understand how this library works.

robinchm commented 2 years ago

nvm, the fix is incorrect.

robinchm commented 2 years ago

Well, put up another fix with more tests. #174 @pombredanne do you mind to take a look? Thanks.

vegarden commented 1 year ago

@pombredanne @robinchm After the fix, in version 2.0.0, I found the code below

source_text = 'zzabcabdzz'
A = ahocorasick.Automaton()
for k in ['ab', 'abcabd']:
    A.add_word(k, k)

A.make_automaton()

list(A.iter_long(source_text))

matches [(6, 'ab')], but it matches [(7, 'abcabd')] in version 1.4.4. I think 1.4.4 is right in this situation.

norths7ar commented 1 year ago

same here chemicals = ['trimethoprim', 'sulfamethoxazole', 'meth'] A = ahocorasick.Automaton() for word in chemicals: A.add_word(word, word) A.make_automaton() sentence = 'sulfamethoxazole and trimethoprim' print(list(A.iter_long(sentence)))

The output is [(8, 'meth'), (27, 'meth')], which I believe should be 'sulfamethoxazole' and 'trimethoprim'.