Open vegarden opened 1 year ago
This happens for me as well. 1.4.4 worked fine, 2.0.0 misses the longest match!
This is indeed a bug introduced by a previous fix for iter_long
missing legitimate short matches.
I attempt a fix #186 and add more test cases, but cross my fingers for not introducing more bugs. Can someone help review the pr? @pombredanne
Meanwhile it would be nice if anyone can test my branch and report success/bugs.
The idea is before you fail over to the suffix node, perform a check to see whether there is possibly longer match ahead the current node. If there is a longer match, we should give up the suffix failover.
ehhh, it does not fully work, since it's not enough to just look ahead one step. You need to look ahead all possible longer matches before deciding whether to fail over. Efficiency-wise that's not acceptable. Give me some time to think it over, I hope it does not require a overhaul to fully resolve this bug.
Note that in case of two partially overlapping strings it can't return the longest match, but the first match. @pombredanne mind to review?
Anyone tested this fix?
Just tested https://github.com/WojciechMula/pyahocorasick/pull/187, looks like it's working for me from this branch:
import ahocorasick
source_text = b'zzabcabdzz'
A = ahocorasick.Automaton()
for k in [b'ab', b'abcabd']:
A.add_word(k, k)
A.make_automaton()
list(A.iter_long(source_text))
Returns [(7, b'abcabd')]
But I get this test failure with make valgrind
:
___________________ MemoryUsageDoesNotGrow.test_memory_usage_does_not_grow ___________________
self = <test_issue_9.MemoryUsageDoesNotGrow testMethod=test_memory_usage_does_not_grow>
def test_memory_usage_does_not_grow(self):
ac = build_automaton()
here = Path(__file__).parent
with open(here.parent / 'README.rst') as f:
data = f.read()[:1024 * 2]
data = conv(data)
before = get_memory_usage()
for _ in range(1000):
for start in range(0, len(data) - 20):
ac.iter(data, start)
after = get_memory_usage()
> assert before == after
E assert 300244.0 == 300964.0
tests/test_issue_9.py:64: AssertionError
================================== short test summary info ===================================
FAILED tests/test_issue_9.py::MemoryUsageDoesNotGrow::test_memory_usage_does_not_grow - assert 300244.0 == 300964.0
Tests are passing otherwise: https://github.com/AyanSinhaMahapatra/pyahocorasick/actions/runs/8371293380/job/22920079498
@robinchm your fix in #187 works but seems to cause a leak.
After fix #174 which is discussed in #133. I found the code below
matches [(6, 'ab')], but it matches [(7, 'abcabd')] in version 1.4.4. The behavior before seems to be right in this situation.