WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
914 stars 122 forks source link

new 'iter_long' match misses after the recent fix #185

Open vegarden opened 1 year ago

vegarden commented 1 year ago

After fix #174 which is discussed in #133. I found the code below

source_text = 'zzabcabdzz'
A = ahocorasick.Automaton()
for k in ['ab', 'abcabd']:
    A.add_word(k, k)

A.make_automaton()

list(A.iter_long(source_text))

matches [(6, 'ab')], but it matches [(7, 'abcabd')] in version 1.4.4. The behavior before seems to be right in this situation.

doctorink commented 1 year ago

This happens for me as well. 1.4.4 worked fine, 2.0.0 misses the longest match!

robinchm commented 1 year ago

This is indeed a bug introduced by a previous fix for iter_long missing legitimate short matches. I attempt a fix #186 and add more test cases, but cross my fingers for not introducing more bugs. Can someone help review the pr? @pombredanne Meanwhile it would be nice if anyone can test my branch and report success/bugs.

The idea is before you fail over to the suffix node, perform a check to see whether there is possibly longer match ahead the current node. If there is a longer match, we should give up the suffix failover.

robinchm commented 1 year ago

ehhh, it does not fully work, since it's not enough to just look ahead one step. You need to look ahead all possible longer matches before deciding whether to fail over. Efficiency-wise that's not acceptable. Give me some time to think it over, I hope it does not require a overhaul to fully resolve this bug.

robinchm commented 1 year ago

187 Come up with another fix, hopefully it works this time. Feedback welcomed.

Note that in case of two partially overlapping strings it can't return the longest match, but the first match. @pombredanne mind to review?

djstrong commented 8 months ago

Anyone tested this fix?

AyanSinhaMahapatra commented 3 months ago

Just tested https://github.com/WojciechMula/pyahocorasick/pull/187, looks like it's working for me from this branch:

import  ahocorasick

source_text = b'zzabcabdzz'
A = ahocorasick.Automaton()
for k in [b'ab', b'abcabd']:
    A.add_word(k, k)

A.make_automaton()

list(A.iter_long(source_text))

Returns [(7, b'abcabd')]

But I get this test failure with make valgrind:

___________________ MemoryUsageDoesNotGrow.test_memory_usage_does_not_grow ___________________

self = <test_issue_9.MemoryUsageDoesNotGrow testMethod=test_memory_usage_does_not_grow>

    def test_memory_usage_does_not_grow(self):

        ac = build_automaton()

        here = Path(__file__).parent
        with open(here.parent / 'README.rst') as f:
            data = f.read()[:1024 * 2]
            data = conv(data)

        before = get_memory_usage()

        for _ in range(1000):
            for start in range(0, len(data) - 20):
                ac.iter(data, start)

        after = get_memory_usage()
>       assert before == after
E       assert 300244.0 == 300964.0

tests/test_issue_9.py:64: AssertionError
================================== short test summary info ===================================
FAILED tests/test_issue_9.py::MemoryUsageDoesNotGrow::test_memory_usage_does_not_grow - assert 300244.0 == 300964.0

Tests are passing otherwise: https://github.com/AyanSinhaMahapatra/pyahocorasick/actions/runs/8371293380/job/22920079498

pombredanne commented 3 months ago

@robinchm your fix in #187 works but seems to cause a leak.