WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
914 stars 122 forks source link

Prepare release 2.0.0.b1 #166

Closed pombredanne closed 2 years ago

pombredanne commented 2 years ago

This PR is to prepare a 2.0.0.b1 release. There is no major code changes, but a lot of code reorganization:

Note that the PR could be split in many smaller ones if this is too much to swallow at once. Also: there are failing tests, but these were failing before... but we did not knew because we never ran these, so I would not consider this as being in the way of a release.

Signed-off-by: Philippe Ombredanne pombredanne@nexb.com

pombredanne commented 2 years ago

@chaitanya710 ... there are a few tests failing in this PR. Could you check what is the problem and why these are failing? Could the test be wrong? Or why do we have such a regression? If you have a fix, can you push a branch derived from this branch with a possible fix? Thanks!

pombredanne commented 2 years ago

@chaitanya710 gentle ping

pombredanne commented 2 years ago

We have these failures on Windows only, unicode builds all pythons:

=================================== FAILURES ===================================
______________ test_basic_items_keys_and_values_with_bytes_build _______________

    @skipIf(ahocorasick.unicode, "Run only with bytes build")
    def test_basic_items_keys_and_values_with_bytes_build():
        automaton = ahocorasick.Automaton()
        words = b"he e hers his she hi him man he".split()
        #         0 1    2   3   4  5   6   7  8
        for i, w in enumerate(words):
            automaton.add_word(w, (i, w))

        expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
        expected_values = [
            (1, b'e'),
            (2, b'hers'),
            (3, b'his'),
            (4, b'she'),
            (5, b'hi'),
            (6, b'him'),
            (7, b'man'),
            (8, b'he'),
        ]

        assert sorted(automaton.keys()) == sorted(expected_keys)
        assert sorted(automaton.values()) == sorted(expected_values)
>       assert sorted(dict(automaton.items()).values()) == sorted(expected_values)
E       AssertionError: assert [(1, b'e'), (2, b'hers'), (3, b'his'), (4, b'she'), (7, b'man'), (8, b'he')] == [(1, b'e'),\n (2, b'hers'),\n (3, b'his'),\n (4, b'she'),\n (5, b'hi'),\n (6, b'him'),\n (7, b'man'),\n (8, b'he')]
E         At index 4 diff: (7, b'man') != (5, b'hi')
E         Right contains 2 more items, first extra item: (7, b'man')
E         Full diff:
E           [
E            (1,
E             b'e'),
E            (2,
E             b'hers'),
E            (3,
E             b'his'),
E            (4,
E             b'she'),
E         -  (5,
E         -   b'hi'),
E         -  (6,
E         -   b'him'),
E            (7,
E             b'man'),
E            (8,
E             b'he'),
E           ]

tests/test_basic.py:71: AssertionError
_________________________ TestTrieIterators.test_items _________________________

self = <test_unit.TestTrieIterators testMethod=test_items>

    def test_items(self):
        A = self.A
        I = []
        for i, w in enumerate(self.words):
            A.add_word(conv(w), i + 1)
            I.append((conv(w), i + 1))

        L = [x for x in A.items()]
        self.assertEqual(len(L), len(I))
>       self.assertEqual(set(L), set(I))
E       AssertionError: Items in the first set but not the second:
E       (b'w\x00o\x00', 1)
E       (b'a\x00h', 3)
E       (b'p\x00y\x00t\x00', 2)
E       (b'c\x00o\x00r\x00a\x00', 4)
E       Items in the second set but not the first:
E       (b'word', 1)
E       (b'python', 2)
E       (b'corasick', 4)
E       (b'aho', 3)

tests/test_unit.py:431: AssertionError
=========================== short test summary info ============================
FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build
FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: It...
=================== 2 failed, 156 passed, 9 skipped in 1.51s ===================
pombredanne commented 2 years ago

And these failures with a bytes build on all OSes and all Pythons:

=================================== FAILURES ===================================
______________ test_basic_items_keys_and_values_with_bytes_build _______________

    @skipIf(ahocorasick.unicode, "Run only with bytes build")
    def test_basic_items_keys_and_values_with_bytes_build():
        automaton = ahocorasick.Automaton()
        words = b"he e hers his she hi him man he".split()
        #         0 1    2   3   4  5   6   7  8
        for i, w in enumerate(words):
            automaton.add_word(w, (i, w))

        expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
        expected_values = [
            (1, b'e'),
            (2, b'hers'),
            (3, b'his'),
            (4, b'she'),
            (5, b'hi'),
            (6, b'him'),
            (7, b'man'),
            (8, b'he'),
        ]

        assert sorted(automaton.keys()) == sorted(expected_keys)
        assert sorted(automaton.values()) == sorted(expected_values)
>       assert sorted(dict(automaton.items()).values()) == sorted(expected_values)
E       AssertionError: assert [(1, b'e'), (2, b'hers'), (3, b'his'), (4, b'she'), (7, b'man'), (8, b'he')] == [(1, b'e'),\n (2, b'hers'),\n (3, b'his'),\n (4, b'she'),\n (5, b'hi'),\n (6, b'him'),\n (7, b'man'),\n (8, b'he')]
E         At index 4 diff: (7, b'man') != (5, b'hi')
E         Right contains 2 more items, first extra item: (7, b'man')
E         Full diff:
E           [
E            (1,
E             b'e'),
E            (2,
E             b'hers'),
E            (3,
E             b'his'),
E            (4,
E             b'she'),
E         -  (5,
E         -   b'hi'),
E         -  (6,
E         -   b'him'),
E            (7,
E             b'man'),
E            (8,
E             b'he'),
E           ]

tests/test_basic.py:71: AssertionError
_________________________ TestTrieIterators.test_items _________________________

self = <test_unit.TestTrieIterators testMethod=test_items>

    def test_items(self):
        A = self.A
        I = []
        for i, w in enumerate(self.words):
            A.add_word(conv(w), i + 1)
            I.append((conv(w), i + 1))

        L = [x for x in A.items()]
        self.assertEqual(len(L), len(I))
>       self.assertEqual(set(L), set(I))
E       AssertionError: Items in the first set but not the second:
E       (b'w\x00o\x00', 1)
E       (b'p\x00y\x00t\x00', 2)
E       (b'c\x00o\x00r\x00a\x00', 4)
E       (b'a\x00h', 3)
E       Items in the second set but not the first:
E       (b'python', 2)
E       (b'corasick', 4)
E       (b'word', 1)
E       (b'aho', 3)

tests/test_unit.py:431: AssertionError
=========================== short test summary info ============================
FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build
FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: It...

Weirdly enough these are the same failure than on Windows unicode builds...

pombredanne commented 2 years ago

Here is the tally:

self =

@unittest.skipIf(not on_linux, "Works only on linux")
def test_does_not_leak_memory(self):
    before = get_memory_usage()
    use_memory()
    after = get_memory_usage()
  assert before == after

E assert 116672.0 == 116928.0 E +116672.0 E -116928.0

tests/unresolved_bugs/test_bug_81.py:66: AssertionError

__ test_basic_items_keys_and_values_with_bytes_build ___

@skipIf(ahocorasick.unicode, "Run only with bytes build")
def test_basic_items_keys_and_values_with_bytes_build():
    automaton = ahocorasick.Automaton()
    words = b"he e hers his she hi him man he".split()
    #         1  2 3    4   5   6  7   8   9
    for i, w in enumerate(words, 1):
        automaton.add_word(w, (i, w))

    expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
    expected_values = [
        # the second addition munges this by design, like in a dict (1, b'he'),
        (2, b'e'),
        (3, b'hers'),
        (4, b'his'),
        (5, b'she'),
        (6, b'hi'),
        (7, b'him'),
        (8, b'man'),
        (9, b'he'),
    ]

    expected_items = [
        # the second addition munges this by design, like in a dict  (b'he', (1, b'he')),
        (b'e', (2, b'e')),
        (b'hers', (3, b'hers')),
        (b'his', (4, b'his')),
        (b'she', (5, b'she')),
        (b'hi', (6, b'hi')),
        (b'him', (7, b'him')),
        (b'man', (8, b'man')),
        (b'he', (9, b'he')),
    ]

    assert sorted(automaton.keys()) == sorted(expected_keys)
    assert sorted(automaton.values()) == sorted(expected_values)

E - (b'corasick', 4), E - (b'python', 2), E - (b'word', 1), E + (b'c\x00o\x00r\x00a\x00', 4), E + (b'p\x00y\x00t\x00', 2), E + (b'w\x00o\x00', 1), E ]

tests/test_unit.py:433: AssertionError =========================== short test summary info ============================ FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: as...

chaitanya710 commented 2 years ago

@chaitanya710 ... there are a few tests failing in this PR. Could you check what is the problem and why these are failing? Could the test be wrong? Or why do we have such a regression? If you have a fix, can you push a branch derived from this branch with a possible fix? Thanks!

@pombredanne sure I will check it and see what I can do

pombredanne commented 2 years ago

@chaitanya710 thanks @WojciechMula unless you have an objection I will merge and tag a beta release. The few pending issues existed before anyway and can be fixed afterwards

pombredanne commented 2 years ago

BTW, I do not think we have a memory leak. Rather the test was incorrect and is now corrected. We have an issue wrt. the Automaton.items() function on Windows and also for all bytes builds. See https://github.com/WojciechMula/pyahocorasick/blob/3e9478411ff361d4c3f274f01e8709bbd871caff/tests/unresolved_bugs/test_bug_81.py

pombredanne commented 2 years ago

I am merging and pushing a beta release now.