Prepare release 2.0.0.b1

pombredanne commented 2 years ago

This PR is to prepare a 2.0.0.b1 release. There is no major code changes, but a lot of code reorganization:

all C sources are under src/
all tests are under tests/ and have been updated to use pytest as a test runner. More tests have been added and all tests are designed to run on unicode and bytes builds.
all misc code has been moved to etc/
the Makefile has been adjusted to use a proper virtualenv and pytest to run tests
a new CI now builds binaries and tests the builds for both unicode and bytes on all 60 OSes and Python version combos (a test matrix with 2 versions each of Linux, macOS and Windows, 2 builds e.g.,, unicode or bytes, and 5 python version from 3.6 to 3.10) plus extra tests on each wheels builds on multiple OSes and containers.
there is no explicit support for Python 2 listed in the code and tests, though Python 2 support has not been removed it is unlikely to work

Note that the PR could be split in many smaller ones if this is too much to swallow at once. Also: there are failing tests, but these were failing before... but we did not knew because we never ran these, so I would not consider this as being in the way of a release.

Signed-off-by: Philippe Ombredanne pombredanne@nexb.com

pombredanne commented 2 years ago

@chaitanya710 ... there are a few tests failing in this PR. Could you check what is the problem and why these are failing? Could the test be wrong? Or why do we have such a regression? If you have a fix, can you push a branch derived from this branch with a possible fix? Thanks!

pombredanne commented 2 years ago

@chaitanya710 gentle ping

pombredanne commented 2 years ago

We have these failures on Windows only, unicode builds all pythons:

=================================== FAILURES ===================================
______________ test_basic_items_keys_and_values_with_bytes_build _______________

    @skipIf(ahocorasick.unicode, "Run only with bytes build")
    def test_basic_items_keys_and_values_with_bytes_build():
        automaton = ahocorasick.Automaton()
        words = b"he e hers his she hi him man he".split()
        #         0 1    2   3   4  5   6   7  8
        for i, w in enumerate(words):
            automaton.add_word(w, (i, w))

        expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
        expected_values = [
            (1, b'e'),
            (2, b'hers'),
            (3, b'his'),
            (4, b'she'),
            (5, b'hi'),
            (6, b'him'),
            (7, b'man'),
            (8, b'he'),
        ]

        assert sorted(automaton.keys()) == sorted(expected_keys)
        assert sorted(automaton.values()) == sorted(expected_values)
>       assert sorted(dict(automaton.items()).values()) == sorted(expected_values)
E       AssertionError: assert [(1, b'e'), (2, b'hers'), (3, b'his'), (4, b'she'), (7, b'man'), (8, b'he')] == [(1, b'e'),\n (2, b'hers'),\n (3, b'his'),\n (4, b'she'),\n (5, b'hi'),\n (6, b'him'),\n (7, b'man'),\n (8, b'he')]
E         At index 4 diff: (7, b'man') != (5, b'hi')
E         Right contains 2 more items, first extra item: (7, b'man')
E         Full diff:
E           [
E            (1,
E             b'e'),
E            (2,
E             b'hers'),
E            (3,
E             b'his'),
E            (4,
E             b'she'),
E         -  (5,
E         -   b'hi'),
E         -  (6,
E         -   b'him'),
E            (7,
E             b'man'),
E            (8,
E             b'he'),
E           ]

tests/test_basic.py:71: AssertionError
_________________________ TestTrieIterators.test_items _________________________

self = <test_unit.TestTrieIterators testMethod=test_items>

    def test_items(self):
        A = self.A
        I = []
        for i, w in enumerate(self.words):
            A.add_word(conv(w), i + 1)
            I.append((conv(w), i + 1))

        L = [x for x in A.items()]
        self.assertEqual(len(L), len(I))
>       self.assertEqual(set(L), set(I))
E       AssertionError: Items in the first set but not the second:
E       (b'w\x00o\x00', 1)
E       (b'a\x00h', 3)
E       (b'p\x00y\x00t\x00', 2)
E       (b'c\x00o\x00r\x00a\x00', 4)
E       Items in the second set but not the first:
E       (b'word', 1)
E       (b'python', 2)
E       (b'corasick', 4)
E       (b'aho', 3)

tests/test_unit.py:431: AssertionError
=========================== short test summary info ============================
FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build
FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: It...
=================== 2 failed, 156 passed, 9 skipped in 1.51s ===================

pombredanne commented 2 years ago

And these failures with a bytes build on all OSes and all Pythons:

=================================== FAILURES ===================================
______________ test_basic_items_keys_and_values_with_bytes_build _______________

    @skipIf(ahocorasick.unicode, "Run only with bytes build")
    def test_basic_items_keys_and_values_with_bytes_build():
        automaton = ahocorasick.Automaton()
        words = b"he e hers his she hi him man he".split()
        #         0 1    2   3   4  5   6   7  8
        for i, w in enumerate(words):
            automaton.add_word(w, (i, w))

        expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
        expected_values = [
            (1, b'e'),
            (2, b'hers'),
            (3, b'his'),
            (4, b'she'),
            (5, b'hi'),
            (6, b'him'),
            (7, b'man'),
            (8, b'he'),
        ]

        assert sorted(automaton.keys()) == sorted(expected_keys)
        assert sorted(automaton.values()) == sorted(expected_values)
>       assert sorted(dict(automaton.items()).values()) == sorted(expected_values)
E       AssertionError: assert [(1, b'e'), (2, b'hers'), (3, b'his'), (4, b'she'), (7, b'man'), (8, b'he')] == [(1, b'e'),\n (2, b'hers'),\n (3, b'his'),\n (4, b'she'),\n (5, b'hi'),\n (6, b'him'),\n (7, b'man'),\n (8, b'he')]
E         At index 4 diff: (7, b'man') != (5, b'hi')
E         Right contains 2 more items, first extra item: (7, b'man')
E         Full diff:
E           [
E            (1,
E             b'e'),
E            (2,
E             b'hers'),
E            (3,
E             b'his'),
E            (4,
E             b'she'),
E         -  (5,
E         -   b'hi'),
E         -  (6,
E         -   b'him'),
E            (7,
E             b'man'),
E            (8,
E             b'he'),
E           ]

tests/test_basic.py:71: AssertionError
_________________________ TestTrieIterators.test_items _________________________

self = <test_unit.TestTrieIterators testMethod=test_items>

    def test_items(self):
        A = self.A
        I = []
        for i, w in enumerate(self.words):
            A.add_word(conv(w), i + 1)
            I.append((conv(w), i + 1))

        L = [x for x in A.items()]
        self.assertEqual(len(L), len(I))
>       self.assertEqual(set(L), set(I))
E       AssertionError: Items in the first set but not the second:
E       (b'w\x00o\x00', 1)
E       (b'p\x00y\x00t\x00', 2)
E       (b'c\x00o\x00r\x00a\x00', 4)
E       (b'a\x00h', 3)
E       Items in the second set but not the first:
E       (b'python', 2)
E       (b'corasick', 4)
E       (b'word', 1)
E       (b'aho', 3)

tests/test_unit.py:431: AssertionError
=========================== short test summary info ============================
FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build
FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: It...

Weirdly enough these are the same failure than on Windows unicode builds...

pombredanne commented 2 years ago

Here is the tally:

I fixed the build that was mistakenly not correctly done as unicode or bytes on windows. There is still something not working and the Windows builds are always unicode for now.
the bytes build Automaton.items() method does not work and will need some love on the bytes builds

there is a regression with a memory leak on the unicode build (tested on Linux)


_____________________ TestMemory.test_does_not_leak_memory _____________________

self =

@unittest.skipIf(not on_linux, "Works only on linux")
def test_does_not_leak_memory(self):
    before = get_memory_usage()
    use_memory()
    after = get_memory_usage()

  assert before == after
E assert 116672.0 == 116928.0 E +116672.0 E -116928.0

tests/unresolved_bugs/test_bug_81.py:66: AssertionError

__ test_basic_items_keys_and_values_with_bytes_build ___

@skipIf(ahocorasick.unicode, "Run only with bytes build")
def test_basic_items_keys_and_values_with_bytes_build():
    automaton = ahocorasick.Automaton()
    words = b"he e hers his she hi him man he".split()
    #         1  2 3    4   5   6  7   8   9
    for i, w in enumerate(words, 1):
        automaton.add_word(w, (i, w))

    expected_keys = [b'e', b'hers', b'his', b'she', b'hi', b'him', b'man', b'he', ]
    expected_values = [
        # the second addition munges this by design, like in a dict (1, b'he'),
        (2, b'e'),
        (3, b'hers'),
        (4, b'his'),
        (5, b'she'),
        (6, b'hi'),
        (7, b'him'),
        (8, b'man'),
        (9, b'he'),
    ]

    expected_items = [
        # the second addition munges this by design, like in a dict  (b'he', (1, b'he')),
        (b'e', (2, b'e')),
        (b'hers', (3, b'hers')),
        (b'his', (4, b'his')),
        (b'she', (5, b'she')),
        (b'hi', (6, b'hi')),
        (b'him', (7, b'him')),
        (b'man', (8, b'man')),
        (b'he', (9, b'he')),
    ]

    assert sorted(automaton.keys()) == sorted(expected_keys)
    assert sorted(automaton.values()) == sorted(expected_values)

E - (b'corasick', 4), E - (b'python', 2), E - (b'word', 1), E + (b'c\x00o\x00r\x00a\x00', 4), E + (b'p\x00y\x00t\x00', 2), E + (b'w\x00o\x00', 1), E ]

tests/test_unit.py:433: AssertionError =========================== short test summary info ============================ FAILED tests/test_basic.py::test_basic_items_keys_and_values_with_bytes_build FAILED tests/test_unit.py::TestTrieIterators::test_items - AssertionError: as...

chaitanya710 commented 2 years ago

@chaitanya710 ... there are a few tests failing in this PR. Could you check what is the problem and why these are failing? Could the test be wrong? Or why do we have such a regression? If you have a fix, can you push a branch derived from this branch with a possible fix? Thanks!

@pombredanne sure I will check it and see what I can do

pombredanne commented 2 years ago

@chaitanya710 thanks @WojciechMula unless you have an objection I will merge and tag a beta release. The few pending issues existed before anyway and can be fixed afterwards

pombredanne commented 2 years ago

BTW, I do not think we have a memory leak. Rather the test was incorrect and is now corrected. We have an issue wrt. the Automaton.items() function on Windows and also for all bytes builds. See https://github.com/WojciechMula/pyahocorasick/blob/3e9478411ff361d4c3f274f01e8709bbd871caff/tests/unresolved_bugs/test_bug_81.py

pombredanne commented 2 years ago

I am merging and pushing a beta release now.

WojciechMula / pyahocorasick

Prepare release 2.0.0.b1 #166