WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
918 stars 122 forks source link

Should work with bytes? #65

Open orbisvicis opened 6 years ago

orbisvicis commented 6 years ago

Hmm just noticed ahocorasick.unicode in the docs and TRIE_LETTER_TYPE in the source. Is there no way to support both bytes and strings in python 3?

orbisvicis commented 6 years ago

Constantly segfaults when I disable AHOCORASICK_UNICODE, even when not using pyahocorasick objects - py3.5.3.

WojciechMula commented 6 years ago

Bytes are supported. Could you please run python unitests.py? There're many tests that use bytes.

orbisvicis commented 6 years ago

This is with pristine pyahocorasick on python 3.5.3:

test (unittest.loader._FailedTest) ... ERROR

======================================================================
ERROR: test (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test
Traceback (most recent call last):
  File "/usr/lib64/python3.5/unittest/loader.py", line 428, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib64/python3.5/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "<...>/pyahocorasick/test.py", line 16, in <module>
    a.add_word(w, (i, w))
TypeError: string expected

----------------------------------------------------------------------
Ran 1 test in 0.001s

FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
sborpo commented 6 years ago

Hello, Is the fix to this bug is planned for the near future? I have the same problem :( Unfortunately cannot work in bytes mode

WojciechMula commented 4 years ago

Unfortunately I cannot devote as much time as previously. Waiting for contributors.

vapier commented 3 years ago

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

Dobatymo commented 3 years ago

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

Not exactly the best solution, but you can still use a codec which cannot fail to do the bytes to string conversation like latin-1.

pombredanne commented 2 years ago

@vapier hey :wave:

re:

could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.

Can you be more explicit? what would need to be fixed/changed and where?

vapier commented 2 years ago

https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+/refs/heads/release-R100-14526.B/dev-python/pyahocorasick/files/pyahocorasick-1.4.0-bytes.patch

pombredanne commented 2 years ago

@vapier you rock! .. I guess there is an interesting derived tidbit: pyahocorasick may be used in chromium OS? :innocent:

And I see you are also the author to credit for the patch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+log/refs/heads/release-R100-14526.B/dev-python/pyahocorasick/files/pyahocorasick-1.4.0-bytes.patch

I also now recall of so many of your awesome contribs like on strace and so many other places. I am not worthy.

pombredanne commented 2 years ago

@WojciechMula this deserves a 1.5 bump indeed. I reckon I am not impacted in most cases as usually stuff only integers in automatons. But I wonder if we should not publish two builds now that I started pushing pre-built wheels for all OSes: one for bytes and one for unicode?

WojciechMula commented 2 years ago

Oh man, it's really great news about usage! @pombredanne I think it's better to finish releasing and then add patch. More work, but a bit more logical. Of course IMHO it's up to you.

Anyway, it would be great if we provided specialisations for each type and then have some lightweight factory for this, even in pure python. This would be pretty easy if we have some code generator underneath.

pombredanne commented 2 years ago

I pushed some tests in this branch https://github.com/WojciechMula/pyahocorasick/tree/release-2.0.0

Define either AHOCORASICK_UNICODE=yes or AHOCORASICK_BYTES=yes to build one or the other such as: with AHOCORASICK_UNICODE=yes pip install -e . or AHOCORASICK_BYTES=yes make

There are some issues that show up:

@vapier the new environment variable should help you at the minimum get rid of your setup.py patch

Now to support strings & bytes simultaneously, we could either build both and manage a factory in Python as @WojciechMula suggested above... or find a way to deal with both together.

I have been toying with the idea of an alternative unified way (on the C side) where the number of symbols in some "alphabet" would be the key input as opposed to the implied bytes = length 1 (e.g. 256 symbols in alphabet) and unicode = variable length based on build.

Also re: https://github.com/WojciechMula/pyahocorasick/issues/65#issuecomment-1046271016

I reckon I am not impacted in most cases as usually stuff only integers in automatons.

Actually I usually use sequences of integers as keys with ahocorasick.KEY_SEQUENCE I could very much benefit from less memory usage if the KEY_SEQUENCE integers could be 16bits rather 32bits... My set of unique values are in a well defined range roughly under 15 bits long.

pombredanne commented 2 years ago

You can see the runs failures there https://github.com/WojciechMula/pyahocorasick/actions/runs/1952208882

pombredanne commented 1 year ago

At this stage all the bytes-based tests pass. I wonder if it would be useful to build two different packages like pyahocorasick and pyahocorasick-bytes to make it easier for users to pick a byte build? In this case the wheels would be built for bytes and the sdist would be by default building as bytes. This would be fairly simple.

pombredanne commented 1 year ago

We have #165 as a related issue

vapier commented 1 year ago

personally i'd rip the bandaid off, but failing that, publishing two wheels with the diff names sounds OK assuming they can be installed in parallel

i guess there's no way to build 2 C components into a single package and pick the right one based on the inputs ?