Open orbisvicis opened 6 years ago
Constantly segfaults when I disable AHOCORASICK_UNICODE, even when not using pyahocorasick objects - py3.5.3.
Bytes are supported. Could you please run python unitests.py? There're many tests that use bytes.
This is with pristine pyahocorasick on python 3.5.3:
test (unittest.loader._FailedTest) ... ERROR
======================================================================
ERROR: test (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test
Traceback (most recent call last):
File "/usr/lib64/python3.5/unittest/loader.py", line 428, in _find_test_path
module = self._get_module_from_name(name)
File "/usr/lib64/python3.5/unittest/loader.py", line 369, in _get_module_from_name
__import__(name)
File "<...>/pyahocorasick/test.py", line 16, in <module>
a.add_word(w, (i, w))
TypeError: string expected
----------------------------------------------------------------------
Ran 1 test in 0.001s
FAILED (errors=1)
Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
error: Test failed: <unittest.runner.TextTestResult run=1 errors=1 failures=0>
Hello, Is the fix to this bug is planned for the near future? I have the same problem :( Unfortunately cannot work in bytes mode
Unfortunately I cannot devote as much time as previously. Waiting for contributors.
could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.
could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.
Not exactly the best solution, but you can still use a codec which cannot fail to do the bytes to string conversation like latin-1
.
@vapier hey :wave:
re:
could we at least flip the default and make a 1.5 bump ? it's easy to convert strings->bytes, but hard to convert bytes->strings reliably.
Can you be more explicit? what would need to be fixed/changed and where?
@vapier you rock! .. I guess there is an interesting derived tidbit: pyahocorasick may be used in chromium OS? :innocent:
And I see you are also the author to credit for the patch https://chromium.googlesource.com/chromiumos/overlays/chromiumos-overlay/+log/refs/heads/release-R100-14526.B/dev-python/pyahocorasick/files/pyahocorasick-1.4.0-bytes.patch
I also now recall of so many of your awesome contribs like on strace and so many other places. I am not worthy.
@WojciechMula this deserves a 1.5 bump indeed. I reckon I am not impacted in most cases as usually stuff only integers in automatons. But I wonder if we should not publish two builds now that I started pushing pre-built wheels for all OSes: one for bytes and one for unicode?
Oh man, it's really great news about usage! @pombredanne I think it's better to finish releasing and then add patch. More work, but a bit more logical. Of course IMHO it's up to you.
Anyway, it would be great if we provided specialisations for each type and then have some lightweight factory for this, even in pure python. This would be pretty easy if we have some code generator underneath.
I pushed some tests in this branch https://github.com/WojciechMula/pyahocorasick/tree/release-2.0.0
Define either AHOCORASICK_UNICODE=yes
or AHOCORASICK_BYTES=yes
to build one or the other such as:
with AHOCORASICK_UNICODE=yes pip install -e .
or AHOCORASICK_BYTES=yes make
There are some issues that show up:
@vapier the new environment variable should help you at the minimum get rid of your setup.py patch
Now to support strings & bytes simultaneously, we could either build both and manage a factory in Python as @WojciechMula suggested above... or find a way to deal with both together.
I have been toying with the idea of an alternative unified way (on the C side) where the number of symbols in some "alphabet" would be the key input as opposed to the implied bytes = length 1 (e.g. 256 symbols in alphabet) and unicode = variable length based on build.
Also re: https://github.com/WojciechMula/pyahocorasick/issues/65#issuecomment-1046271016
I reckon I am not impacted in most cases as usually stuff only integers in automatons.
Actually I usually use sequences of integers as keys with ahocorasick.KEY_SEQUENCE I could very much benefit from less memory usage if the KEY_SEQUENCE integers could be 16bits rather 32bits... My set of unique values are in a well defined range roughly under 15 bits long.
You can see the runs failures there https://github.com/WojciechMula/pyahocorasick/actions/runs/1952208882
Automaton.items()
methodAt this stage all the bytes-based tests pass. I wonder if it would be useful to build two different packages like pyahocorasick and pyahocorasick-bytes to make it easier for users to pick a byte build? In this case the wheels would be built for bytes and the sdist would be by default building as bytes. This would be fairly simple.
We have #165 as a related issue
personally i'd rip the bandaid off, but failing that, publishing two wheels with the diff names sounds OK assuming they can be installed in parallel
i guess there's no way to build 2 C components into a single package and pick the right one based on the inputs ?
Hmm just noticed
ahocorasick.unicode
in the docs andTRIE_LETTER_TYPE
in the source. Is there no way to support both bytes and strings in python 3?