WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License

Cannot use unicode in Py2 (c version) #40

Closed ericxsun closed 5 years ago

pombredanne commented 8 years ago

@ericxsun This is a known issue: on Python 2, pyahocorasick builds by default without Unicode support for keys (you can still store Unicode strings as values). Python 3 support is working. I am not sure what it would take to support this, especially in a way that works across Py2 and Py3. Do you think this is something you could help with?

ericxsun commented 8 years ago

Thanks a lot.

In Py2, the key cannot be unicode, and the string to be searched must also be a byte string, not unicode. So the index of a matched word is a byte offset, not a Unicode character offset. That is fine for English text, but it is quite expensive for Asian languages such as Chinese, because every search needs an extra encode from unicode to UTF-8. However, if all strings are already represented as UTF-8, it's OK.

I need the index of each match as a Unicode character offset, not a byte offset. What can I do?

Any help will be highly appreciated.
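
To make the byte-versus-character distinction concrete, here is a minimal sketch in plain Python 3. It does not use pyahocorasick: the sample text is an assumption, and `bytes.find` / `str.find` merely stand in for a matcher working on bytes and on Unicode respectively.

```python
# Illustration only: the sample text and the use of find() are assumptions,
# not part of pyahocorasick.
text = "我爱北京天安门"
word = "北京"

data = text.encode("utf-8")    # what a bytes-only matcher would see
needle = word.encode("utf-8")

byte_start = data.find(needle)  # offset counted in UTF-8 bytes
char_start = text.find(word)    # offset counted in Unicode characters

print(byte_start, char_start)   # 6 2
```

Each of these Chinese characters takes three bytes in UTF-8, so the two offsets diverge as soon as the text contains non-ASCII characters.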

WojciechMula commented 8 years ago

@pombredanne We could add support in Py2, since there is a real need. But there are two ongoing tasks that have to be completed first.

WojciechMula commented 8 years ago

Continuing: and for sure we need a better test suite, providing (almost) full code coverage.

ericxsun commented 8 years ago

Where in the code would refactoring be needed to support unicode in Py2? @WojciechMula

WojciechMula commented 8 years ago

@ericxsun If I understood your question correctly, there are a few places to touch. But these places are subject to change during the work on #27.

pombredanne commented 8 years ago

@WojciechMula you wrote:

and for sure we need a better test suite, providing (almost) full code coverage.

One thing that would help there is to use a more "conventional" naming and layout for the test files. I can suggest some mini surgery and reorg in a PR

pombredanne commented 8 years ago

@WojciechMula you wrote:

@ericxsun If I understood your question correctly, there are a few places to touch. But these places are subject to change during the work on #27.

At some level, arbitrary integer sequences and Unicode are more or less related. You could think of supporting an arbitrary finite range of integers as having some arbitrary alphabet (where each int maps to a character in that alphabet). There are some twists for Unicode proper, of course: even if unichr provides such a mapping, the minute details of unicode strings are a tad different from just an integer range.

I hope I make some sense! but this may not be helpful...
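
As a rough illustration of the "integers as an alphabet" idea, the sketch below maps a string to its code points and back. It is written for Python 3, where `chr` plays the role of Py2's `unichr`; the helper names `to_code_points` and `from_code_points` are hypothetical and not part of pyahocorasick.

```python
def to_code_points(text):
    """Return the Unicode code points of `text` as a tuple of ints."""
    return tuple(ord(ch) for ch in text)


def from_code_points(points):
    """Inverse mapping: rebuild the string from its code points."""
    return "".join(chr(p) for p in points)


composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(to_code_points(composed))    # (233,)
print(to_code_points(decomposed))  # (101, 769)
```

The two strings render identically but produce different integer sequences, which is one example of the kind of twist that makes Unicode text more than a plain integer range.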

WojciechMula commented 8 years ago

@pombredanne

One thing that would help there is to use a more "conventional" naming and layout for the test files. I can suggest some mini surgery and reorg in a PR

You're welcome to do so. :) The tests evolved over time, and the result of evolution is sometimes scary.

WojciechMula commented 8 years ago

@pombredanne

At some level, arbitrary integer sequences and Unicode are more or less related. You could think of supporting an arbitrary finite range of integers as having some arbitrary alphabet (where each int maps to a character in that alphabet). There are some twists for Unicode proper, of course: even if unichr provides such a mapping, the minute details of unicode strings are a tad different from just an integer range.

I think we will end up with this. But it would require conversions between Unicode/bytes and int ranges (though sometimes no conversion is needed).

BTW, I'm still considering building a different version of the module for each string type, using a code generator. I'm a bit worried about performance if everything is tested/converted at run time. However, I'm still not sure about this approach.

I hope I make some sense! but this may not be helpful...

You're definitely right. :)

ericxsun commented 8 years ago

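I found a simple way:

content = content.encode("utf-8")  # the Py2 C build searches byte strings

for end_idx, (word, prop) in automaton.iter(content):
    # end_idx is the byte offset of the last byte of the match;
    # decoding the prefix up to it turns that into a character count
    _content = content[:end_idx+1].decode("utf-8")
    word = word.decode("utf-8")

    e = len(_content)  # end of the match in characters (exclusive)
    s = e - len(word)  # start of the match in characters

    # ...

WojciechMula commented 8 years ago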

As long as decoding overhead is not significant, it's OK. But it's a workaround. :)
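
If the decoding overhead does become significant (the workaround decodes the prefix `content[:end_idx+1]` again for every match), one possible alternative is to build a byte-to-character offset table once per text. The following is only a sketch in Python 3 under stated assumptions: `byte_to_char_offsets` is a hypothetical helper, and the `matches` list is hand-built to stand in for the `(end_idx, word)` pairs that a bytes-based automaton would yield.

```python
def byte_to_char_offsets(text):
    """For each UTF-8 byte of `text`, record the index of the character it belongs to."""
    table = []
    for char_idx, ch in enumerate(text):
        table.extend([char_idx] * len(ch.encode("utf-8")))
    return table


text = "我爱北京天安门"
data = text.encode("utf-8")
table = byte_to_char_offsets(text)  # built once, reused for every match

# Hypothetical stand-in for the automaton output: (index of last byte, matched bytes)
needle = "北京".encode("utf-8")
byte_start = data.find(needle)
matches = [(byte_start + len(needle) - 1, needle)]

for end_idx, word in matches:
    e = table[end_idx] + 1               # end of the match in characters (exclusive)
    s = e - len(word.decode("utf-8"))    # start of the match in characters
    print(s, e, text[s:e])
```

The table costs one pass over the text plus one integer per byte of memory, so it trades memory for avoiding repeated prefix decoding.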