Closed ericxsun closed 5 years ago
Thanks a lot.
In Py2, keys cannot be unicode, and the string to be matched must also be a byte string, not unicode. So the indices of matched words are byte offsets, not unicode offsets. That is fine for English text, but it is quite expensive for Asian languages such as Chinese, because every call needs an extra encode from unicode to UTF-8. However, if all strings are already represented as UTF-8, it's OK.
I need the match indices to be in unicode units (characters), not bytes. What can I do?
Any help will be highly appreciated.
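To make the mismatch concrete, here is a minimal sketch using only built-in string methods (no pyahocorasick involved; the sample text and variable names are made up for illustration). It shows how the same match lands at different offsets depending on whether you count characters or UTF-8 bytes:

```python
# -*- coding: utf-8 -*-
text = u"我爱北京天安门"         # 7 characters
data = text.encode("utf-8")      # 21 bytes: each of these characters takes 3 bytes in UTF-8
key = u"北京"

char_start = text.find(key)                  # 2, counted in characters
byte_start = data.find(key.encode("utf-8"))  # 6, counted in bytes
```

For pure ASCII text the two offsets coincide, which is why the problem only shows up with non-ASCII input.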
@pombredanne We could add support in Py2, since there is a real need. But there are two ongoing tasks that have to be completed first.
Continuing: and for sure we need a better test suite, providing (almost) full code coverage.
Where should the code be refactored to support unicode in Py2? @WojciechMula
@ericxsun If I understood your question correctly, there are a few places to touch. But these places are subject to change during the work on #27.
@WojciechMula you wrote:
and for sure we need a better test suite, providing (almost) full code coverage.
One thing that would help there is to use a more "conventional" naming and layout for the test files. I can suggest some mini surgery and reorg in a PR
@WojciechMula you wrote:
@ericxsun If I understood your question correctly, there are a few places to touch. But these places are subject to change during the work on #27.
At some level arbitrary integer sequences and Unicode are more or less related. You could think of supporting an arbitrary finite range of integers as having some arbitrary alphabet (where each int could be mapped to a character in that alphabet) with some twists of course for Unicode proper as even if unichr provides such a mapping, the minute details of unicode strings are actually a tad different than just an integer range.
I hope I make some sense! but this may not be helpful...
@pombredanne
One thing that would help there is to use a more "conventional" naming and layout for the test files. I can suggest some mini surgery and reorg in a PR
You're welcome :) The tests evolved over time, and the result of evolution is sometimes scary.
@pombredanne
At some level arbitrary integer sequences and Unicode are more or less related. You could think of supporting an arbitrary finite range of integers as having some arbitrary alphabet (where each int could be mapped to a character in that alphabet) with some twists of course for Unicode proper as even if unichr provides such a mapping, the minute details of unicode strings are actually a tad different than just an integer range.
I think we will end up with this. But this would require conversions between Unicode/bytes and int ranges (sometimes no conversion is needed).
BTW I'm still considering building different versions of the module for each string type, using a code generator. I'm a bit worried about performance if everything is tested/converted at run time. However, I'm still not sure about this approach.
I hope I make some sense! but this may not be helpful...
You're definitely right. :)
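I found a simple way:
# Py2 workaround: match on UTF-8 bytes, then convert byte offsets to unicode offsets
content = content.encode("utf-8")
for end_idx, (word, prop) in automaton.iter(content):
    # Decode everything up to the end of the match; its length is the unicode end offset (exclusive)
    _content = content[:end_idx+1].decode("utf-8")
    word = word.decode("utf-8")
    e = len(_content)
    s = e - len(word)
    ....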
As long as decoding overhead is not significant, it's OK. But it's a workaround. :)
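If the per-match decoding of the prefix does become significant (it re-decodes the start of the text for every match), one possible alternative is to build a byte-to-character offset table once and look matches up in it. The following is only a sketch under stated assumptions: the helper names `byte_to_char_offsets` and `to_char_spans` are hypothetical and not part of pyahocorasick, matches are assumed to arrive as (end byte index, key bytes) pairs, and offsets are counted in code points, which can differ from `len()` on a narrow Py2 build for characters outside the BMP.

```python
# -*- coding: utf-8 -*-
def byte_to_char_offsets(utf8_bytes):
    """Map each byte offset to the index of the character that contains it.

    Hypothetical helper, not part of pyahocorasick.
    """
    offsets = []
    char_idx = -1
    for b in bytearray(utf8_bytes):   # bytearray yields ints on both Py2 and Py3
        if (b & 0xC0) != 0x80:        # not a UTF-8 continuation byte: a new character starts
            char_idx += 1
        offsets.append(char_idx)
    return offsets


def to_char_spans(utf8_bytes, byte_matches):
    """Convert (end_byte_idx, key_bytes) pairs to (start, end) character spans, end exclusive."""
    table = byte_to_char_offsets(utf8_bytes)   # built once, O(len(utf8_bytes))
    spans = []
    for end_idx, key in byte_matches:
        start_byte = end_idx - len(key) + 1    # first byte of the matched key
        spans.append((table[start_byte], table[end_idx] + 1))
    return spans


text = u"我爱北京天安门"
data = text.encode("utf-8")
key = u"北京".encode("utf-8")
# (11, key) stands in for what a byte-level matcher would report: end byte index 11
spans = to_char_spans(data, [(11, key)])
```

With the loop above, the pairs would presumably be collected as `(end_idx, word)` before `word` is decoded. The table costs one integer per input byte, so this trades memory for avoiding repeated decoding.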
@ericxsun This is a known issue: by default, pyahocorasick builds without Unicode support for keys on Python 2 (you can still store Unicode strings as values). Python 3 support is working. I am not sure what it would entail to support this, especially in a way that works across both Py2 and Py3. Do you think this is something you could help with?