Closed woakesd closed 1 year ago
Could you please attach output from the program, on my Linux machine it prints:
$ ~/github/pyahocorasick$ python3 bug_53.py
(6, 'punched') punched
(16, 'punched') punched
I get the following:
PS C:\Projects\pyahocorasick> py -3 .\bug_53.py (6, 'punched') punched (17, 'punched') unched?
I'm pretty sure this is because in windows the string passed in is two byte characters, and the higher plane characters are sent as complementary pairs (two two byte characters).
@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.
You can use VS2015 community edition.
Good luck.
On 17 Jan 2017, at 18:31, Wojciech Muła notifications@github.com wrote:
@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
@woakesd OK, I've just reproduced the error, now need to work out a solution.
@woakesd I pushed the fix to the master. Could you please check it?
Tested this and the simple test passes, which is good.
But on a larger test with 50000 strings in the Automaton I get ValueError exceptions thrown randomly when searching a series of messages. I'm repeating the test and it fails on a different message each time.
Example:
Malformed UCS-2 string: expected a high surrogate at 1662, got dd00 (the position and code change on each run).
I'm struggling to produce a simple test which demonstrates this though. Grr.
Initially I was suspicious of my data but after running it repeatedly it fails in different places.
I don't really want to post my test data here, can I send them to you privately?
Thank you very much for testing! Please send your test at wojciech_mula@poczta.onet.pl Nothing would leak, don't worry.
Thank you! Will check out the changes later tonight or tomorrow perhaps. It may need to wait till Monday.
This is fixed as far as I am concerned so I am closing it.
I'm putting together another regression test for this as I found an issue which I think may be related
PS I found a work around which is to add one space for each character in the supplementary plane.
haystack += ' ' * len([c for c in haystack if ord(c) > 65535])
Sorry for the problem, will check it soon.
@woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?
I think not, but I’ll need to check.
David
📞 0131 332 7571 📱 07966 444 615 📧 @.*** 💻 www.woakes.net
On 20 Feb 2022, at 11:09, Philippe Ombredanne @.***> wrote:
 @woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?
— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.
@woakesd I am closing this for now: we have windows tests that pass alright. Please reopen if this is still an issue
This is an issue with Python under windows only. The attached sample illustrates the problem.
I presume this is because Windows python builds are narrow builds (16 bits per character and supplementary plane characters are stored as a supplementary pair).
I can add the 🙈 as a word and I get a match for it, though end is 9!
issue.txt