WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
948 stars 125 forks source link

End position out after supplementary plane character in Windows #53

Closed woakesd closed 1 year ago

woakesd commented 7 years ago

This is an issue with Python under windows only. The attached sample illustrates the problem.

I presume this is because Windows python builds are narrow builds (16 bits per character and supplementary plane characters are stored as a supplementary pair).

I can add the 🙈 as a word and I get a match for it, though end is 9!

issue.txt

WojciechMula commented 7 years ago

Could you please attach output from the program, on my Linux machine it prints:

$ ~/github/pyahocorasick$ python3 bug_53.py 
(6, 'punched') punched
(16, 'punched') punched
woakesd commented 7 years ago

I get the following:

PS C:\Projects\pyahocorasick> py -3 .\bug_53.py (6, 'punched') punched (17, 'punched') unched?

I'm pretty sure this is because in windows the string passed in is two byte characters, and the higher plane characters are sent as complementary pairs (two two byte characters).

WojciechMula commented 7 years ago

@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.

woakesd commented 7 years ago

You can use VS2015 community edition.

Good luck.

On 17 Jan 2017, at 18:31, Wojciech Muła notifications@github.com wrote:

@woakesd I will check it tomorrow, when download MSVC build toolchain. I have just MSVC 2008, Py3k needs the newest one.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

WojciechMula commented 7 years ago

@woakesd OK, I've just reproduced the error, now need to work out a solution.

WojciechMula commented 7 years ago

@woakesd I pushed the fix to the master. Could you please check it?

woakesd commented 7 years ago

Tested this and the simple test passes, which is good.

But on a larger test with 50000 strings in the Automaton I get ValueError exceptions thrown randomly when searching a series of messages. I'm repeating the test and it fails on a different message each time.

Example:

Malformed UCS-2 string: expected a high surrogate at 1662, got dd00 (the position and code change on each run).

I'm struggling to produce a simple test which demonstrates this though. Grr.

Initially I was suspicious of my data but after running it repeatedly it fails in different places.

woakesd commented 7 years ago

I don't really want to post my test data here, can I send them to you privately?

WojciechMula commented 7 years ago

Thank you very much for testing! Please send your test at wojciech_mula@poczta.onet.pl Nothing would leak, don't worry.

woakesd commented 7 years ago

Thank you! Will check out the changes later tonight or tomorrow perhaps. It may need to wait till Monday.

woakesd commented 7 years ago

This is fixed as far as I am concerned so I am closing it.

woakesd commented 7 years ago

I'm putting together another regression test for this as I found an issue which I think may be related

woakesd commented 7 years ago

PS I found a work around which is to add one space for each character in the supplementary plane.

haystack += ' ' * len([c for c in haystack if ord(c) > 65535])
WojciechMula commented 7 years ago

Sorry for the problem, will check it soon.

pombredanne commented 2 years ago

@woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?

woakesd commented 2 years ago

I think not, but I’ll need to check.

David

📞 0131 332 7571 📱 07966 444 615 📧 @.*** 💻 www.woakes.net

On 20 Feb 2022, at 11:09, Philippe Ombredanne @.***> wrote:

 @woakesd Is this still an issue? I think this was fixed, but I am not 100% sure... would you be kind enough to check on the latest checkout?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.

pombredanne commented 1 year ago

@woakesd I am closing this for now: we have windows tests that pass alright. Please reopen if this is still an issue