WojciechMula / pyahocorasick

Python module (C extension and plain python) implementing Aho-Corasick algorithm
BSD 3-Clause "New" or "Revised" License
927 stars 122 forks source link

Is case-insensitive string match supported? #132

Open twang18 opened 3 years ago

twang18 commented 3 years ago

Hi, thanks for the great work! I am wondering if case-insensitive string match is supported. For example, when there is "information system" in the built Trie, and it can match "Information system", "Information System" and "INFORMATION SYSTEM" as well?

Dobatymo commented 3 years ago

To support case-insensitive matching you should insert lowercase words into the trie and then lowercase your input before you search for it.

twang18 commented 3 years ago

Thanks for the suggestion! This could be one solution.

WojciechMula commented 3 years ago

There's no case insensitive search option and TBH I don't like to add this to the core library. What I may suggest as an enhancement to the library is to provide an extra python wrapper class that would keep the interface, but do lowercase conversion underneath.

pombredanne commented 3 years ago

IMHO case insensitive is something best done outside (that's the way I do this)

zhu commented 3 years ago

https://github.com/nppoly/cyac#unicode Wrong offset may returns when convert to lowercase outside.

WojciechMula commented 3 years ago

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

zhu commented 3 years ago

@zhu Thanks, that's important. So, a safe way would be to use UCS-32 (4 bytes per code point). Which is not very memory-usage friendly.

I think UCS-32 cannot solve it. u"İ".lower() generate 2 unicode code point. And u"İ".lower().upper() != u"İ". There's some special casing: https://stackoverflow.com/questions/23524231/lower-case-of-turkish-character-dotted-i http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

casing incurs a change in string length or is dependent on context or locale

WojciechMula commented 3 years ago

@zhu Thank a lot. It is more than strange. So there is no perfect solution.

pombredanne commented 3 years ago

I actually also experienced this same rare issue (not specific to pyahocorasick): https://github.com/nexB/scancode-toolkit/blob/f3ef5b4ad823577e507d673a7fbc65d5efe4f6af/src/licensedcode/match.py#L1432

NOTE: we have a rare Unicode bug/issue because of some Unicode codepoint such as some Turkish characters that decompose char + punct when casefolded. See https://github.com/nexB/scancode-toolkit/issues/1872 See also: https://bugs.python.org/issue34723#msg359514

This is eventually a well known problem known as "The Turkish-I Problem" See:

The way I solved it on my side is to check the length does not change when folding case, and if so do something special. This slows down everything of course. Even more so in my case because I split on punctuation before I lowercase (which makes me think that I could simplify my case by lower after my split ...)

See also https://bugs.python.org/issue34723#msg359514 and in particular these messages:

I know it is not finalized and released yet but are you going to implement Version 14.0.0 of the Unicode Standard? It finally solves the issue of Turkish lower/upper case 'I' and 'i'.

Here is the document

We don't update the unicodedata database in patch releases because updates are backwards incompatible. Python 3.9 will ship with 13.0. Python 3.10 is going to ship with 14.0.

IMHO we should wait for Python 3.10

pombredanne commented 3 years ago

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

zhu commented 3 years ago

Note that the rationale for my suggestion to wait is that Python is implementing Unicode and this is a bug in Unicode fixed in Unicode 14+ by introducing a new lower case dotted i character for Turkish that will have a length of one. Python 3.10 will implement Unicode 14

It could be fixed here too of course, but it would only hold if you make an assumption that we are processing unicode strings, which may not be always true.

German 'ß'.lower() == 'ß', 'ß'.upper() == 'SS', 'ẞ'.lower() == 'ß', 'ß'.upper().lower() == 'ss', 'ß'.casefold() == 'ss'. http://www.personal.psu.edu/ejp10/blogs/gotunicode/2008/07/a-new-german-unicode-letter-ca.html Seems the scancode-toolkit cannot handle this case.

BTW: Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. Unicode normalization also may change the text length. It has similar issues.

It'd better find a library which can remember the char offset when normalizing and casefolding text.

WojciechMula commented 3 years ago

Thank you @pombredanne for the explanation. I always thought that Unicode is just for assigning the numbers to characters. :)

pombredanne commented 1 year ago

So to the point of @WojciechMula I would rather keep this in a separate add-on and not in the C core. @zhu we now support bytes and unicode, so there should enough to do something with all the detailed gathered here. Would you be willing to help to craft this?