andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Case insensitive check for Unicode #223

Open Alexey-T opened 4 years ago

Alexey-T commented 4 years ago

I got comment

It is also not fully correct. E.g. K (U+212a) should match K or k case insensitively

So our approach has weak place. We match case insensitive using 2 arrays- to lowercase, to uppercase. How to fix?

AmigoJack commented 1 year ago

Covering this for every letter is possible, but results in huge arrays. https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt covers all Unicode graphemes/letters. As described in https://www.unicode.org/L2/L1999/UnicodeData.html the last 3 columns will tell you if the letter has an uppercase/lowercase/titlecase counterpart. For example:

  1. 0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;

    • Last 3 columns are empty, so the codepoint U+0030 is uninteresting for case insensitivity.
  2. 0044;LATIN CAPITAL LETTER D;Lu;0;L;;;;;N;;;;0064;

    • The "lowercase mapping" column is filled: codepoint U+0044's lowercase counterpart is U+0064.
  3. 212A;KELVIN SIGN;Lu;0;L;004B;;;;N;DEGREES KELVIN;;;006B;

    • The "lowercase mapping" column is filled: codepoint U+212A's lowercase counterpart is U+006B.
  4. 01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5

    • Uppercase mapping is U+01C4.
    • Lowercase mapping is U+01C6.
    • Titlecase mapping (when the letter is the first of a word) is U+01C5 (itself).

From those definitions I made arrays and lookup functions myself, covering all 2360 (BMP) + 520 (SMP) cases for turning one letter to the other. Plus 47 letters and how they should be normalized when trying to "fold" them for case insensitivity, too. It's written for Delphi 7, so searching for SMP characters would only work via FindChar() and not via UniToLower(). I included a full working program testing a few examples for lowercasing individual codepoints and showing how to use uUnicodeCase.pas, which would be something TRegExpr would need to include: UnicodeCasing.zip

Alexey-T commented 1 year ago

2360+520 possible cases? For all unicode characters only these have the lowercase/uppercase pair?

AmigoJack commented 1 year ago

Yes. Capital and small letters are a concept not used in many languages/alphabets, and signs which look like capital and small variants may still be different things and can't be exchanged (f.e. Katakanas). Have a read on https://en.wikipedia.org/wiki/Letter_case

Alexey-T commented 1 year ago

I didnt see your code yet. Im not at PC. If those char codes (which are changed by case operation) are located in few compact ranges, we can write simple change-case functions (3 of them). Can you write these functions?

AmigoJack commented 1 year ago

Just look at the code - I already described there's a full working sample program with it. You should be able to understand how the unit works. Ranges are difficult to come up with - by far not all letters are aligned like Latin ones. The lookup per character=codepoint however is already optimized for a quite fast access. If you still have questions, feel free to ask them. Preferably afterwards. ;) If TRegExpr.InvertCaseFunction() still exists then you could easily change it into either using my functions or writing similar functions (or yes: I could write them). Before doing so just have a peek at the arrays and functions... and if you're fine with them I can surely also help to fully incorporate it into TRegExpr.

Alexey-T commented 1 year ago

I see the code: very big work is done. good. now i ask: now much the speed loss is? today, TRegExpr lowercase/upcase are very fast (but they need 2 130Kb arrays: to lower, to upper). will it be much slower if we replace funcs to your funcs?

if not much slower: where can we find 3 functions: to-lower, to-upper, to-title? I mean funcs which change WideChar to WideChar. your code converts WideChar to some record with Cardianal members.

maybe you can make final patch (or changed tregexpr)?

Alexey-T commented 1 year ago

this place is to change in regexpr.pas:

{$IFDEF FastUnicodeData}
function _UpperCase(Ch: REChar): REChar; {$IFDEF InlineFuncs}inline;{$ENDIF}
begin
  Result := CharUpperArray[Ord(Ch)];
end;

function _LowerCase(Ch: REChar): REChar; {$IFDEF InlineFuncs}inline;{$ENDIF}
begin
  Result := CharLowerArray[Ord(Ch)];
end;
AmigoJack commented 1 year ago

No, you haven't understood it throughly. Just because it's a function you don't need to care for the result - its parameter is a reference, not only a value. You could do:

function _LowerCase( Ch: REChar ): REChar;
begin
  result:= Ch;
  uUnicodeCase.UniToLower( result );  // parameter is VAR for a reason
end;

The function result just tells you if there was a match. Both other functions act the same way. Be aware that in those functions I casted the WideChar parameter to Word (both being 16 bit) due to only acting on Delphi 7. But Unicode wise this only works for codepoints U+0000 to U+FFFF, ignoring anything beyond (like Emojis). I'm not sure how Char works on other Delphi or FPC versions - logic wise it must also be capable to store at least 20 bit (likely a Cardinal = 32 bit). In Delphi 7 WideString encodes text in UTF-16, so any codepoints beyond U+FFFF are stored in two WideChars (which poses a logical problem to properly convert them to upper/lower/title case. Luckily Unicode's Supplementary Multilingual Plane (codepoints U+10000 to U+1FFFF) has only lower/upper case letters which should be encountered very rarily (I guess that's the reason why all SMP isn't even available/ignored in your regexpr_unicodedata.pas).

Speed wise it must be slower than your two giant arrays (at worst 177 jumps and 15 loop iterations for 1 character, at best 1 jump and 1 loop iteration), but on the contrary my approach should need much less memory/storage. If you want I can benchmark both (or at least write code for it). Sometimes I'm eager for speed, sometimes I want to avoid bloat - both approaches have their advantages.

Either way - codepoints beyond U+FFFF should be dealth with, also for the Unicode category (as "word character"). If picking one character out of a String in recent Delphi versions can go beyond U+FFFF then let me at least double two of your arrays, so they also go until U+1FFFF, and increase the category array up to U+E01EF (14x as big). We could provide compiler switches for people who really want all Unicode codepoints covered and don't care that they then have 256 KiB + 256 KiB + 896,5 KiB = 1,38 MiB of constant data.

Alexey-T commented 1 year ago

I got it, so I can write the patch. one thing why Im not entusuast here. i don't understand the code, ie those 2 arrays, how you create them. w/o understanding it, i won't comply with idea.

AmigoJack commented 1 year ago

You won't like it immediately, because it's again code that needs to be understood. But it got all you want:

But it all works:

It still only works for BMP (that is U+0000 to U+FFFF) and not beyond. The test project may also be a base for benchmarking/comparing both ways (this against the huge arrays) for speed (any maybe memory consumption).

make_unicase.zip