Add support for Turkish 'i' casing

zherczeg commented 1 month ago

This code adds many specializations, and I am still thinking how could we optimize it.

If I understand correctly, the Turkish casing removes the i == I caseless rule, that is what it makes it incompatible with other character sets. It also adds two new rules: 'i' == U+0130 and U+0131 == I. Is this correct? I is 0x49 and i is 0x69.

Their choices for constants does makes it hard. Maybe some macros could help.

NWilson commented 1 month ago

If I understand correctly, the Turkish casing removes the i == I caseless rule, that is what it makes it incompatible with other character sets. It also adds two new rules: 'i' == U+0130 and U+0131 == I. Is this correct? I is 0x49 and i is 0x69.

Yes, exactly correct.

The most-annoying thing is that even infects the ASCII fast-path processing, not just the slower UCD record access.

Thankfully, there aren't loads of places where PCRE2 needs to worry about casing of characters. And some of them don't need to be updated - for example, we don't need change the idea of casing used by the "first code-unit" support.

What I think we want:

OP_CHARI should just do "standard case-insensitivity" checking (according to UCD)
During compiling, when we convert pattern characters to OPs, if we see something like an "i" character (and Turkish is set), or we see something affected by ASCII-caseless-restrict, then we can generate special opcodes like PT_CLIST that are totally explicit about what characters are matched.

So the Turkish (and ASCII-restrict) casing should really only affect whether characters are compiled to standard or special-cased opcodes (...and also the OP_REFI impact). That's not too bad, it's relatively restricted and clear.

I've largely only had to add Turkish pollution to chunks of code that already had ASCII-caseless-restrict pollution.

NWilson commented 1 month ago

I just noticed something. It seems the turkish flag overwrites the restricted flag. Is this correct?

No, it shouldn't do. They ought to work together fine.

CaseFolding.txt describes the case-equivalences between characters. CASELESS_RESTRICT unlinks the equivalences (from that file) which cross the ASCII boundary; it only affects the "k/Kelvin" and "s/Long-S" equivalences. TURKISH_CASING sets new equivalences for the 4 "i" characters.

I should add tests for them both at once. It looks like I forgot to do that.

zherczeg commented 1 month ago

Btw, when both restricted and turkish are enabled, the i and I will simply not match, is this correct? The new tests should test both classes and simple characters, not simple characters. More things can be done in a single test: /[a-z][^i]I/ucp,turkish,restricted

carenas commented 1 month ago

Btw, when both restricted and turkish are enabled, the i and I will simply not match, is this correct?

That is what I suggested as well but couldn't convince Nick; the documentation isn't clear about the expected behaviour (which I find confusing) either:

  re> /[\x{00FF}-\x{FFEE}]/ir,utf,turkish_casing
data> i
 0: i
data> I
 0: I

And could even qualify as a regression for #11.

NWilson commented 1 month ago

How the two flags should work together?

I've added documentation in the pcre2unicode doc page.

NWilson commented 1 month ago

Zoltan, using the "PT_CLIST" for the Turkish I characters causes an assertion failure in SLJIT:

SLJIT_ASSERT(other_cases[0] != NOTACHAR && other_cases[1] != NOTACHAR && other_cases[2] != NOTACHAR);

Maybe now would be the time to introduce OP_2CHAR... although that would mean even more JIT work to implement it (but fairly simple at least, I hope).

I've also tried to make the change in compile_ref_matchingpath() but struggling with the JIT syntax.

zherczeg commented 1 month ago

Let me do more free work for microsoft: https://github.com/zherczeg/pcre2/commit/090f194b877e289044790e8775f3ffa8d8159cf6

Btw, it would be good to test ucp (no utf) caseless turkish backreferences. Is there tests for turkish + restricted?

PhilipHazel commented 1 month ago

Picking up on a couple of points:

As far as (?r) goes, we could introduce (*CASELESSRESTRICT) without causing any incompatibility. At the same time, the use of (?r) could be deprecated for a release or two before removal. I agree with NIck (and therefore disagree with Carlo) that the need for this to vary in different parts of the pattern is minimal. However_, I note that the nearest thing Perl has to (?r) is (?aa) which can be varied throughout the pattern. This output is from perltest.sh:

Perl v5.40.0
/k(?aa)k/i
\x{212a}k
 0: \x212ak
\x{212a}\x{212a}
No match

Perl's /aa is not the same as PCRE2's /r because /aa also makes \d etc into ASCII-only items. That must have been why I implemented (?aD) etc in PCRE2 as separate switches. Perhaps we should change (?r) into (?aR) and (?t) into (?aT), though I'm just throwing out ideas here. Oops, no, (?aT) is already taken. (?aI) is free.... This is all a mess.

As far as EBCDIC is concerned, the code should compile in two different ways in an EBCDIC environment: (a) dealing with 8-bit EBCDIC as input and (b) reading Unicode encoded as UTF-8 as input. This was requested for the processing of UTF-8 files in an EBCDIC environment. In case (a) there is clearly no need for ASCII/Unicode support. As I understand it, IBM's z/OS has a native environment, which is what Ze'ev Atlas works in, but also "Unix system services" which "allows UNIX applications from other platforms to run on IBM System z mainframes running z/OS" (Wikipedia). In theory, there is a small amount of testing that can be done by compiling PCRE2 --with-ebcdic on an ASCII system, but it is many years since I tried it.

carenas commented 1 month ago

(b) reading Unicode encoded as UTF-8 as input.

but there is no UTF-8 in EBCDIC systems, are you sure it is not UTF-EBCDIC which is what Perl supports with use utf8 there?

if what they were actually using is UTF-16, then the code is correct because it is compatible with UTF-8, though and uses the same scalars for all 4 characters, but if it is the other UTF which seems to match better with EBCDIC CCSID 1026 then at least U+69 will be encoded as x89 and U+131 (AKA dotless i) would be in x79 (which are still 1 byte encodings) even with "UTF"

NWilson commented 1 month ago

You certainly can have UTF-8 files on an EBCDIC system. There's nothing magic about an EBCDIC system that stops you storing any binary on it. I think in UTF mode on an EBCDIC system, PCRE2 appears to just do... use normal UTF.

NWilson commented 1 month ago

I have:

Removed the (?t) inline option, and updated the documentation
Added a start-of-pattern option (*TURKISH_CASING)
Pulled in Zoltan's commit (many, many thanks!)
Added some additional tests

There is one test case that's failing for the JIT, which I'll have a go fixing in the morning tomorrow.

Otherwise, I think this is basically good to go.

zherczeg commented 1 month ago

It looks like my incorrect suggestion was causing the failure:

--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@@ -9525,12 +9525,13 @@ if ((common->utf || common->ucp) && (*cc == OP_REFI || *cc == OP_DNREFI))
   OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, TMP3, 0);
   CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);

-  if (refi_flag & REFI_FLAG_CASELESS_RESTRICT)
-    add_jump(compiler, &no_match, CMP(SLJIT_LESS, char1_reg, 0, SLJIT_IMM, 128));
   add_jump(compiler, &no_match, CMP(SLJIT_EQUAL, TMP2, 0, SLJIT_IMM, 0));
   OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 2);
   OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, (sljit_sw)PRIV(ucd_caseless_sets));

+  if (refi_flag & REFI_FLAG_CASELESS_RESTRICT)
+    add_jump(compiler, &no_match, CMP(SLJIT_LESS | SLJIT_32, SLJIT_MEM1(TMP2), 0, SLJIT_IMM, 128));
+
   caseless_loop = LABEL();
   OP1(SLJIT_MOV_U32, TMP1, 0, SLJIT_MEM1(TMP2), 0);
   OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, sizeof(uint32_t));

The first character of the caseset needs to be checked for ascii, not the current character. The test was good to reveal it! Thank you!

NWilson commented 1 month ago

I have updated the CASELESS_RESTRICT and TURKISH_CASING behaviour, as requested.

They generate a compile error used together (if passed as options, or as start-of-pattern options)
They both generate a compile error if used when neither UTF or UCP is enabled
TURKISH_CASING generates a compile error in the 8-bit library if used without UTF

Finally, these checks are not triggered if CASELESS_RESTRICT is set mid-pattern. In this case, if both CASELESS_RESTRICT and TURKISH_CASING are set, then Turkish is totally deactivated as long as CASELESS_RESTRICT is in force.

I have updated the docs; but I need to update the tests to match now...

NWilson commented 1 month ago

Tests updated. Surely we're done now...? I hope!

carenas commented 1 month ago

Surely we're done now...?

I think the extra restrictions put up on (?r) were unnecessarily harsh so have them prtially reverted in a couple of followup commits in my fork of this so it matches Perl better (details in the commit messages themselves).

we are missing documentation on the interaction between (?:r) and PCRE2_EXTRA_TURKISH_CASING, but the only place were such documentation could be added in pcre2pattern.3 is already convoluted enough without my help so I gave up on a patch.

rebasing the whole thing on top of HEAD and squashing all commits with a nice descriptive commit message also missing.

NWilson commented 1 month ago

OK, thanks Carlo.

I don't really see that we need to support CASELESS_RESTRICT when neither UTF nor UCP is set - it does absolutely nothing in this case - but if you want it, fine.

I've accepted your two commits, and rebased & squashed.

zherczeg commented 1 month ago

This patch looks good, maybe we should land it, and discuss / fix the remaining issues later.

PhilipHazel commented 1 month ago

I don't reallyI don't really see that we need to support CASELESS_RESTRICT when neither UTF nor UCP is set - it does absolutely nothing in this case - but if you want it, fine.

I think the use case is when the application sets CASELESS_RESTRICT and the creator of the pattern uses (UTF) or (UCP).

I will do the merge and have a play.

PhilipHazel commented 1 month ago

Does anybody understand why the scorecards action is now failing?

zherczeg commented 1 month ago

Online services are not available. It says:

429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. https://github.com/PCRE2Project/pcre2/actions/runs/11331094692/job/31510350495

Ignore this message.

PhilipHazel commented 1 month ago

I will ignore the message, but we now have a red cross on the Actions tab. I am not sure who set up this action, or exactly what it does.

zherczeg commented 1 month ago

Me neither. It says it is about security risks:

https://github.com/PCRE2Project/pcre2/blame/master/.github/workflows/scorecards.yml https://github.com/PCRE2Project/pcre2/commit/77ce1ff528edd1aac31772cbee1f19c069db32c7 https://openssf.org/projects/scorecard/

carenas commented 1 month ago

It will probably work again in the next merge, it was just rate limited this time.

PCRE2Project / pcre2

Add support for Turkish 'i' casing #521