Closed NWilson closed 1 month ago
If I understand correctly, the Turkish casing removes the i == I caseless rule, that is what it makes it incompatible with other character sets. It also adds two new rules: 'i' == U+0130 and U+0131 == I. Is this correct? I is 0x49 and i is 0x69.
Yes, exactly correct.
The most-annoying thing is that even infects the ASCII fast-path processing, not just the slower UCD record access.
Thankfully, there aren't loads of places where PCRE2 needs to worry about casing of characters. And some of them don't need to be updated - for example, we don't need change the idea of casing used by the "first code-unit" support.
What I think we want:
So the Turkish (and ASCII-restrict) casing should really only affect whether characters are compiled to standard or special-cased opcodes (...and also the OP_REFI impact). That's not too bad, it's relatively restricted and clear.
I've largely only had to add Turkish pollution to chunks of code that already had ASCII-caseless-restrict pollution.
I just noticed something. It seems the turkish flag overwrites the restricted flag. Is this correct?
No, it shouldn't do. They ought to work together fine.
CaseFolding.txt describes the case-equivalences between characters. CASELESS_RESTRICT unlinks the equivalences (from that file) which cross the ASCII boundary; it only affects the "k/Kelvin" and "s/Long-S" equivalences. TURKISH_CASING sets new equivalences for the 4 "i" characters.
I should add tests for them both at once. It looks like I forgot to do that.
Btw, when both restricted and turkish are enabled, the i
and I
will simply not match, is this correct? The new tests should test both classes and simple characters, not simple characters. More things can be done in a single test: /[a-z][^i]I/ucp,turkish,restricted
Btw, when both restricted and turkish are enabled, the
i
andI
will simply not match, is this correct?
That is what I suggested as well but couldn't convince Nick; the documentation isn't clear about the expected behaviour (which I find confusing) either:
re> /[\x{00FF}-\x{FFEE}]/ir,utf,turkish_casing
data> i
0: i
data> I
0: I
And could even qualify as a regression for #11.
How the two flags should work together?
I've added documentation in the pcre2unicode
doc page.
Zoltan, using the "PT_CLIST" for the Turkish I characters causes an assertion failure in SLJIT:
SLJIT_ASSERT(other_cases[0] != NOTACHAR && other_cases[1] != NOTACHAR && other_cases[2] != NOTACHAR);
Maybe now would be the time to introduce OP_2CHAR... although that would mean even more JIT work to implement it (but fairly simple at least, I hope).
I've also tried to make the change in compile_ref_matchingpath()
but struggling with the JIT syntax.
Let me do more free work for microsoft: https://github.com/zherczeg/pcre2/commit/090f194b877e289044790e8775f3ffa8d8159cf6
Btw, it would be good to test ucp (no utf) caseless turkish backreferences. Is there tests for turkish + restricted?
Picking up on a couple of points:
As far as (?r) goes, we could introduce (*CASELESSRESTRICT) without causing any incompatibility. At the same time, the use of (?r) could be deprecated for a release or two before removal. I agree with NIck (and therefore disagree with Carlo) that the need for this to vary in different parts of the pattern is minimal. However_, I note that the nearest thing Perl has to (?r) is (?aa) which can be varied throughout the pattern. This output is from perltest.sh:
Perl v5.40.0
/k(?aa)k/i
\x{212a}k
0: \x212ak
\x{212a}\x{212a}
No match
Perl's /aa is not the same as PCRE2's /r because /aa also makes \d etc into ASCII-only items. That must have been why I implemented (?aD) etc in PCRE2 as separate switches. Perhaps we should change (?r) into (?aR) and (?t) into (?aT), though I'm just throwing out ideas here. Oops, no, (?aT) is already taken. (?aI) is free.... This is all a mess.
As far as EBCDIC is concerned, the code should compile in two different ways in an EBCDIC environment: (a) dealing with 8-bit EBCDIC as input and (b) reading Unicode encoded as UTF-8 as input. This was requested for the processing of UTF-8 files in an EBCDIC environment. In case (a) there is clearly no need for ASCII/Unicode support. As I understand it, IBM's z/OS has a native environment, which is what Ze'ev Atlas works in, but also "Unix system services" which "allows UNIX applications from other platforms to run on IBM System z mainframes running z/OS" (Wikipedia). In theory, there is a small amount of testing that can be done by compiling PCRE2 --with-ebcdic on an ASCII system, but it is many years since I tried it.
(b) reading Unicode encoded as UTF-8 as input.
but there is no UTF-8 in EBCDIC systems, are you sure it is not UTF-EBCDIC which is what Perl supports with use utf8
there?
if what they were actually using is UTF-16, then the code is correct because it is compatible with UTF-8, though and uses the same scalars for all 4 characters, but if it is the other UTF which seems to match better with EBCDIC CCSID 1026 then at least U+69 will be encoded as x89 and U+131 (AKA dotless i) would be in x79 (which are still 1 byte encodings) even with "UTF"
You certainly can have UTF-8 files on an EBCDIC system. There's nothing magic about an EBCDIC system that stops you storing any binary on it. I think in UTF mode on an EBCDIC system, PCRE2 appears to just do... use normal UTF.
I have:
(?t)
inline option, and updated the documentation(*TURKISH_CASING)
There is one test case that's failing for the JIT, which I'll have a go fixing in the morning tomorrow.
Otherwise, I think this is basically good to go.
It looks like my incorrect suggestion was causing the failure:
--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@@ -9525,12 +9525,13 @@ if ((common->utf || common->ucp) && (*cc == OP_REFI || *cc == OP_DNREFI))
OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, TMP3, 0);
CMPTO(SLJIT_EQUAL, TMP1, 0, char1_reg, 0, loop);
- if (refi_flag & REFI_FLAG_CASELESS_RESTRICT)
- add_jump(compiler, &no_match, CMP(SLJIT_LESS, char1_reg, 0, SLJIT_IMM, 128));
add_jump(compiler, &no_match, CMP(SLJIT_EQUAL, TMP2, 0, SLJIT_IMM, 0));
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 2);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, (sljit_sw)PRIV(ucd_caseless_sets));
+ if (refi_flag & REFI_FLAG_CASELESS_RESTRICT)
+ add_jump(compiler, &no_match, CMP(SLJIT_LESS | SLJIT_32, SLJIT_MEM1(TMP2), 0, SLJIT_IMM, 128));
+
caseless_loop = LABEL();
OP1(SLJIT_MOV_U32, TMP1, 0, SLJIT_MEM1(TMP2), 0);
OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, sizeof(uint32_t));
The first character of the caseset needs to be checked for ascii, not the current character. The test was good to reveal it! Thank you!
I have updated the CASELESS_RESTRICT and TURKISH_CASING behaviour, as requested.
Finally, these checks are not triggered if CASELESS_RESTRICT is set mid-pattern. In this case, if both CASELESS_RESTRICT and TURKISH_CASING are set, then Turkish is totally deactivated as long as CASELESS_RESTRICT is in force.
I have updated the docs; but I need to update the tests to match now...
Tests updated. Surely we're done now...? I hope!
Surely we're done now...?
I think the extra restrictions put up on (?r) were unnecessarily harsh so have them prtially reverted in a couple of followup commits in my fork of this so it matches Perl better (details in the commit messages themselves).
we are missing documentation on the interaction between (?:r)
and PCRE2_EXTRA_TURKISH_CASING
, but the only place were such documentation could be added in pcre2pattern.3
is already convoluted enough without my help so I gave up on a patch.
rebasing the whole thing on top of HEAD and squashing all commits with a nice descriptive commit message also missing.
OK, thanks Carlo.
I don't really see that we need to support CASELESS_RESTRICT
when neither UTF nor UCP is set - it does absolutely nothing in this case - but if you want it, fine.
I've accepted your two commits, and rebased & squashed.
This patch looks good, maybe we should land it, and discuss / fix the remaining issues later.
I don't reallyI don't really see that we need to support CASELESS_RESTRICT when neither UTF nor UCP is set - it does absolutely nothing in this case - but if you want it, fine.
I think the use case is when the application sets CASELESS_RESTRICT and the creator of the pattern uses (UTF) or (UCP).
I will do the merge and have a play.
Does anybody understand why the scorecards action is now failing?
Online services are not available. It says:
429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. https://github.com/PCRE2Project/pcre2/actions/runs/11331094692/job/31510350495
Ignore this message.
I will ignore the message, but we now have a red cross on the Actions tab. I am not sure who set up this action, or exactly what it does.
It will probably work again in the next merge, it was just rate limited this time.
This code adds many specializations, and I am still thinking how could we optimize it.
If I understand correctly, the Turkish casing removes the
i
==I
caseless rule, that is what it makes it incompatible with other character sets. It also adds two new rules: 'i' ==U+0130
andU+0131
==I
. Is this correct?I
is 0x49 andi
is 0x69.Their choices for constants does makes it hard. Maybe some macros could help.