Open alexdowad opened 1 month ago
Nice! Can you fix it or need my help? In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.
In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.
ucp
shouldn't be relevant for this IMHO (if this test was indeed 8-bit) and is probably a red herring (indeed the expression seems "wrong", was it fuzzer generated?).
Latin-1 does have case equivalence between \xc1
and \xe1
among other characters that are likely also affected, and the UCP versions with the same codepoints do as well.
FWIW, when I meant "wrong" it is because the syntax used is "ambiguous", humans will likely avoid them by using the right escaping to differentiate between bytes and characters as shown by:
Perl v5.38.2
re> /[a\xc1]/i
data> \xc1
0: \xc1
data> \xe1
No match
data>
re> /[a\x{c1}]/iu
data> a
0: a
data> \x{c1}
0: \xc1
data> \x{e1}
0: \xe1
This points to possibly two bugs, as well.
Locale: en_US.ISO8859-1
Perl v5.38.2
re> /[a\xc1]/i
data> \xc1
0: \xc1
data> \xe1
0: \xe1
Nice! Can you fix it or need my help? In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.
Dear Zoltan, if I can find some time in the next couple of days I will give it a try.
ucp
shouldn't be relevant for this IMHO and is probably red herring (indeed the expression seems "wrong", was it fuzzer generated?).
No, this one was not fuzzer-generated.
ucp
shouldn't be relevant for this IMHO and is probably red herring
Hmm, it appears that when ucp
is not used, the behavior before and after #474 is the same. But with ucp
, it is different. That's what I have found thus far, but if I discover anything more, I will post it here.
I apologize for not being able to communicate clearly, but the point I was making is that "UCP" doesn't make sense in the 8 bit library and therefore this is not a regression but a fix (unless the issue was triggered with 8 bit characters, which it is not).
the problem really is that the syntax for escaping characters is just ambiguous. Perl assumes that you meant a wide char (it does warn though sometimes), PCRE2 didn't knew what to do (because UTF mode is not enabled) and assumed it was a byte, but mistakenly assigned it the SAME character attributes that the wide character with the same ord has, and that it knew because of UCP.
\xc1 case equivalence with \xe1 is ONLY valid in the 8-bit library when using LATIN-1. To make things more confusing in this specific case \x{c1} and \x{e1} are ALSO case equivalent with UTF-8, but of course the byte encoding that the engine sees is different than the other pair of bytes.
00C1;LATIN CAPITAL LETTER A WITH ACUTE;Lu;0;L;0041 0301;;;;N;LATIN CAPITAL LETTER A ACUTE;;;00E1;
Am I understand correctly, that when the plain 8 bit library is used, casing should happen according to tables instead of ucp regardless of ucp? Unfortunately I am still confused.
casing should happen according to tables instead of ucp regardless of ucp?
no, the option of using ucp or tables should be respected if it was requested.
but ucp should be used for any ordinals over 127 only IF utf is enabled, because then we know that we are not looking at a single byte.
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
re> /\xc1/BIi,ucp,utf
------------------------------------------------------------------
Bra
/i \x{c1}
Ket
End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp utf
Starting code units: \xc3
Subject length lower bound = 1
Note that in that case, even with the "wrong" syntax for the escape it was clear to PCRE2 that we are talking about a character and the starting code units optimization reflects the correct byte that it would use for both caseless characters.
it is a little more painful, because Perl does allow random bytes even in UTF mode and allows matching them unlike PCRE2, resolving the ambiguity by using the "/u" modifier that does both UTF and UCP, and doing on the fly re-encoding of strings into UTF if needed:
% ./perltest.sh -w -utf8
Perl v5.38.2
re> /\xc1/i
data> \xc1
0: \xc1
data> \xe1
No match
After looking at Perl behaviour further, I agree this is a bug, not a feature, eventhough the behaviour we might had been copying from is terribly inconsistent but sadly also too ingrained to be corrected without serious backward compatibility issues.
Carlo, thank you for studying this issue. Personally, I still need to understand it better.
Since your test should reflect compatible behaviour with Perl it should be in testinput4, we should also try to add probably more tests to avoid further regressions IMHO.
I've bisected this one and found that behavior was changed by #474:
Before #474,
\xe1
would be included in the starting code units when\xc1
was in a leading char class, and caseless mode was enabled. Now it is not.Further, such a char class no longer matches
\xe1
, whereas previously it did. This is not just the starting code point optimization causing early failure of the match... see the above test case withno_start_optimize
.For reference, these are the results on the commit just before #474:
Attention please, @zherczeg.