PCRE2Project / pcre2

PCRE2 development is now based here.
Other
918 stars 191 forks source link

Bug in char class handling, in caseless mode, with UCP (caused by PR 474) #526

Open alexdowad opened 1 month ago

alexdowad commented 1 month ago

I've bisected this one and found that behavior was changed by #474:

/[a\x{c1}]/iI,ucp
------------------------------------------------------------------
  0  36 Bra
  3     [Aa\xc1]
 36  36 Ket
 39     End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp
Starting code units: A a \xc1
Subject length lower bound = 1
    \x{e1}
No match

/[a\x{c1}]/iI,ucp,no_start_optimize
------------------------------------------------------------------
  0  36 Bra
  3     [Aa\xc1]
 36  36 Ket
 39     End
------------------------------------------------------------------
Capture group count = 0
Options: caseless no_start_optimize ucp
Optimizations: auto_possess,dotstar_anchor
    \x{e1}
No match

Before #474, \xe1 would be included in the starting code units when \xc1 was in a leading char class, and caseless mode was enabled. Now it is not.

Further, such a char class no longer matches \xe1, whereas previously it did. This is not just the starting code point optimization causing early failure of the match... see the above test case with no_start_optimize.

For reference, these are the results on the commit just before #474:

/[a\x{c1}]/iI,ucp
------------------------------------------------------------------
  0  36 Bra
  3     [Aa\xc1\xe1]
 36  36 Ket
 39     End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp
Starting code units: A a \xc1 \xe1
Subject length lower bound = 1
    \x{e1}
 0: \xe1

/[a\x{c1}]/iI,ucp,no_start_optimize
------------------------------------------------------------------
  0  36 Bra
  3     [Aa\xc1\xe1]
 36  36 Ket
 39     End
------------------------------------------------------------------
Capture group count = 0
Options: caseless no_start_optimize ucp
Optimizations: auto_possess,dotstar_anchor
    \x{e1}
 0: \xe1

Attention please, @zherczeg.

zherczeg commented 1 month ago

Nice! Can you fix it or need my help? In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.

carenas commented 1 month ago

In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.

ucp shouldn't be relevant for this IMHO (if this test was indeed 8-bit) and is probably a red herring (indeed the expression seems "wrong", was it fuzzer generated?).

Latin-1 does have case equivalence between \xc1 and \xe1 among other characters that are likely also affected, and the UCP versions with the same codepoints do as well.

carenas commented 1 month ago

FWIW, when I meant "wrong" it is because the syntax used is "ambiguous", humans will likely avoid them by using the right escaping to differentiate between bytes and characters as shown by:

Perl v5.38.2
  re> /[a\xc1]/i
data> \xc1
 0: \xc1
data> \xe1
No match
data> 
  re> /[a\x{c1}]/iu
data> a
 0: a
data> \x{c1}
 0: \xc1
data> \x{e1}
 0: \xe1

This points to possibly two bugs, as well.

Locale: en_US.ISO8859-1
Perl v5.38.2
  re> /[a\xc1]/i
data> \xc1
 0: \xc1
data> \xe1
 0: \xe1
alexdowad commented 1 month ago

Nice! Can you fix it or need my help? In 8 bit mode we don't run range computation, it looks like it is needed if ucp is set.

Dear Zoltan, if I can find some time in the next couple of days I will give it a try.

alexdowad commented 1 month ago

ucp shouldn't be relevant for this IMHO and is probably red herring (indeed the expression seems "wrong", was it fuzzer generated?).

No, this one was not fuzzer-generated.

alexdowad commented 1 month ago

ucp shouldn't be relevant for this IMHO and is probably red herring

Hmm, it appears that when ucp is not used, the behavior before and after #474 is the same. But with ucp, it is different. That's what I have found thus far, but if I discover anything more, I will post it here.

carenas commented 1 month ago

I apologize for not being able to communicate clearly, but the point I was making is that "UCP" doesn't make sense in the 8 bit library and therefore this is not a regression but a fix (unless the issue was triggered with 8 bit characters, which it is not).

the problem really is that the syntax for escaping characters is just ambiguous. Perl assumes that you meant a wide char (it does warn though sometimes), PCRE2 didn't knew what to do (because UTF mode is not enabled) and assumed it was a byte, but mistakenly assigned it the SAME character attributes that the wide character with the same ord has, and that it knew because of UCP.

\xc1 case equivalence with \xe1 is ONLY valid in the 8-bit library when using LATIN-1. To make things more confusing in this specific case \x{c1} and \x{e1} are ALSO case equivalent with UTF-8, but of course the byte encoding that the engine sees is different than the other pair of bytes.

zherczeg commented 1 month ago
00C1;LATIN CAPITAL LETTER A WITH ACUTE;Lu;0;L;0041 0301;;;;N;LATIN CAPITAL LETTER A ACUTE;;;00E1;

Am I understand correctly, that when the plain 8 bit library is used, casing should happen according to tables instead of ucp regardless of ucp? Unfortunately I am still confused.

carenas commented 1 month ago

casing should happen according to tables instead of ucp regardless of ucp?

no, the option of using ucp or tables should be respected if it was requested.

but ucp should be used for any ordinals over 127 only IF utf is enabled, because then we know that we are not looking at a single byte.

PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /\xc1/BIi,ucp,utf
------------------------------------------------------------------
        Bra
     /i \x{c1}
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: caseless ucp utf
Starting code units: \xc3
Subject length lower bound = 1

Note that in that case, even with the "wrong" syntax for the escape it was clear to PCRE2 that we are talking about a character and the starting code units optimization reflects the correct byte that it would use for both caseless characters.

it is a little more painful, because Perl does allow random bytes even in UTF mode and allows matching them unlike PCRE2, resolving the ambiguity by using the "/u" modifier that does both UTF and UCP, and doing on the fly re-encoding of strings into UTF if needed:

% ./perltest.sh -w -utf8
Perl v5.38.2
  re> /\xc1/i
data> \xc1
 0: \xc1
data> \xe1
No match
carenas commented 1 month ago

After looking at Perl behaviour further, I agree this is a bug, not a feature, eventhough the behaviour we might had been copying from is terribly inconsistent but sadly also too ingrained to be corrected without serious backward compatibility issues.

alexdowad commented 1 month ago

Carlo, thank you for studying this issue. Personally, I still need to understand it better.

carenas commented 1 month ago

Since your test should reflect compatible behaviour with Perl it should be in testinput4, we should also try to add probably more tests to avoid further regressions IMHO.