pcre2test: tighten \x{...} parsing in subject

PCRE2Project / pcre2


pcre2test: tighten \x{...} parsing in subject #504

Closed carenas closed 1 month ago

carenas commented 1 month ago

Address an oddity I found after accidentally typing \x{100 (without the closing brace) in pcre2test, which resulted in an unexpected match and in results that diverged from perltest.

Additionally fix the handling of overlong numbers as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /\D/
data> \x{1234567890}
** Too many hex digits in \x{...} item; using only the first eight.
** Character \x{23456780} is greater than 255 and UTF-8 mode is not enabled.
** Truncation will probably give the wrong result.
 0: \x80
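To make the two failure modes concrete (a missing closing brace and too many hex digits), here is a minimal Python sketch of one possible strict policy for `\x{...}` items. It is not the pcre2test code and not necessarily what this PR implements: `parse_braced_hex`, `MAX_HEX_DIGITS`, and the decision to reject malformed input outright are all assumptions made for illustration.

```python
import string
from typing import Tuple

HEX_DIGITS = set(string.hexdigits)
MAX_HEX_DIGITS = 8  # the limit quoted in the pcre2test warning above


def parse_braced_hex(subject: str, pos: int) -> Tuple[int, int]:
    """Parse a \\x{...} item that starts at subject[pos].

    Returns (value, index_after_item). Raises ValueError for malformed
    input instead of guessing (hypothetical policy, not pcre2test's).
    """
    if not subject.startswith("\\x{", pos):
        raise ValueError("expected \\x{ at position %d" % pos)
    start = pos + 3
    end = subject.find("}", start)
    if end == -1:
        # e.g. the typo \x{100 with no closing brace
        raise ValueError("missing closing brace in \\x{...} item")
    digits = subject[start:end]
    if not digits:
        raise ValueError("no hex digits in \\x{...} item")
    if any(ch not in HEX_DIGITS for ch in digits):
        raise ValueError("non-hex character in \\x{...} item")
    if len(digits) > MAX_HEX_DIGITS:
        # e.g. \x{1234567890}
        raise ValueError("too many hex digits in \\x{...} item")
    return int(digits, 16), end + 1
```

With this sketch, `parse_braced_hex(r"\x{100}", 0)` returns `(256, 7)`, while both `r"\x{100"` and `r"\x{1234567890}"` raise `ValueError`. As a side observation about the numbers only (not a claim about the code): `(0x12345678 << 4) & 0xFFFFFFFF` is `0x23456780`, whose low byte is `0x80`, which is at least consistent with the values printed in the transcript.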

zherczeg commented 1 month ago

I think the test wants to convert the utf8 representation to \x{100} as a 16 bit value. Since this is a pcre2test change, it should be harmless.

carenas commented 1 month ago

> I think the test wants to convert the utf8 representation to \x{100} as a 16 bit value

That is another oddity of the test, even more so if you consider that it ALSO hardcodes UTF-8 for the non-8-bit libraries, which have a clone of it in testinput12, where it makes even less sense.

Agree though that it is harmless, but should we keep it?

carenas commented 1 month ago

The test was introduced in PCRE 4.0, and the bug was actually:

PCRE version 3.9 02-Jan-2002

  re> /\x{100}{3,4}/8SD
------------------------------------------------------------------
  0  14 Bra 0
  3   1 \xc4
  6     \x80{3}
 10     \x80{,1}
 14  14 Ket
 17     End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 196
Need char = 128
Study returned NULL

which could have been simplified to /\x{100}?/ and had a typo that wasn't even relevant.
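For anyone reading the dump cold: U+0100 encodes in UTF-8 as the two bytes C4 80, and the compiled form above applies the quantifier to the trailing \x80 only. The short Python sketch below reproduces that reading with the standard `re` module on bytes. The names `buggy` and `correct` are made up here, and the two patterns are only my transcription of the dump and of the intended meaning, not anything taken from the PCRE sources.

```python
import re

# UTF-8 encoding of U+0100: two bytes, 0xC4 0x80.
encoded = chr(0x100).encode("utf-8")
first_byte, last_byte = encoded  # 196 and 128, as in "First char" / "Need char"

# Byte-level transcription of the PCRE 3.9 dump: \xc4, \x80{3}, \x80{,1}.
buggy = re.compile(rb"\xc4\x80{3}\x80?")

# What /\x{100}{3,4}/ should mean: the whole two-byte character repeated.
correct = re.compile(rb"(?:\xc4\x80){3,4}")

subject = encoded * 3  # three copies of U+0100

print(buggy.fullmatch(subject))                 # None: three characters are rejected
print(buggy.fullmatch(b"\xc4\x80\x80\x80"))     # matches an invalid byte sequence
print(correct.fullmatch(subject) is not None)   # True
```

The same effect shows up with the simpler /\x{100}?/: if the quantifier binds to the last byte alone, the \xc4 lead byte becomes mandatory and only the \x80 is optional.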