Case-insensitive matching of characters above 255 in ranges in character classes doesn't work

p5pRT commented 14 years ago

Migrated from rt.perl.org#71752 (status was 'resolved')

Searchable as RT71752$

p5pRT commented 14 years ago

From @khwilliamson

This is a bug report for perl from khw@khw-desktop.nonet, generated with the help of perlbug 1.39 running under perl 5.11.3.

perl -E 'say chr(0x0430) =~ /[=\x{0410}]/i'
1

perl -E 'say chr(0x0430) =~ /[=\x{0410}-\x{0411}]/i'

That is, it didn't match in the second case. (The equal sign is to keep the class from being optimized out.) If the code point is in a range in a character class, case-insensitive matching doesn't work on it.

Flags: category=core severity=medium

Site configuration information for perl 5.11.3:

Configured by khw at Tue Dec 29 12:45:43 MST 2009.

Summary of my perl5 (revision 5 version 11 subversion 3) configuration:
  Commit id: 9f815e241cf04d04fc645970753438216a0ed024
  Platform:
    osname=linux, osvers=2.6.27-16-generic, archname=i686-linux
    uname='linux khw-desktop 2.6.27-16-generic #1 smp tue dec 1 17:56:54 utc 2009 i686 gnulinux '
    config_args='-s -d -Dprefix=/home/khw/fastbleadperl -Dusedevel'
    hint=recommended, useposix=true, d_sigaction=define
    useithreads=undef, usemultiplicity=undef
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=undef, use64bitall=undef, uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O2',
    cppflags='-fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include'
    ccversion='', gccversion='4.3.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -fstack-protector -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=/lib/libc-2.8.90.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.8.90'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fPIC', lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'

Locally applied patches:

@INC for perl 5.11.3:
    /home/khw/fastbleadperl/lib/site_perl/5.11.3/i686-linux
    /home/khw/fastbleadperl/lib/site_perl/5.11.3
    /home/khw/fastbleadperl/lib/5.11.3/i686-linux
    /home/khw/fastbleadperl/lib/5.11.3
    .

Environment for perl 5.11.3:
    HOME=/home/khw
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/khw/bin:/home/khw/print/bin:/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/usr/games:/opt/real/RealPlayer:/home/khw/cxoffice/bin
    PERL_BADLANG (unset)
    SHELL=/bin/ksh

p5pRT commented 13 years ago

From @cpansprout

(Replying to the message at nntp://nntp.perl.org/4CD5EBE5.7080904@khwilliamson.com that never made its way to RT because the # was omitted from the subject:)

As noted in the comments of the code, "a" =~ /[A]/i doesn't work currently (except that regcomp.c knows about the ASCII characters and corrects for it, but not always, for example in cases like "a" =~ /\p{Upper}/i). This patch catches all those, plus the ones above Latin1. A little more work in regcomp is required to fix the Latin1 ones.

This is probably several years too late, but should /i really be affecting \p?

If I’m not mistaken, it does not do so properly now anyway. Would most people expect /http:\p{ASCII}+/i to match a Kelvin sign?

p5pRT commented 13 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 13 years ago

From @khwilliamson

Father Chrysostomos via RT wrote:

(Replying to the message at nntp://nntp.perl.org/4CD5EBE5.7080904@khwilliamson.com that never made its way to RT because the # was omitted from the subject:)

As noted in the comments of the code, "a" =~ /[A]/i doesn't work currently (except that regcomp.c knows about the ASCII characters and corrects for it, but not always, for example in cases like "a" =~ /\p{Upper}/i). This patch catches all those, plus the ones above Latin1. A little more work in regcomp is required to fix the Latin1 ones.

This is probably several years too late, but should /i really be affecting \p?

If I’m not mistaken, it does not do so properly now anyway. Would most people expect /http:\p{ASCII}+/i to match a Kelvin sign?

The problem is that it's not clear what should happen, and in places, perl already implements it the way the patch extends:

perl -E 'say "a" =~ /[[:upper:]]/i'
1

I posed a similar question to the Unicode folks a while ago, and got no response except a request that I come up with a proposal to submit to them. I take that to mean that they haven't thought about this before. Someone there recently suggested posing it to the ICU folks, which I have now done. The complete text of my message to them is at the end.

A problem is that Perl is inconsistent, as bug #69166 points out. http://rt.perl.org/rt3/Public/Bug/Display.html?id=69166

The author thinks it is reasonable for properties that mention case to not be affected by /i. However, this was just an accident of implementation, as the example above shows. The bug in that ticket is that \P and \p behave inconsistently.

I believe that Tom has recently argued that the direction the patch goes is the correct thing.

I had not considered the idea of not applying /i to \p or \P. That's an interesting idea. It would take more work to implement than could be done in 5.14, and it is inconsistent with the current implementation.

I think that the correct answer is to not apply /i matching to some properties. That is again beyond the 5.14 realm of possibilities, and needs Unicode guidance, I believe.

Anyway here is what I sent Unicode:

It would be good if TR18 were enhanced with more discussion of case insensitive matching. Chapter 3 of the standard defines the Default Caseless Matching algorithm, but it applies only to two strings, and extending it to apply to patterns is not trivial, and is totally unspecified, as far as I have seen.

In particular, the use of a property in a regular expression pattern with caseless matching introduces a number of issues that I don't believe are addressed anywhere in the standard.

For example, should

'N' =~ /\p{Gc=Lowercase_Letter}/i

match, and should

'n' =~ /\p{Gc=Uppercase_Letter}/i

match?

I thought the answer was true for both of these, but then, what about

"\N{MICRO SIGN}" =~ /\p{Block=Greek}/i
"\N{MICRO SIGN}" =~ /\p{Script=Greek}/i

because the fold of MICRO SIGN is in the Greek block and script? It doesn't seem right to me that a character should match a different script than the one it's in under caseless matching. Similarly, there are a number of characters whose fold has a different Age, Soft_Dotted, East_Asian_Width, Math, Decomposition_Type, Line_Break, or Full_Composition_Exclusion property value, besides the ones I would expect, like Changes_When_Case_Folded, and General_Category. The YPOGEGRAMMENI, as always, introduces even more.
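The MICRO SIGN case can be checked directly. The following is only a sketch, and it assumes perl 5.16 or later for the fc function (which did not exist when this ticket was filed); the variable names are illustrative. It compares the Script property of the character with that of its case fold, without involving /i at all.

```perl
use v5.16;    # enables say and fc; assumes perl 5.16 or later

my $micro  = "\N{MICRO SIGN}";    # U+00B5
my $folded = fc($micro);          # full Unicode case fold of the character

printf "U+%04X folds to U+%04X\n", ord($micro), ord($folded);

# Test each character against the Script property, with no /i involved.
say "original in Script=Greek: ", ($micro  =~ /\p{Script=Greek}/ ? "yes" : "no");
say "fold in Script=Greek:     ", ($folded =~ /\p{Script=Greek}/ ? "yes" : "no");
```

The character and its fold give different answers for the same property, which is the situation the question above is about.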

So perhaps caseless matching shouldn't apply to some properties? If so, which ones should be spelled out. Certainly, some properties should have caseless matching rules. For example, I believe,

"A" =~ /\p{Name=Latin Small Letter A}/i

should match. Here's another example where allowing the property to match any case can lead to problems.

"\N{LATIN SMALL LIGATURE FF}" =~ /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i

The pattern seems to indicate that only ASCII digits are desired; yet it could match something non-ASCII, potentially leading to a spoofing attack.

TR18 is also silent on another issue I've brought up before, and gotten no response to. A number of languages, including ICU I believe, allow for regular expression capture buffers. These allow for saving some portion(s) of the original string that matched some sub-part of the pattern. But when you convert the string into something else for matching, such as normalizing it, and then match against that, and you have capture buffers, those buffers should return not some portion of the converted string, but the corresponding portion of the original, which you may not be able to get back to. This can happen even without normalization if the string folds to more than one character:

"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i

should match, as should

"\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i

Hence, so should

"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i

But, the parentheses mean capture buffers, and there is no 1-to-1 correspondence between either of these buffers and any atomic part of the string. I don't know what should happen here, and I think TR18 should address this.

So how do I go about getting someone or someones thinking about these issues to add to TR18?
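The multi-character fold behind the ligature examples above can be seen in isolation. This is a minimal sketch, again assuming perl 5.16 or later for fc; it shows only the fold itself and makes no claim about what any particular regex engine does with capture groups.

```perl
use v5.16;    # enables say and fc; assumes perl 5.16 or later

my $lig    = "\N{LATIN SMALL LIGATURE FI}";    # U+FB01, one code point
my $folded = fc($lig);                         # full case fold

say "original length: ", length($lig);
say "folded length:   ", length($folded);
say "fold is 'fi':    ", ($folded eq "fi" ? "yes" : "no");
```

Because one code point of the original string corresponds to two characters after folding, a group that captures only the f or only the i has no whole character of the original to return.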

p5pRT commented 13 years ago

From @abigail

On Sun, Nov 07, 2010 at 03:20:13PM -0700, karl williamson wrote:

Father Chrysostomos via RT wrote:

(Replying to the message at nntp://nntp.perl.org/4CD5EBE5.7080904@khwilliamson.com that never made its way to RT because the # was omitted from the subject:)

As noted in the comments of the code, "a" =~ /[A]/i doesn't work currently (except that regcomp.c knows about the ASCII characters and corrects for it, but not always, for example in cases like "a" =~ /\p{Upper}/i). This patch catches all those, plus the ones above Latin1. A little more work in regcomp is required to fix the Latin1 ones.

This is probably several years too late, but should /i really be affecting \p?

IMO, yes. I've always seen \p{Foo} as just a handy way of writing [...], with all the characters having the property 'Foo' listed inside the brackets.

If I’m not mistaken, it does not do so properly now anyway. Would most people expect /http:\p{ASCII}+/i to match a Kelvin sign?

I would not. But then, if you replace \p{ASCII} with [\x00-\x7F], I would not expect it to match a Kelvin sign either.

The problem is that it's not clear what should happen, and in places, perl already implements it the way the patch extends:

perl -E 'say "a" =~ /[[:upper:]]/i'
1

Actually, that's exactly what I would expect (and always have expected) to happen. I would consider [[:upper:]] just a different way of writing [A-Z]. And I do expect

"a" =~ /[A-Z]/i

to match.

I think it would be highly confusing if you have something like:

my $re = somepattern ();

and have

$str =~ /$re/;

to be true, but

lc ($str) =~ /$re/i;

to be false.

p5pRT commented 13 years ago

From @cpansprout

On Nov 7, 2010, at 2:20 PM, karl williamson wrote:

"\N{LATIN SMALL LIGATURE FF}" =~ /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i

The pattern seems to indicate that only ASCII digits are desired; yet it could match something non-ASCII, potentially leading to a spoofing attack.

For that reason, I don’t think it should match.

TR18 is also silent on another issue I've brought up before, and gotten no response to. A number of languages, including ICU I believe, allow for regular expression capture buffers. These allow for saving some portion(s) of the original string that matched some sub-part of the pattern. But when you convert the string into something else for matching, such as normalizing it, and then match against that, and you have capture buffers, those buffers should return not some portion of the converted string, but the corresponding portion of the original, which you may not be able to get back to. This can happen even without normalization if the string folds to more than one character:

"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i

should match, as should

"\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i

Or maybe not. See below.

Hence, so should

"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i

But, the parentheses mean capture buffers, and there is no 1-to-1 correspondence between either of these buffers and any atomic part of the string. I don't know what should happen here, and I think TR18 should address this.

That’s easy: parentheses get anchored to character boundaries, as should the beginning and end of the regular expression.

But then if "ﬁ" =~ /(f)(i)/i matches, then why not /(.){2}/i? If that matches as well, should it still match /^.\z/i?

Supporting multi-char folding would cause all sorts of surprises, so I would suggest we put it under a new flag (/f for foldcase or funny mode) or, better yet, in a separate module. One would have to normalise the string, such that /^.\z/i does not match, if things are to be at all consistent (and, hence, usable).

Allowing "ß" =~ /ss/i to match was a mistake, if you ask me, but not one that can be undone. But we can avoid propagating it further. So the solution is to leave things as they are.

BTW, ECMAScript explicitly disallows multi-char folding, which seems like the cleanest solution.

p5pRT commented 13 years ago

From @cpansprout

Fixed by 2726813d9.

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'resolved'

p5pRT commented 13 years ago

From @khwilliamson

demerphq wrote:

On 7 November 2010 23:20, karl williamson <public@khwilliamson.com> wrote:

TR18 is also silent on another issue I've brought up before, and gotten no response to. A number of languages, including ICU I believe, allow for regular expression capture buffers. These allow for saving some portion(s) of the original string that matched some sub-part of the pattern. But when you convert the string into something else for matching, such as normalizing it, and then match against that, and you have capture buffers, those buffers should return not some portion of the converted string, but the corresponding portion of the original, which you may not be able to get back to.

It seems to me the problem is not so much that you can't get back to it, but rather that it is inefficient and inconvenient to do so.

cheers, Yves

I was speaking without the restriction that capturing anchors to character boundaries. Given that restriction, it may be possible to get back to it.

The problem here is that we have the Heisenberg uncertainty principle. Someone writes a regex that matches when they think it shouldn't. They then add capturing parentheses to find out where it's matching, and it stops matching. Then they write a bug report.

p5pRT commented 13 years ago

From @khwilliamson

Father Chrysostomos wrote:

On Nov 7, 2010, at 2:20 PM, karl williamson wrote:

"\N{LATIN SMALL LIGATURE FF}" =~ /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i

The pattern seems to indicate that only ASCII digits are desired; yet it could match something non-ASCII, potentially leading to a spoofing attack.

For that reason, I don’t think it should match.

I was thinking of withdrawing my patch, thinking it was making a security problem worse, but it's already bad, so I'm not making it noticeably worse. /\p{ASCII}/i already matches non-ASCII in 5.12, and likely earlier.

TR18 is also silent on another issue I've brought up before, and gotten no response to. A number of languages, including ICU I believe, allow for regular expression capture buffers. These allow for saving some portion(s) of the original string that matched some sub-part of the pattern. But when you convert the string into something else for matching, such as normalizing it, and then match against that, and you have capture buffers, those buffers should return not some portion of the converted string, but the corresponding portion of the original, which you may not be able to get back to. This can happen even without normalization if the string folds to more than one character:

"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i

should match, as should

"\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i

Or maybe not. See below.

Hence, so should

"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i

But, the parentheses mean capture buffers, and there is no 1-to-1 correspondence between either of these buffers and any atomic part of the string. I don't know what should happen here, and I think TR18 should address this.

That’s easy: parentheses get anchored to character boundaries, as should the beginning and end of the regular expression.

This position has garnered some support, but as I responded in another post, it creates Heisenberg problems, things that look like bugs, where something matches until you try to find what.

But then if "ﬁ" =~ /(f)(i)/i matches, then why not /(.){2}/i? If that matches as well, should it still match /^.\z/i?

Supporting multi-char folding would cause all sorts of surprises, so I would suggest we put it under a new flag (/f for foldcase or funny mode) or, better yet, in a separate module. One would have to normalise the string, such that /^.\z/i does not match, if things are to be at all consistent (and, hence, usable).

Allowing "ß" =~ /ss/i to match was a mistake, if you ask me, but not one that can be undone. But we can avoid propagating it further. So the solution is to leave things as they are.

I don't understand this. Perl already matches all the multi-fold characters sometimes. But it's very inconsistent as to what gets matched when.

I'd love to withdraw our (buggy) support of multi-char folds, but it's not in the cards, I believe. Jan suggested a different module a while ago. I don't know how to implement it fully, without major revisions. As I have said, the code assumes nodes correspond to character boundaries, so "ß" =~ /[s_][s_]/i doesn't match, and I can't think how to make it so.

There is also an issue, though solvable, with the /f or separate module. Unicode furnishes separate mappings in some cases for applications that don't want to use multi-char folds. Perl would have to then know about two sets of mappings.

The response I got from ICU about matching /\p{}/i and multi-char folds did not address the question, and instead asked for suggestions from me. Again, that indicates they haven't thought through these issues.

BTW, ECMAScript explicitly disallows multi-char folding, which seems like the cleanest solution.

p5pRT commented 13 years ago

From @cpansprout

On Nov 12, 2010, at 12:56 PM, karl williamson wrote:

Father Chrysostomos wrote:

That’s easy: parentheses get anchored to character boundaries, as should the beginning and end of the regular expression.

This position has garnered some support, but as I responded in another post, it creates Heisenberg problems, things that look like bugs, where something matches until you try to find what.

But it would only happen under /f, right? In that case they are asking for it explicitly.

If the regexp engine (or a pluggable engine) is made capable of anchoring nodes to character boundaries at times and not at others, then there is no reason we should not allow users to specify that explicitly with an \e (edge) escape.

That would also make documentation easier: Just state (in the section for the /f flag or re::engine::foldcase) that capturing parentheses imply \e.

But then if "ﬁ" =~ /(f)(i)/i matches, then why not /(.){2}/i? If that matches as well, should it still match /^.\z/i?

Supporting multi-char folding would cause all sorts of surprises, so I would suggest we put it under a new flag (/f for foldcase or funny mode) or, better yet, in a separate module. One would have to normalise the string, such that /^.\z/i does not match, if things are to be at all consistent (and, hence, usable).

Allowing "ß" =~ /ss/i to match was a mistake, if you ask me, but not one that can be undone. But we can avoid propagating it further. So the solution is to leave things as they are.

I don't understand this. Perl already matches all the multi-fold characters sometimes. But it's very inconsistent as to what gets matched when.

I'd love to withdraw our (buggy) support of multi-char folds, but it's not in the cards, I believe.

What I was trying to say is that we should either support multi-char folding properly or not at all. Now, as you state, that is not a choice. But I was trying to provide a compromise. If we support it fully, then "ﬁ" =~ /^.\z/ will stop matching, which was why I suggested a separate flag (or module). In that case, the normal regexp engine would have to stay as it is. We can simply document the two cases affected:

character class matches multiple characters: "ss" =~ /^[ßγ]\z/i
literal strings in the pattern: "ssß" =~ /ßss/i

But then we have the problem of [ı] -> ı optimisation....

Jan suggested a different module a while ago. I don't know how to implement it fully, without major revisions. As I have said, the code assumes nodes correspond to character boundaries, so "ß" =~ /[s_][s_]/i doesn't match, and I can't think how to make it so.

There is also an issue, though solvable, with the /f or separate module. Unicode furnishes separate mappings in some cases for applications that don't want to use multi-char folds. Perl would have to then know about two sets of mappings.

Unless the foldcase mode is a pluggable engine. (I rather like the /f idea, though, even though I disliked it at first and was simply trying to come up with a solution.)

The response I got from ICU about matching /\p{}/i

What Abigail said about \p{} has convinced me. It’s just shorthand for [lots of chars].

and multi-char folds did not address the question, and instead asked for suggestions from me. Again, that indicates they haven't thought through these issues.

p5pRT commented 13 years ago

From tchrist@perl.com

I *really* keep wondering whether what people really want is for both their regex and the string it's used against to be (able to be) NFD'd or NFKD'd. Doesn't that fall under Level 2 Unicode compliance per UTS#18 2.1?

I know Java Patterns have an optional CANON_EQ compilation flag, which is considered costly. The beginning of their Pattern.compile() reads:

    /**
     * Copies regular expression to an int array and invokes the parsing
     * of the expression which will create the object tree.
     */
    private void compile() {
        // Handle canonical equivalences
        if (has(CANON_EQ) && !has(LITERAL)) {
            normalize();
        } else {
            normalizedPattern = pattern;
        }
        patternLength = normalizedPattern.length();

And Pattern.normalize() begins:

    /**
     * The pattern is converted to normalizedD form and then a pure group
     * is constructed to match canonical equivalences of the characters.
     */
    private void normalize() {
        boolean inCharClass = false;
        int lastCodePoint = -1;

        // Convert pattern into normalizedD form
        normalizedPattern = Normalizer.normalize(pattern, Normalizer.Form.NFD);
        patternLength = normalizedPattern.length();

        // Modify pattern to match canonical equivalences
        StringBuilder newPattern = new StringBuilder(patternLength);
        for(int i=0; i<patternLength; ) {
            int c = normalizedPattern.codePointAt(i);
            StringBuilder sequenceBuffer;
            if ((Character.getType(c) == Character.NON_SPACING_MARK)

I certainly find myself running NFD() on things when they come in pretty often, although that's not good enough for the "ffi" situation; you need NFKD() for that. Notice that Java does only NFD here.
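The difference between the two normalization forms for the "ffi" ligature can be shown with the core Unicode::Normalize module. This is a small sketch (the variable names are illustrative, and use v5.16 is assumed only for say and for \N{} names without an explicit charnames import):

```perl
use v5.16;
use Unicode::Normalize qw(NFD NFKD);

my $lig = "\N{LATIN SMALL LIGATURE FFI}";    # U+FB03

# Canonical decomposition leaves the ligature alone, because its
# decomposition mapping is a compatibility mapping, not a canonical one.
my $nfd = NFD($lig);

# Compatibility decomposition expands it to the three letters f, f, i.
my $nfkd = NFKD($lig);

say "NFD  length: ", length($nfd);
say "NFKD length: ", length($nfkd);
say "NFKD is 'ffi': ", ($nfkd eq "ffi" ? "yes" : "no");
```

So a pattern containing plain "ffi" can only line up with the ligature after compatibility decomposition of the string, which is more than canonical equivalence asks for.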

Also, come match time, it does *not* check whether CANON_EQ has been set in the pattern to also apply the same normalization to the string. Seems to me that's something you would want to do if you were going that route.

The thing is, people would also like to be able to match things that fall into the style of POSIX equivalence classes, or Unicode Collate equivalences to one level or the other. It's really very unpleasant to do that with the module, and I do wish some way to make this work in the regex engine could be found.

To go with the pony. :(

--tom

p5pRT commented 13 years ago

From @khwilliamson

Tom Christiansen wrote:

I *really* keep wondering whether what people really want is for both their regex and the string it's used against to be (able to be) NFD'd or NFKD'd. Doesn't that fall under Level 2 Unicode compliance per UTS#18 2.1?

Certainly, it would be desirable for perl to do either of these things. But it is a major effort.

I'm still not convinced that causing capturing parentheses to anchor matches at character boundaries is the right solution. It does solve the partial match problem of not having something to return, but it does so by invoking Heisenberg, so that just adding parentheses could cause a match to fail that previously succeeded. That just doesn't seem right.

And, as I've said recently, the notion that a node can match a partial character is alien to the regex engine. Without the decomposition normalizations that problem is restricted to case insensitive matching of characters that have multi-character folds. With the D normalizations, the problem starts happening everywhere.

I'd love to be shown wrong about this all, and that if I were just more clever, things would work out fine, without doing a major rewrite. But even ICU, Unicode's flagship implementation, hasn't implemented even the NFD rule, and when I wrote them about some of these issues, I got back a response asking for suggestions from me.
