[[:xdigit:]] + case-insensitive regexes + multicharacter unicode codepoints breaks

p5pRT commented 15 years ago

Migrated from rt.perl.org#64838 (status was 'resolved')

Searchable as RT64838$

p5pRT commented 15 years ago

From jagerman@jagerman.com

Created by jagerman@jagerman.com

I ran into a strange problem where using [[:xdigit:]] in a case-insensitive regex matches Unicode characters with multi-character case mappings, such as "U+FB02 LATIN SMALL LIGATURE FL", when it shouldn't.

Test case:

    perl -Mutf8 -wle '
        my $fl = "ﬂ";
        print "Length: " . length($fl);
        print "Upper-case: " . uc($fl);
        print "Hex digit: " . ($fl =~ /^[[:xdigit:]]$/ ? "Yes" : "No");
        print "Case-insensitive hex digit: " . ($fl =~ /^[[:xdigit:]]$/i ? "Yes" : "No");'

This prints:

    Length: 1
    Upper-case: FL
    Hex digit: No
    Case-insensitive hex digit: Yes

The last output is clearly wrong.

Changing the [[:xdigit:]]s to [0-9a-fA-F] corrects the problem.
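For anyone hitting this on an affected perl, here is a minimal sketch of the workaround described above, wrapped in a helper. The sub name `is_ascii_hex_digit` is made up for illustration and does not come from the report; it simply uses the explicit `[0-9a-fA-F]` class in place of `[[:xdigit:]]`, and anchors with `\A`/`\z` rather than `^`/`$` so a trailing newline is not accepted.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the report):
# returns 1 only if the string is exactly one ASCII hex digit.
# The explicit class sidesteps [[:xdigit:]] under /i, which is the
# combination that misbehaves in the report above.
sub is_ascii_hex_digit {
    my ($str) = @_;
    return $str =~ /\A[0-9a-fA-F]\z/ ? 1 : 0;
}

# Print only code points, so no wide characters reach STDOUT.
for my $s ("a", "F", "g", "\x{FB02}") {
    printf "U+%04X: %s\n", ord($s), is_ascii_hex_digit($s) ? "Yes" : "No";
}
```

Because both letter cases are spelled out in the class, no `/i` flag is needed at all.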

Perl Info

```
Flags:
    category=core
    severity=low

Site configuration information for perl 5.10.0:

Configured by jagerman at Sat Jun 7 16:42:25 ADT 2008.

Summary of my perl5 (revision 5 version 10 subversion 0) configuration:
  Platform:
    osname=darwin, osvers=9.3.0, archname=darwin-thread-multi-ld-2level
    uname='darwin whitebat.local 9.3.0 darwin kernel version 9.3.0: fri may 23 00:49:16 pdt 2008; root:xnu-1228.5.18~1release_i386 i386 i386 macbook3,1 darwin '
    config_args='-d'
    hint=previous, useposix=true, d_sigaction=define
    useithreads=define, usemultiplicity=define
    useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef
    use64bitint=define, use64bitall=define, uselongdouble=define
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-arch x86_64 -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include',
    optimize='-O3',
    cppflags='-no-cpp-precomp -arch x86_64 -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include -arch x86_64 -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -pipe -I/usr/local/include'
    ccversion='', gccversion='4.0.1 (Apple Inc. build 5465)', gccosandvers=''
    intsize=4, longsize=8, ptrsize=8, doublesize=8, byteorder=12345678
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=8, nvtype='long double', nvsize=16, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.5 cc', ldflags ='-arch x86_64 -L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lutil -lc
    perllibs=-ldl -lm -lutil -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-arch x86_64 -bundle -undefined dynamic_lookup -L/usr/local/lib'

Locally applied patches:

@INC for perl 5.10.0:
    /opt/perl/perl-5.10.0-64bit/lib/5.10.0/darwin-thread-multi-ld-2level
    /opt/perl/perl-5.10.0-64bit/lib/5.10.0
    /opt/perl/perl-5.10.0-64bit/lib/site_perl/5.10.0/darwin-thread-multi-ld-2level
    /opt/perl/perl-5.10.0-64bit/lib/site_perl/5.10.0
    .

Environment for perl 5.10.0:
    DYLD_LIBRARY_PATH (unset)
    HOME=/Users/jagerman
    LANG=en_CA.UTF-8
    LANGUAGE (unset)
    LC_COLLATE=en_CA.UTF-8
    LC_CTYPE=en_CA.UTF-8
    LC_MESSAGES=en_CA.UTF-8
    LC_MONETARY=en_CA.UTF-8
    LC_NUMERIC=en_CA.UTF-8
    LC_TIME=en_CA.UTF-8
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/texbin:/Applications/Gimp.app/Contents/Resources/bin:/opt/perl/perl-5.10.0-64bit/bin:/usr/local/mysql/bin:~/bin:/opt/ghc/bin:/usr/local/bin:/sw/bin:/sw/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/X11/bin:/usr/X11R6/bin:/opt/local/bin:/opt/local/sbin
    PERL_BADLANG (unset)
    PERL_UNICODE=SDL
    SHELL=/bin/bash
```

p5pRT commented 15 years ago

From @andk

On Sat, 18 Apr 2009 10:27:52 -0700, Jason Rhinelander (via RT) <perlbug-followup@perl.org> said:

> The last output is clearly wrong.

Bug confirmed in bleadperl. The short and seven bit version I used was:

perl -e 'print "\x{fb02}" =~ /^[[:xdigit:]]$/i ? "not ok\n" : "ok\n"'

-- andreas

p5pRT commented 15 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 15 years ago

From tchrist@perl.com

> On Sat, 18 Apr 2009 10:27:52 -0700, Jason Rhinelander (via RT) <pe
>
> > The last output is clearly wrong.
>
> Bug confirmed in bleadperl. The short and seven bit version I used was:
>
> perl -e 'print "\x{fb02}" =~ /^[[:xdigit:]]$/i ? "not ok\n" : "ok\n"'

I've been playing with this.

We're into tricky folding issues here.

What appears to be going on is that the special-casing rules in lib/unicore/SpecialCasing.txt are kicking in, producing what appears to be a compatibility (K) rather than a canonical decomposition, then applying the property test to only the first of the decomposed characters rather than to all of them. The engine then skips over to the next original character, missing the remaining decomposed ones.
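The theory above can be made concrete without touching the regex engine. The following is a minimal sketch, not a description of what the engine does internally: it upper-cases a character with `uc` (which the original test case shows turning the ligature into "FL") and reports, for each resulting character, whether it is an ASCII hex digit. The sub name `uppercase_hex_flags` is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): upper-case one character
# and return a list of [code point, is_hex_digit] pairs, one per character
# of the upper-cased result.
sub uppercase_hex_flags {
    my ($char) = @_;
    my @flags;
    for my $c (split //, uc $char) {
        push @flags, [ ord($c), ($c =~ /\A[0-9A-Fa-f]\z/ ? 1 : 0) ];
    }
    return @flags;
}

# U+FB02 LATIN SMALL LIGATURE FL and U+1E9A LATIN SMALL LETTER A WITH
# RIGHT HALF RING, the two code points discussed in this thread.
for my $cp (0xFB02, 0x1E9A) {
    my @parts = map {
        sprintf "U+%04X (%s)", $_->[0], $_->[1] ? "hex digit" : "not a hex digit"
    } uppercase_hex_flags(chr $cp);
    printf "U+%04X upper-cases to: %s\n", $cp, join(", ", @parts);
}
```

For both code points, the SpecialCasing entries quoted at the end of this message start with a hex digit (U+0046 F, U+0041 A) followed by a character that is not one. That is the shape the theory predicts would produce a false match if only the first character were tested.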

You can also trigger the problem with other code points, like U+1E9A, which is LATIN SMALL LETTER A WITH RIGHT HALF RING, and for the same reason.

    % perl -E 'say chr(0x1E9A) =~ /^\p{Hex Digit}$/ ? "yup" : "nope"'
    nope
    % perl -E 'say chr(0x1E9A) =~ /^\p{Hex Digit}$/i ? "yup" : "nope"'
    yup

    % perl -E 'say chr(0x1E9A) =~ /^\p{ASCII Hex Digit}$/ ? "yup" : "nope"'
    nope
    % perl -E 'say chr(0x1E9A) =~ /^\p{ASCII Hex Digit}$/i ? "yup" : "nope"'
    yup

    % perl -E 'say chr(0x1E9A) =~ /^\p{ XDigit }$/ ? "yup" : "nope"'
    nope
    % perl -E 'say chr(0x1E9A) =~ /^\p{ XDigit }$/i ? "yup" : "nope"'
    yup

However, that theory is contradicted by this:

    % perl -E 'say chr(0x149) =~ /[[:alpha::]]/ ? "yup" : "nope"'
    nope
    % perl -E 'say chr(0x149) =~ /[[:alpha::]]/i ? "yup" : "nope"'
    nope
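One caveat when re-running this check: the pattern above is spelled `[[:alpha::]]`, with a doubled colon, whereas the POSIX class is written `[[:alpha:]]`. Below is a minimal sketch of the same test using the standard spelling. The sub name `matches_alpha` is hypothetical, and the sketch deliberately does not rely on how any particular perl parses the doubled-colon form.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption): does the string contain
# a character matching [[:alpha:]], optionally under /i?
sub matches_alpha {
    my ($char, $case_insensitive) = @_;
    return $case_insensitive
        ? ($char =~ /[[:alpha:]]/i ? 1 : 0)
        : ($char =~ /[[:alpha:]]/  ? 1 : 0);
}

# U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
my $n_apostrophe = chr(0x149);
print "plain: ", (matches_alpha($n_apostrophe, 0) ? "yup" : "nope"), "\n";
print "/i:    ", (matches_alpha($n_apostrophe, 1) ? "yup" : "nope"), "\n";
```

Since U+0149 is itself a letter, a match with the standard spelling would be expected with or without `/i`, so this code point may not be a strong test of the folding theory either way.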

But the compiled regex suggests that, with the ligatures, this is what's happening:

    Matching REx "^\p{XDigit}$" against "%x{fb02}"
    UTF-8 string...
       0 <> <%x{fb02}>  |  1:BOL(2)
       0 <> <%x{fb02}>  |  2:ANYOF{i}[{unicode}+utf8::XDigit](14)
       3 <%x{fb02}> <>  | 14:EOL(15)
       3 <%x{fb02}> <>  | 15:END(0)
    Match successful!
    All case-insensitive XDigit: Yes

--tom

    # SpecialCasing-5.1.0.txt
    # Date: 2008-03-03, 21:58:10 GMT [MD]
    #
    # Unicode Character Database
    # Copyright (c) 1991-2008 Unicode, Inc.
    # For terms of use, see http://www.unicode.org/terms_of_use.html
    # For documentation, see UCD.html
    #
    # Special Casing Properties
    #
    # This file is a supplement to the UnicodeData file.
    # It contains additional information about the casing of Unicode characters.
    # (For compatibility, the UnicodeData.txt file only contains case mappings for
    # characters where they are 1-1, and independent of context and language.
    # For more information, see the discussion of Case Mappings in the Unicode Standard.
    #
    # All code points not listed in this file that do not have a simple case mappings
    # in UnicodeData.txt map to themselves.
    # ================================================================================
    # Format
    # ================================================================================
    # The entries in this file are in the following machine-readable format:
    #
    # <code>; <lower> ; <title> ; <upper> ; (<condition_list> ;)? # <comment>
    #

    # Ligatures

    FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
    FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
    FB02; FB02; 0046 006C; 0046 004C; # LATIN SMALL LIGATURE FL
    FB03; FB03; 0046 0066 0069; 0046 0046 0049; # LATIN SMALL LIGATURE FFI
    FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL

    # No corresponding uppercase precomposed character

    0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
    1E9A; 1E9A; 0041 02BE; 0041 02BE; # LATIN SMALL LETTER A WITH RIGHT HALF RING

p5pRT commented 13 years ago

From @khwilliamson

This has been fixed in 5.14

p5pRT commented 13 years ago

@khwilliamson - Status changed from 'open' to 'resolved'
