I ran into a strange problem where using [[:xdigit:]] in a case-insensitive regex matches Unicode characters whose case mappings expand to multiple characters, such as U+FB02 LATIN SMALL LIGATURE FL, when it shouldn't.
Test case:
perl -Mutf8 -wle '
    my $fl = "ﬂ";
    print "Length: " . length($fl);
    print "Upper-case: " . uc($fl);
    print "Hex digit: " . ($fl =~ /^[[:xdigit:]]$/ ? "Yes" : "No");
    print "Case-insensitive hex digit: " . ($fl =~ /^[[:xdigit:]]$/i ? "Yes" : "No");
'
This prints:
Length: 1
Upper-case: FL
Hex digit: No
Case-insensitive hex digit: Yes
The last output is clearly wrong.
Changing the [[:xdigit:]]s to [0-9a-fA-F] corrects the problem.
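The workaround can be wrapped in a small helper. This is only a sketch: the name is_ascii_hex_digit is hypothetical, and it lists both cases explicitly so that no /i flag (and therefore no case folding) is involved.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original report): test a single
# character against an explicit ASCII hex-digit class. Both cases are
# spelled out, so the match does not need /i.
sub is_ascii_hex_digit {
    my ($char) = @_;
    return $char =~ /\A[0-9a-fA-F]\z/ ? 1 : 0;
}

for my $char ("f", "F", "G", "\x{FB02}") {
    printf "U+%04X: %s\n", ord($char), is_ascii_hex_digit($char) ? "Yes" : "No";
}
```

Note that this accepts ASCII hex digits only, which is narrower than [[:xdigit:]] on Unicode strings.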
On Sat, 18 Apr 2009 10:27:52 -0700, Jason Rhinelander (via RT) <perlbug-followup@perl.org> said:
> The last output is clearly wrong.
Bug confirmed in bleadperl. The short and seven bit version I used was:
perl -e 'print "\x{fb02}" =~ /^[[:xdigit:]]$/i ? "not ok\n" : "ok\n"'
-- andreas
The RT System itself - Status changed from 'new' to 'open'
I've been playing with this.
We're into tricky folding issues here.
What appears to be going on is that the special-casing rules in lib/unicore/SpecialCasing.txt are kicking in, producing what looks like a compatibility (K) decomposition rather than a canonical one, and the property test is then applied to only the first of the decomposed characters rather than to all of them. The engine then skips over to the next original character, missing the remaining decomposed ones.
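The theory above can be modelled in a few lines of Perl. This is a sketch of the hypothesis only, not the regex engine's actual code: the helper names are hypothetical, [0-9a-f] stands in for the hex-digit property, and fc() (which needs Perl 5.16 or later) stands in for whatever folding the engine applies internally.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() is available from Perl 5.16

# Hypothetical helper: full case fold of one character, split into
# its individual folded characters.
sub folded_chars {
    my ($char) = @_;
    return split //, fc($char);
}

# Models the suspected bug: only the first folded character is tested.
sub first_folded_is_hex {
    my ($char) = @_;
    my @folded = folded_chars($char);
    return $folded[0] =~ /\A[0-9a-f]\z/ ? 1 : 0;
}

# Tests every folded character against the property.
sub all_folded_are_hex {
    my ($char) = @_;
    for my $c (folded_chars($char)) {
        return 0 unless $c =~ /\A[0-9a-f]\z/;
    }
    return 1;
}

for my $cp (0xFB02, 0x1E9A, 0x0149) {
    my $char = chr $cp;
    printf "U+%04X folds to %d chars; first-only: %d, all: %d\n",
        $cp, scalar(folded_chars($char)),
        first_folded_is_hex($char), all_folded_are_hex($char);
}
```

Under this model U+FB02 folds to "fl", so a first-character-only test sees "f" and reports a hex digit, while testing every folded character rejects it because of the "l".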
You can also trigger the problem with other code points, like U+1E9A, which is LATIN SMALL LETTER A WITH RIGHT HALF RING, and for the same reason.
% perl -E 'say chr(0x1E9A) =~ /^\p{Hex Digit}$/ ? "yup" : "nope"'
nope
% perl -E 'say chr(0x1E9A) =~ /^\p{Hex Digit}$/i ? "yup" : "nope"'
yup
% perl -E 'say chr(0x1E9A) =~ /^\p{ASCII Hex Digit}$/ ? "yup" : "nope"'
nope
% perl -E 'say chr(0x1E9A) =~ /^\p{ASCII Hex Digit}$/i ? "yup" : "nope"'
yup
% perl -E 'say chr(0x1E9A) =~ /^\p{ XDigit }$/ ? "yup" : "nope"'
nope
% perl -E 'say chr(0x1E9A) =~ /^\p{ XDigit }$/i ? "yup" : "nope"'
yup
However, that theory is contra-indicated by this:
% perl -E 'say chr(0x149) =~ /[[:alpha::]]/ ? "yup" : "nope"'
nope
% perl -E 'say chr(0x149) =~ /[[:alpha::]]/i ? "yup" : "nope"'
nope
But the compiled regex suggests that with the ligatures, this is what's happening:
Matching REx "^\p{XDigit}$" against "%x{fb02}"
UTF-8 string...
   0 <> <%x{fb02}>           |  1:BOL(2)
   0 <> <%x{fb02}>           |  2:ANYOF{i}[{unicode}+utf8::XDigit](14)
   3 <%x{fb02}> <>           | 14:EOL(15)
   3 <%x{fb02}> <>           | 15:END(0)
Match successful!
All case-insensitive XDigit: Yes
--tom
# SpecialCasing-5.1.0.txt
# Date: 2008-03-03, 21:58:10 GMT [MD]
#
# Unicode Character Database
# Copyright (c) 1991-2008 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
# For documentation, see UCD.html
#
# Special Casing Properties
#
# This file is a supplement to the UnicodeData file.
# It contains additional information about the casing of Unicode characters.
# (For compatibility, the UnicodeData.txt file only contains case mappings for
# characters where they are 1-1, and independent of context and language.
# For more information, see the discussion of Case Mappings in the Unicode Standard.
#
# All code points not listed in this file that do not have a simple case mappings
# in UnicodeData.txt map to themselves.
# ================================================================================
# Format
# ================================================================================
# The entries in this file are in the following machine-readable format:
#
# <code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>
# Ligatures
FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
FB01; FB01; 0046 0069; 0046 0049; # LATIN SMALL LIGATURE FI
FB02; FB02; 0046 006C; 0046 004C; # LATIN SMALL LIGATURE FL
FB03; FB03; 0046 0066 0069; 0046 0046 0049; # LATIN SMALL LIGATURE FFI
FB04; FB04; 0046 0066 006C; 0046 0046 004C; # LATIN SMALL LIGATURE FFL
# No corresponding uppercase precomposed character
0149; 0149; 02BC 004E; 02BC 004E; # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
1E9A; 1E9A; 0041 02BE; 0041 02BE; # LATIN SMALL LETTER A WITH RIGHT HALF RING
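To make the entries above easier to read, here is a minimal sketch of a parser for one unconditional SpecialCasing.txt line. The sub name parse_special_casing_line and the hash keys are hypothetical, chosen for illustration; this is not how perl itself loads the file.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a space-separated list of hex code points
# ("0046 004C") into the string it denotes ("FL").
sub code_points_to_string {
    my ($list) = @_;
    return join '', map { chr(hex($_)) } split ' ', $list;
}

# Hypothetical helper: parse one entry of the form
#   code; lower; title; upper; # comment
# Returns a hash ref, or nothing for comment-only or blank lines.
sub parse_special_casing_line {
    my ($line) = @_;
    my ($data, $comment) = split /\s*#\s*/, $line, 2;
    return unless defined $data && $data =~ /\S/;

    my @fields = split /\s*;\s*/, $data;
    return unless @fields >= 4;

    my ($code, $lower, $title, $upper) = @fields;
    return {
        code  => hex($code),
        lower => code_points_to_string($lower),
        title => code_points_to_string($title),
        upper => code_points_to_string($upper),
        name  => $comment,
    };
}

my $entry = parse_special_casing_line(
    'FB02; FB02; 0046 006C; 0046 004C; # LATIN SMALL LIGATURE FL');
printf "U+%04X upper-cases to %s (%d characters)\n",
    $entry->{code}, $entry->{upper}, length $entry->{upper};
```

The two-character upper-case mapping matches the "Upper-case: FL" line printed by the original test case.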
This has been fixed in Perl 5.14.
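A quick way to check whether a given perl still shows the behaviour reported in this ticket is sketched below. The sub name misfolded_code_points is hypothetical; the two code points and the patterns are the ones used earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return the code points from this ticket that
# still match a hex-digit class under /i on the running perl.
sub misfolded_code_points {
    my @bad;
    for my $cp (0xFB02, 0x1E9A) {
        my $char = chr $cp;
        push @bad, $cp
            if $char =~ /^[[:xdigit:]]$/i || $char =~ /^\p{XDigit}$/i;
    }
    return @bad;
}

my @bad = misfolded_code_points();
print @bad
    ? "not ok: " . join(", ", map { sprintf "U+%04X", $_ } @bad) . "\n"
    : "ok\n";
```

On a perl that includes the fix this should print "ok"; on an affected perl it lists the code points that still match.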
@khwilliamson - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#64838 (status was 'resolved')
Searchable as RT64838$