Tracking down a reason for crashes of a perl process while processing certain obfuscated spam messages, it turns out that a UTF-8 character with a large (and invalid) codepoint is causing a perl 5.10.1 crash while matching such a string against a particular regular expression.
This is happening on FreeBSD 7.2, using perl as installed from ports with no special settings.
Reducing the actual crashing application to a small test case, here it is:
#!/usr/bin/perl -T
use strict;

# Here is a HTML snippet from a malicious/obfuscated mail message.
# Note the last character has an invalid and huge UTF-8 code
# (as a result of an unrelated bug in HTML::Parser).
#
my $t = '<a>Attention Home&#959&#969n&#1257rs...1&#1109t '.
        'T&#1110&#1084e E&#957&#1257&#1075075</a>';

$t =~ s/&#(\d+)/chr($1)/ge;    # convert HTML entities to UTF8
$t .= substr($ENV{PATH},0,0);  # make it tainted

# show character codes in the resulting string
print join(", ", map {ord} split(//,$t)), "\n";

# The following regexp evaluation crashes perl 5.10.1 on FreeBSD.
# Note that $t must be tainted and must have the UTF8 flag on,
# otherwise the crash seems to be avoided.

$t =~ /( |\b)(http:|www\.)/i;
and here is the result (hand wrapped):
60, 97, 62, 65, 116, 116, 101, 110, 116, 105, 111, 110, 32, 72, 111,
109, 101, 959, 969, 110, 1257, 114, 115, 46, 46, 46, 49, 1109, 116,
32, 84, 1110, 1084, 101, 32, 69, 957, 1257, 1075075, 60, 47, 97, 62
Segmentation fault: 11 (core dumped)
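As a cross-check on the last code in that list, the following is a minimal sketch (not part of the original report) that converts 1075075 to hexadecimal and to its UTF-8 octets. The variable names are illustrative only. The hexadecimal form is what the -Dr trace prints as %x{106783}, and the octets can be compared with the bytes gdb prints at the end of stringarg, where the Latin-1 character ô is octal 364.

```perl
use strict;
use warnings;

my $cp  = 1075075;              # last code printed by the test case
my $hex = sprintf '%X', $cp;    # hexadecimal form of the code point

my $utf8 = chr $cp;
utf8::encode($utf8);            # in place: one character -> UTF-8 octets

# Render each octet the way gdb does, as a backslash plus three octal digits.
my $octal = join '', map { sprintf '\\%03o', ord } split //, $utf8;

print "U+$hex is ", length($utf8), " octets: $octal\n";
```

The code point takes four octets, so it is well outside the single-byte range that byte-oriented lookups assume.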
Here is a backtrace as obtained from a core dump (cut/pasted from screen, the actual 8-bit characters may be wrong):
$ gdb -c perl5.10.1.core /usr/local/bin/perl5.10.1
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `perl5.10.1'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so...done.
Loaded symbols for /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so
Reading symbols from /lib/libm.so.5...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libcrypt.so.4...done.
Loaded symbols for /lib/libcrypt.so.4
Reading symbols from /lib/libutil.so.7...done.
Loaded symbols for /lib/libutil.so.7
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
3049 REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
(gdb) bt
#0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
#1 0x00000000408b7b0a in S_regtry (reginfo=0x7fffffffe590, startpos=0x7fffffffe6d8) at regexec.c:2355
#2 0x00000000408b6a7a in Perl_regexec_flags (prog=0x4114f1a0,
stringarg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>",
strend=0x4111d6f3 "/a>",
strbeg=0x4111d6c0 "<a>Attention HomeοÏ\211nÓ©rs...1Ñ\225t TÑ\226мe Eνөô\206\236\203</a>", minend=0,
sv=0x4113ec48, data=0x0, flags=3) at regexec.c:2146
#3 0x00000000407864a3 in Perl_pp_match () at pp_hot.c:1356
#4 0x000000004073fa4c in Perl_runops_debug () at dump.c:1968
#5 0x00000000406905d8 in S_run_body (oldscope=1) at perl.c:2431
#6 0x000000004068f9b0 in perl_run (my_perl=0x41102104) at perl.c:2349
#7 0x0000000000400bf4 in main (argc=3, argv=0x7fffffffea90, env=0x7fffffffeab0) at perlmain.c:117
(gdb)
And lastly, here is a perl debug output using the -Dr command line option:
Compiling REx "( |\b)(http:|www\.)"
Final program:
1: OPEN1 (3)
3: BRANCH (6)
4: EXACTF < > (8)
6: BRANCH (FAIL)
7: BOUND (8)
8: CLOSE1 (10)
10: OPEN2 (12)
12: TRIE-EXACTF[HWhw] (19)
<http:>
<www.>
19: CLOSE2 (21)
21: END (0)
minlen 4
Omitting $` $& $' support.
EXECUTING...
[...]
Matching REx "( |\b)(http:|www\.)" against "<a>Attention Home%x{3bf}%x{3c9}n%x{4e9}rs...1%x{455}t T"...
UTF-8 string...
0 <> <<a>Attenti> | 1:OPEN1(3)
0 <> <<a>Attenti> | 3:BRANCH(6)
0 <> <<a>Attenti> | 4: EXACTF < >(8)
Compiling REx "(^|[/\\])warnings\.pmc?$"
rarest char w at 0
Final program:
1: OPEN1 (3)
3: BRANCH (5)
4: BOL (17)
5: BRANCH (FAIL)
6: ANYOF[/\\][] (17)
17: CLOSE1 (19)
19: EXACT <warnings.pm> (23)
23: CURLY {0,1} (27)
25: EXACT \
Some additional information on non-vulnerable systems, provided by Jan iankko Lieskovsky / Red Hat Security Response Team:
This issue affects only perl-5.10.1 (didn't check perl-5.11.1.tar.gz though).
But did check perl-{5.8.0, 5.8.5, 5.8.8, 5.10.0} and the provided reproducer [2] doesn't crash on these versions, while it cleanly crashes on upstream perl-5.10.1.tar.gz. [...] Have checked both versions (with / without (original PoC) this add-on) on perl-{5.8.0, 5.8.5, 5.8.8, 5.10.0} - these are not vulnerable.
Both versions (original && modified PoC) crash with perl-5.10.1.tar.gz.
(Hopefully the above information could also be stated in the upstream PerlBug to explicitly mention the (not) vulnerable versions.)
2009/10/22 Mark Martinec <perlbug-followup@perl.org>:
[...]
Here is a backtrace as obtained from a core dump [...]
#0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
3049 REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,
Unfortunately this is just masking the cause; I'm pretty sure the problem is in utf8.c.
You would have ended up in this code:
case trie_utf8_fold: \
    if ( foldlen>0 ) { \
        uvc_unfolded = uvc = utf8n_to_uvuni( uscan, UTF8_MAXLEN, &len, uniflags ); \
        foldlen -= len; \
        uscan += len; \
        len=0; \
    } else { \
        uvc_unfolded = uvc = utf8n_to_uvuni( (U8*)uc, UTF8_MAXLEN, &len, uniflags ); \
        uvc = to_uni_fold( uvc, foldbuf, &foldlen ); \
        foldlen -= UNISKIP( uvc ); \
        uscan = foldbuf + UNISKIP( uvc ); \
    } \
    break;
I'm guessing the crash is in the second clause, probably in to_uni_fold().
And lastly, here is a perl debug output using the -Dr command line option:
Thanks\, your report is very complete.
Compiling REx "( |\b)(http:|www\.)"
Final program:
1: OPEN1 (3)
3: BRANCH (6)
4: EXACTF < > (8)
6: BRANCH (FAIL)
7: BOUND (8)
8: CLOSE1 (10)
10: OPEN2 (12)
12: TRIE-EXACTF[HWhw] (19)
<http:>
<www.>
19: CLOSE2 (21)
21: END (0)
minlen 4
Omitting $` $& $' support.
EXECUTING...
[...]
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 1:OPEN1(3)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 3:BRANCH(6)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 4: EXACTF < >(8)
failed...
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 6:BRANCH(8)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 7: BOUND(8)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 8: CLOSE1(10)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 10: OPEN2(12)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| 12: TRIE-EXACTF[HWhw](19)
46 <E%x{3bd}%x{4e9}> <%x{106783}>| State: 1 Accepted: 0
I think the regex engine is the only place that uses the unicode folding logic. I'll try to dig further.
cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
The RT System itself - Status changed from 'new' to 'open'
2009/10/22 Mark Martinec <perlbug-followup@perl.org>:
[...]
Bisected down to 8902bb05b18c9858efa90229ca1ee42b17277554 (http://perl5.git.perl.org/perl.git/commit/8902bb05):
Author: Slaven Rezic <slaven@rezic.de>
Date: Sun Jan 4 17:28:33 2009 +0100
Another regexp failure with utf8-flagged string and byte-flagged
pattern (reminder)
Date: 17 Nov 2007 16:29:29 +0100
Message-ID: <87r6iohova.fsf@biokovo-amd64.herceg.de>
(cherry picked from commit c012444fd89eef64e1d1687642cdb9f968e96739)
Vincent.
2009/10/23 Vincent Pit <perl@profvince.com>:
[...]
thanks - that helps a lot. yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
2009/10/23 Vincent Pit <perl@profvince.com>:
[...]
Bisected down to 8902bb05b18c9858efa90229ca1ee42b17277554 (http://perl5.git.perl.org/perl.git/commit/8902bb05):
[...]
The simple fix is to add a guard to the if clause to prevent looking up chars>255.
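The guard mentioned above can be pictured with a small Perl sketch. This is only an illustration under stated assumptions, not the code in regexec.c: the 256-entry table @fold_latin1 and the helper fold_guarded are hypothetical names, and the assumption is that the unguarded lookup indexed a table sized for code points 0..255.

```perl
use strict;
use warnings;

# Hypothetical fold table for code points 0..255 (ASCII A-Z folded to a-z).
my @fold_latin1 = map { ($_ >= 65 && $_ <= 90) ? $_ + 32 : $_ } 0 .. 255;

# Guarded lookup: only code points that fit the table are used as an index.
# Anything above 255 is returned unchanged instead of indexing past the end.
sub fold_guarded {
    my ($cp) = @_;
    return $cp < 256 ? $fold_latin1[$cp] : $cp;
}

printf "%d -> %d\n", $_, fold_guarded($_) for ord('H'), 1075075;
```

In C an out-of-range index reads past the end of the array, which is the kind of access that can end in a segmentation fault; Perl arrays simply return undef, so the sketch shows the shape of the check rather than the crash.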
The thing is, the original patch sort of hides a deeper problem. It may be that the trie stuff just has to be disabled for case-insensitive matches, as there doesn't seem to be a way to support the run-time "decide how to match based on the string AND pattern" behaviour of earlier perls in the trie structure without breaking case-insensitive matches.
For instance in old perls:
use Test::More;
use Encode;
{
    # more TRIE/AHOCORASICK problems with mixed utf8 / latin-1 and case folding
    for my $chr (181) { #160 .. 255) {
        my $chr_byte = chr($chr);      # MICRO SIGN as a byte string
        my $chr_utf8 = chr($chr);
        utf8::upgrade($chr_utf8);      # same character, UTF8 flag on
        my $chr_high = chr(0x3bc);     # the Unicode fold of MICRO SIGN
        my $rx = qr{.?(?:$chr_byte|X)}i;
        ok($chr_utf8 =~ $rx, "utf8/latin, codepoint $chr ". encode_utf8($chr_utf8));
        ok($chr_high =~ $rx, "utf8/latin, codepoint $chr ". encode_utf8($chr_utf8));
    }
}
done_testing;   # added so the snippet runs standalone
Should match. In TRIE'd perls it won't, because under unicode rules these folding rules apply that do not apply in the non-unicode behaviour:
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
I suppose any non-unicode pattern that doesn't use these can still be case-insensitively matched with a trie.
Hrmph.
Cheers, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
Resolved by:
commit 0abd0d78a73da1c4d13b1c700526b7e5d03b32d4
Author: Yves Orton <demerphq@gmail.com>
Date: Sun Oct 25 20:37:08 2009 +0100
disable non-unicode case insensitive trie matching
Also revert 8902bb05b18c9858efa90229ca1ee42b17277554 as it merely
masked one symptom of the deeper problems.
Also fixes RT #69973, which was a segfault which was exposed by
8902bb05\, see the ticket for further details.
http://rt.perl.org/rt3//Public/Bug/Display.html?id=69973
At the core of this is the problem that in unicode matching a bunch
of code points have case folding rules beyond just A-Z/a-z. Since
the case folding rules are decided at runtime by the string, we can't
use the same TRIE tables for both unicode/non-unicode matching.
Until this is reconciled or some other solution is found, case
insensitive matching only gets the TRIE optimisation when the pattern
is unicode.
From CaseFolding.txt:
00B5; C; 03BC; # MICRO SIGN
00C0; C; 00E0; # LATIN CAPITAL LETTER A WITH GRAVE
00C1; C; 00E1; # LATIN CAPITAL LETTER A WITH ACUTE
00C2; C; 00E2; # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3; C; 00E3; # LATIN CAPITAL LETTER A WITH TILDE
00C4; C; 00E4; # LATIN CAPITAL LETTER A WITH DIAERESIS
00C5; C; 00E5; # LATIN CAPITAL LETTER A WITH RING ABOVE
00C6; C; 00E6; # LATIN CAPITAL LETTER AE
00C7; C; 00E7; # LATIN CAPITAL LETTER C WITH CEDILLA
00C8; C; 00E8; # LATIN CAPITAL LETTER E WITH GRAVE
00C9; C; 00E9; # LATIN CAPITAL LETTER E WITH ACUTE
00CA; C; 00EA; # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
00CB; C; 00EB; # LATIN CAPITAL LETTER E WITH DIAERESIS
00CC; C; 00EC; # LATIN CAPITAL LETTER I WITH GRAVE
00CD; C; 00ED; # LATIN CAPITAL LETTER I WITH ACUTE
00CE; C; 00EE; # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
00CF; C; 00EF; # LATIN CAPITAL LETTER I WITH DIAERESIS
00D0; C; 00F0; # LATIN CAPITAL LETTER ETH
00D1; C; 00F1; # LATIN CAPITAL LETTER N WITH TILDE
00D2; C; 00F2; # LATIN CAPITAL LETTER O WITH GRAVE
00D3; C; 00F3; # LATIN CAPITAL LETTER O WITH ACUTE
00D4; C; 00F4; # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
00D5; C; 00F5; # LATIN CAPITAL LETTER O WITH TILDE
00D6; C; 00F6; # LATIN CAPITAL LETTER O WITH DIAERESIS
00D8; C; 00F8; # LATIN CAPITAL LETTER O WITH STROKE
00D9; C; 00F9; # LATIN CAPITAL LETTER U WITH GRAVE
00DA; C; 00FA; # LATIN CAPITAL LETTER U WITH ACUTE
00DB; C; 00FB; # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
00DC; C; 00FC; # LATIN CAPITAL LETTER U WITH DIAERESIS
00DD; C; 00FD; # LATIN CAPITAL LETTER Y WITH ACUTE
00DE; C; 00FE; # LATIN CAPITAL LETTER THORN
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
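The point in the commit message that the folding rules are decided at runtime by the string can be seen with one of the CaseFolding.txt pairs listed there (00C0 folds to 00E0). The following is a minimal sketch, assuming a perl running with its default regex rules (no `use feature 'unicode_strings'`, no locale); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $rx = qr/\A\xe0\z/i;     # byte pattern: LATIN SMALL LETTER A WITH GRAVE

my $bytes = "\xc0";         # LATIN CAPITAL LETTER A WITH GRAVE, UTF8 flag off
my $wide  = "\xc0";
utf8::upgrade($wide);       # same character, UTF8 flag on

# Both strings compare equal as characters, yet the same compiled pattern
# is applied with different folding rules depending on the UTF8 flag.
printf "equal as strings:   %s\n", $bytes eq $wide ? 'yes' : 'no';
printf "byte string match:  %s\n", $bytes =~ $rx   ? 'yes' : 'no';
printf "utf8 string match:  %s\n", $wide  =~ $rx   ? 'yes' : 'no';
```

Under those default rules only the UTF8-flagged copy is expected to match, which is why a single precomputed fold table cannot serve both cases.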
@demerphq - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#69973 (status was 'resolved')
Searchable as RT69973$