Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation

From Mark.Martinec@ijs.si

Created by Mark.Martinec@ijs.si

While tracking down the cause of crashes of a perl process handling certain obfuscated spam messages, I found that a UTF-8 character with a large (and invalid) code point crashes perl 5.10.1 when a string containing it is matched against a particular regular expression.

This happens on FreeBSD 7.2, using perl as installed from ports with no special settings.

Here is the crashing application reduced to a small test case:

#!/usr/bin/perl -T
use strict;

# Here is a HTML snippet from a malicious/obfuscated mail message.
# Note the last character has an invalid and huge UTF-8 code
# (as a result of an unrelated bug in HTML::Parser).
#
my $t = '<a>Attention Home&#959&#969n&#1257rs...1&#1109t '.
        'T&#1110&#1084e E&#957&#1257&#1075075</a>';

$t =~ s/&#(\d+)/chr($1)/ge;    # convert HTML entities to UTF8
$t .= substr($ENV{PATH},0,0);  # make it tainted

# show character codes in the resulting string
print join(", ", map {ord} split(//,$t)), "\n";

# The following regexp evaluation crashes perl 5.10.1 on FreeBSD.
# Note that $t must be tainted and must have the UTF8 flag on,
# otherwise the crash seems to be avoided.

$t =~ /( |\b)(http:|www\.)/i;
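
Since the crash seems to depend on the string being both tainted and UTF8-flagged, it may help to confirm both properties just before the match. The following is only a sketch: `describe_string` is a hypothetical helper that is not part of the test case, and the shortened sample string is an assumption. It relies on `utf8::is_utf8` and `Scalar::Util::tainted`, and needs to be run as `perl -T` for the taint step to have any effect.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# Hypothetical helper (not part of the original test case): report the two
# string properties that the crash appears to depend on.
sub describe_string {
    my ($s) = @_;
    return sprintf("utf8=%s tainted=%s",
        utf8::is_utf8($s) ? "yes" : "no",
        tainted($s)       ? "yes" : "no");
}

my $t = 'E&#957&#1257';                  # shortened sample, an assumption
$t =~ s/&#(\d+)/chr($1)/ge;              # chr() above 255 turns the UTF8 flag on
print "after entity decoding: ", describe_string($t), "\n";

$t .= substr($ENV{PATH} // '', 0, 0);    # taints $t only when running under -T
print "after appending PATH:  ", describe_string($t), "\n";
print "taint mode is ", (${^TAINT} ? "on" : "off"), "\n";
```

Without -T the second line should still report the string as untainted, which matches the observation that the crash is avoided in that case.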

and here is the result (hand wrapped):

60, 97, 62, 65, 116, 116, 101, 110, 116, 105, 111, 110, 32, 72, 111, 109,
101, 959, 969, 110, 1257, 114, 115, 46, 46, 46, 49, 1109, 116, 32, 84,
1110, 1084, 101, 32, 69, 957, 1257, 1075075, 60, 47, 97, 62
Segmentation fault: 11 (core dumped)

Here is a backtrace as obtained from a core dump (cut/pasted from screen, the actual 8-bit characters may be wrong):

$ gdb -c perl5.10.1.core /usr/local/bin/perl5.10.1
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Core was generated by `perl5.10.1'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so...done.
Loaded symbols for /usr/local/lib/perl5/5.10.1/mach/CORE/libperl.so
Reading symbols from /lib/libm.so.5...done.
Loaded symbols for /lib/libm.so.5
Reading symbols from /lib/libcrypt.so.4...done.
Loaded symbols for /lib/libcrypt.so.4
Reading symbols from /lib/libutil.so.7...done.
Loaded symbols for /lib/libutil.so.7
Reading symbols from /lib/libc.so.7...done.
Loaded symbols for /lib/libc.so.7
Reading symbols from /libexec/ld-elf.so.1...done.
Loaded symbols for /libexec/ld-elf.so.1
#0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
3049 REXEC_TRIE_READ_CHAR(trie_type, trie, widecharmap, uc,

(gdb) bt
#0 0x00000000408bb101 in S_regmatch (reginfo=0x7fffffffe590, prog=0x411143a4) at regexec.c:3049
#1 0x00000000408b7b0a in S_regtry (reginfo=0x7fffffffe590, startpos=0x7fffffffe6d8) at regexec.c:2355
#2 0x00000000408b6a7a in Perl_regexec_flags (prog=0x4114f1a0,
stringarg=0x4111d6c0 "<a>Attention HomeÎ¿Ï\211nÓ©rs...1Ñ\225t TÑ\226Ð¼e EÎ½Ó©ô\206\236\203</a>",
strend=0x4111d6f3 "/a>",
strbeg=0x4111d6c0 "<a>Attention HomeÎ¿Ï\211nÓ©rs...1Ñ\225t TÑ\226Ð¼e EÎ½Ó©ô\206\236\203</a>", minend=0, sv=0x4113ec48, data=0x0, flags=3) at regexec.c:2146
#3 0x00000000407864a3 in Perl_pp_match () at pp_hot.c:1356
#4 0x000000004073fa4c in Perl_runops_debug () at dump.c:1968
#5 0x00000000406905d8 in S_run_body (oldscope=1) at perl.c:2431
#6 0x000000004068f9b0 in perl_run (my_perl=0x41102104) at perl.c:2349
#7 0x0000000000400bf4 in main (argc=3, argv=0x7fffffffea90, env=0x7fffffffeab0) at perlmain.c:117

(gdb)
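
To relate the backtrace to the test string: the four bytes just before the closing tag in stringarg (shown by gdb as ô\206\236\203, where ô is the byte \364) appear to be the UTF-8 encoding of the last entity, 1075075. A minimal sketch that reproduces those octets follows; `utf8_octets` is a hypothetical helper name, not something from perl or the test case. It also prints the code point in hex, which shows that the value is numerically below 0x10FFFF, in the plane 16 private-use range.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets of one code point,
# each formatted as three octal digits (the way gdb prints them).
sub utf8_octets {
    my ($cp) = @_;
    my $octets = chr($cp);
    utf8::encode($octets);    # character string -> UTF-8 byte string, in place
    return map { sprintf("%03o", ord($_)) } split //, $octets;
}

my $cp = 1075075;             # the last entity in the test string
printf "U+%X is %s 0x10FFFF\n", $cp, ($cp <= 0x10FFFF ? "within" : "above");
print join(" ", utf8_octets($cp)), "\n";   # 364 206 236 203
```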

And lastly, here is the perl debug output produced with the -Dr command line option:

Compiling REx "( |\b)(http:|www\.)"
Final program:
   1: OPEN1 (3)
   3: BRANCH (6)
   4: EXACTF < > (8)
   6: BRANCH (FAIL)
   7: BOUND (8)
   8: CLOSE1 (10)
  10: OPEN2 (12)
  12: TRIE-EXACTF[HWhw] (19)
      <http:>
      <www.>
  19: CLOSE2 (21)
  21: END (0)
minlen 4
Omitting $` $& $' support.

EXECUTING...

Invalid and tainted utf-8 char crashes perl 5.10.1 in regexp evaluation #9922
