Simple Regex causes SEGV when run on specific data

p5pRT commented 18 years ago

Migrated from rt.perl.org#37836 (status was 'resolved')

Searchable as RT37836$

p5pRT commented 18 years ago

From ralphbolton@mail2Sexy.com

Created by ralphbolton@mail2sexy.com

I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV). There's no way to stop Perl doing this (meaning you can't put the offending code in an eval() or anything).

The problem seems to be in running a fairly simple regex on a specific string of data. I don't know exactly what the string of data is, but I have a file that causes the problem.

In case Perlbug has any transmission issues, see http://www.coofercat.com/wiki/Perl586SegvRegex

Perl Info

```
Flags:
    category=core
    severity=medium

This perlbug was built using Perl v5.8.6 in the Red Hat build system.
It is being executed now by Perl v5.8.6 - Thu Dec 1 13:48:06 EST 2005.

Site configuration information for perl v5.8.6:

Configured by Red Hat, Inc. at Thu Dec 1 13:48:06 EST 2005.

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=linux, osvers=2.6.9-22.18.bz155725.elsmp, archname=i386-linux-thread-multi
    uname='linux hs20-bc1-7.build.redhat.com 2.6.9-22.18.bz155725.elsmp #1 smp thu nov 17 15:34:08 est 2005 i686 i686 i386 gnulinux '
    config_args='-des -Doptimize=-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m32 -march=i386 -mtune=pentium4 -fasynchronous-unwind-tables -Dversion=5.8.6 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_endprotoent_r_proto -Ud_endservent_r_proto -Ud_sethostent_r_proto -Ud_setprotoent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.5 5.8.4 5.8.3'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m32 -march=i386 -mtune=pentium4 -fasynchronous-unwind-tables',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
    ccversion='', gccversion='4.0.2 20051125 (Red Hat 4.0.2-8)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
    perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
    libc=/lib/libc-2.3.5.so, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.5'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.6/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:

@INC for perl v5.8.6:
    /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.4/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.6
    /usr/lib/perl5/site_perl/5.8.5
    /usr/lib/perl5/site_perl/5.8.4
    /usr/lib/perl5/site_perl/5.8.3
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.4/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.3/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.6
    /usr/lib/perl5/vendor_perl/5.8.5
    /usr/lib/perl5/vendor_perl/5.8.4
    /usr/lib/perl5/vendor_perl/5.8.3
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.6/i386-linux-thread-multi
    /usr/lib/perl5/5.8.6
    .

Environment for perl v5.8.6:
    HOME=/home/admin
    LANG=en_US.UTF-8
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/home/admin/bin
    PERL_BADLANG (unset)
    SHELL=/bin/bash
```

p5pRT commented 18 years ago

From ralphbolton@mail2Sexy.com

segv.tgz

p5pRT commented 18 years ago

From @iabyn

On Sun, Dec 04, 2005 at 03:26:23PM -0800, Ralph Bolton wrote:

I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV).

The code can be reduced to the following (reading from a var rather than a file as in the OP's code):

    my $s = "\xa2\xf8";

    open F, "<:utf8", \$s;
    while (<F>) {
        s/[\000]+//g;        # Causes a SEGV
    }

outputs:

    utf8 "\xA2" does not map to Unicode at /tmp/p3 line 6, <F> line 1.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf8) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Segmentation fault

Presumably feeding in malformed utf8 is tripping something over. I haven't really got time at the moment to look into this further, so if anyone else wants to volunteer...

-- Britain, Britain, Britain! Discovered by Sir Henry Britain in sixteen-oh-ten. Sold to Germany a year later for a pfennig and the promise of a kiss. Destroyed in eighteen thirty-fourty two, and rebuilt a week later by a man. This we know. Hello. But what of the people of Britain? Who they? What do? And why? -- Little Britain

p5pRT commented 18 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 18 years ago

From @nwc10

On Mon, Dec 05, 2005 at 12:33:18PM +0000, Dave Mitchell wrote:

On Sun, Dec 04, 2005 at 03:26:23PM -0800, Ralph Bolton wrote:

I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV).

The code can be reduced to the following (reading from a var rather than a file as in the OP's code):

    my $s = "\xa2\xf8";

    open F, "<:utf8", \$s;
    while (<F>) {
        s/[\000]+//g;        # Causes a SEGV
    }

outputs:

    utf8 "\xA2" does not map to Unicode at /tmp/p3 line 6, <F> line 1.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf8) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Segmentation fault

Presumably feeding in malformed utf8 is tripping something over. I haven't really got time at the moment to look into this further, so if anyone else wants to volunteer...

I can't get it to reliably crash with that minimal input on FreeBSD. With the original file, I get:

    Program received signal SIGBUS, Bus error.
    0x282a84ab in memmove () from /lib/libc.so.5
    (gdb) where
    #0  0x282a84ab in memmove () from /lib/libc.so.5
    #1  0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
    #2  0x080bab16 in Perl_runops_debug () at dump.c:1597
    #3  0x0806221e in S_run_body (oldscope=1) at perl.c:2308
    #4  0x08061db0 in perl_run (my_perl=0x8185030) at perl.c:2235
    #5  0x0805d747 in main (argc=4, argv=0xbfbfebdc, env=0xbfbfebf0) at perlmain.c:103
    (gdb) up
    #1  0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
    2184        Move(s, d, i+1, char);    /* include the NUL */
    (gdb) p s
    $1 = 0x819f19b "\006Á×èñ5¸|\rwÉ·\201û\215\224´R"¢ÚFécwäk\022|\021\217¯ÉåkºK¾vî;*\2228ù\222\224>¾À·\205|"Úáy¶µ3\vy\bºyÚÍqÿ\202þî\212þú\005ýo¡ÄÄbÝeÈþ«ðûª\rìåx"
    (gdb) p d
    $2 = 0x819f0c7 "\004\024"
    (gdb) p i
    $3 = -2

That Move will expand to a call to memmove(d, s, i+1) - clearly a length of -1 is bogus.

Nicholas Clark

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Mon, 5 Dec 2005 12:41:18 +0000, Nicholas Clark <nick@ccl4.org> wrote

On Mon, Dec 05, 2005 at 12:33:18PM +0000, Dave Mitchell wrote:
On Sun, Dec 04, 2005 at 03:26:23PM -0800, Ralph Bolton wrote:

I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV).

The code can be reduced to the following (reading from a var rather than a file as in the OP's code):

    my $s = "\xa2\xf8";

    open F, "<:utf8", \$s;
    while (<F>) {
        s/[\000]+//g;        # Causes a SEGV
    }

I can't get it to reliably crash with that minimal input on FreeBSD. With the original file, I get:

    #1  0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
    2184        Move(s, d, i+1, char);    /* include the NUL */

That Move will expand to a call to memmove(d, s, i+1) - clearly a length of -1 is bogus.

Nicholas Clark

utf8n_to_uvchr returns 0 for the malformed utf-8. (This behavior is documented for utf8n_to_uvuni.)

Therefore regexec.c:S_reginclass (falsely?) matches [\000] with malformed utf-8. (I didn't look into why that behavior of reginclass made i negative (i = strend - s) in pp_subst afterwards...)

    #!perl
    no warnings 'utf8';
    $_ = pack('U0C2', 0xa2, 0xf8); # malformed UTF-8
    print s/[\0]+//g ? "match" : "not match", "\n"; # prints "match"

Perhaps it would be better for reginclass() to croak on malformed utf-8. If so, I think UTF8_ALLOW_ANYUV is preferable to UTF8_ALLOW_ANY, since UTF8_ALLOW_ANY allows all kinds of malformed utf8.

Regards, SADAHIRO Tomoyuki

But the error message is not good, since "Malformed UTF-8 character" is marked with (W utf8) but not (F)...

Inline Patch

```diff
--- regexec.c~  Wed Nov 30 23:24:19 2005
+++ regexec.c   Tue Dec 06 00:14:56 2005
@@ -4710,9 +4710,13 @@
     STRLEN len = 0;
     STRLEN plen;
 
-    if (do_utf8 && !UTF8_IS_INVARIANT(c))
-        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
-                           ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
+    if (do_utf8 && !UTF8_IS_INVARIANT(c)) {
+        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
+                           ckWARN(WARN_UTF8) ? UTF8_CHECK_ONLY :
+                           UTF8_ALLOW_ANYUV|UTF8_CHECK_ONLY);
+        if (len == (STRLEN)-1)
+            Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
+    }
 
     plen = lenp ? *lenp : UNISKIP(NATIVE_TO_UNI(c));
     if (do_utf8 || (flags & ANYOF_UNICODE)) {
### END OF PATCH
```

p5pRT commented 18 years ago

From BQW10602@nifty.com

Oops, I am replying to my own previous mail.

utf8n_to_uvchr returns 0 for the malformed utf-8. (this behavior is documented for utf8n_to_uvuni.)

Therefore regexec.c:S_reginclass (falsely?) matches [\000] with malformed utf-8. (I didn't see into why that behavior of reginclass made negative i (i = strend - s) in pp_subst afterwards...)

    #!perl
    ###no warnings 'utf8';

This line makes the example inappropriate.

Because UTF8_ALLOW_ANY allows all kinds of malformed utf8, utf8n_to_uvchr (etc.) doesn't always return 0 for malformed utf8; it may return some non-zero value.

    $_ = pack('U0C2', 0xa2, 0xf8); # malformed UTF-8
    print s/[\0]+//g ? "match" : "not match", "\n"; # prints "match"

Regards, SADAHIRO Tomoyuki

p5pRT commented 18 years ago

From @rgs

SADAHIRO Tomoyuki wrote:

utf8n_to_uvchr returns 0 for the malformed utf-8. (this behavior is documented for utf8n_to_uvuni.)

Therefore regexec.c:S_reginclass (falsely?) matches [\000] with malformed utf-8. (I didn't see into why that behavior of reginclass made negative i (i = strend - s) in pp_subst afterwards...)

    #!perl
    no warnings 'utf8';
    $_ = pack('U0C2', 0xa2, 0xf8); # malformed UTF-8
    print s/[\0]+//g ? "match" : "not match", "\n"; # prints "match"

Perhaps it may be better reginclass() croaks malformed utf-8. If so, I think UTF8_ALLOW_ANYUV is preferable to UTF8_ALLOW_ANY, since UTF8_ALLOW_ANY allows all kinds of malformed utf8.

OK, I've applied this as #26258, thanks.

But the error message is not good, since "Malformed UTF-8 character" is marked with (W utf8) but not (F)...

It should even be (F) (S utf8) since it's enabled by default. I'll fix.

    --- regexec.c~  Wed Nov 30 23:24:19 2005
    +++ regexec.c   Tue Dec 06 00:14:56 2005
    @@ -4710,9 +4710,13 @@
         STRLEN len = 0;
         STRLEN plen;

    -    if (do_utf8 && !UTF8_IS_INVARIANT(c))
    -        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
    -                           ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
    +    if (do_utf8 && !UTF8_IS_INVARIANT(c)) {
    +        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
    +                           ckWARN(WARN_UTF8) ? UTF8_CHECK_ONLY :
    +                           UTF8_ALLOW_ANYUV|UTF8_CHECK_ONLY);
    +        if (len == (STRLEN)-1)
    +            Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
    +    }

         plen = lenp ? *lenp : UNISKIP(NATIVE_TO_UNI(c));
         if (do_utf8 || (flags & ANYOF_UNICODE)) {
    ### END OF PATCH

p5pRT commented 18 years ago

@rgs - Status changed from 'open' to 'resolved'

p5pRT commented 18 years ago

From BQW10602@nifty.com

On Sun, Dec 04, 2005 at 03:26:23PM -0800, Ralph Bolton wrote:

I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV).

The code can be reduced to the following (reading from a var rather than a file as in the OP's code):

    my $s = "\xa2\xf8";

    open F, "<:utf8", \$s;
    while (<F>) {
        s/[\000]+//g;          # Causes a SEGV
    }

utf8n_to_uvchr returns 0 for the malformed utf-8. (this behavior is documented for utf8n_to_uvuni.)

Therefore regexec.c:S_reginclass (falsely?) matches [\000] with malformed utf-8. (I didn't see into why that behavior of reginclass made negative i (i = strend - s) in pp_subst afterwards...)

[perl #37836] has been resolved by change 26258. What follows is a further investigation.

In the above perl script, reginclass() was called against the string first from find_byclass(), and thereafter from regrepeat(). Every time, reginclass() reported that the malformed utf8 matched NUL.

UTF8SKIP at "\xa2" is 1 and UTF8SKIP at "\xf8" is 5 (see utf8.h).

In find_byclass(), "\xa2" passed the test (s + uskip <= strend). Because reginclass() and regtry() returned true values, the goto statement exited the while loop.

In regrepeat(), which was called indirectly from regtry(), the pointer "scan" was advanced by UTF8SKIP twice and ended up 4 octets past the end "loceol".

//// regexec.c#find_byclass

    case ANYOF:
        if (do_utf8) {
            while (s + (uskip = UTF8SKIP(s)) <= strend) {
                if ((ANYOF_FLAGS(c) & ANYOF_UNICODE) ||
                    !UTF8_IS_INVARIANT((U8)s[0]) ?
                    reginclass(c, (U8*)s, 0, do_utf8) :
                    REGINCLASS(c, (U8*)s)) {
                    if (tmp && (norun || regtry(prog, s)))
                        goto got_it;

//// regexec.c#regrepeat

    case ANYOF:
        if (do_utf8) {
            loceol = PL_regeol;
            while (hardcount < max && scan < loceol &&
                   reginclass(p, (U8*)scan, 0, do_utf8)) {
                scan += UTF8SKIP(scan);
                hardcount++;
            }

Finally, pp_subst() tried to Move() a huge chunk.

//// pp_hot.c#pp_subst

    /* can do inplace substitution? */
    if (c
        ....
        && (!doutf8 || SvUTF8(TARG))) {
        ....
        else {    //// this else corresponds to "if (once)"
            ....
                s = rx->endp[0] + orig;    //// * here rx->endp[0] == 6
            } while (CALLREGEXEC(aTHX_ rx, s, strend, orig, s == m,
            ....
            if (s != d) {
                i = strend - s;    //// * here i == -4
                SvCUR_set(TARG, d - SvPVX_const(TARG) + i);
                Move(s, d, i+1, char);    /* include the NUL */
            }

Conclusions:

- UTF8SKIP is certainly fast (since it reads only the first octet) but harmful against malformed utf8.
- Just returning FALSE for malformed utf8, instead of croaking, could avoid the SEGV.

Since change 26258, reginclass() croaks on malformed utf8. If reginclass() just returned FALSE, the operation could continue, though I don't know whether operations on malformed utf8 should continue.

Regards, SADAHIRO Tomoyuki

p5pRT commented 13 years ago

@khwilliamson - Status changed from 'resolved' to 'open'

p5pRT commented 13 years ago

From @khwilliamson

This bug may be fixed, but the test for it in t/re/pat_rt_report.t is wrong, and works only because the single character ANYOF node is being optimized into an EXACT node. If you add any other character to the class, the optimization goes away and the test fails. It does not cause a segmentation fault, but you do get the correct error message that this is malformed utf8.

I'm not sure how to fix this without further research.

--Karl Williamson

p5pRT commented 7 years ago

From @khwilliamson

The bug had been fixed, but the test was defective, now fixed by 76024109c948397787a27693df5382b9f8005058

-- Karl Williamson

p5pRT commented 7 years ago

@khwilliamson - Status changed from 'open' to 'pending release'

p5pRT commented 7 years ago

From @khwilliamson

Thank you for filing this report. You have helped make Perl better.

With the release today of Perl 5.26.0, this and 210 other issues have been resolved.

Perl 5.26.0 may be downloaded via: https://metacpan.org/release/XSAWYERX/perl-5.26.0

If you find that the problem persists, feel free to reopen this ticket.

p5pRT commented 7 years ago

@khwilliamson - Status changed from 'pending release' to 'resolved'
