I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV). There's no way to stop Perl doing this (meaning you can't put the offending code in an eval() or anything).
The problem seems to be in running a fairly simple regex on a specific string of data. I don't know exactly what the string of data is, but I have a file that causes the problem.
In case Perlbug has any transmission issues, see http://www.coofercat.com/wiki/Perl586SegvRegex
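On Sun, Dec 04, 2005 at 03:26:23PM -0800, Ralph Bolton wrote:
> I've managed to track down a problem with Perl 5.8.6 (perl-5.8.6-18, RPM on Redhat Fedora 4 with kernel 2.6.13-1.1526_FC4). Essentially, it causes the Perl process to abort with a Segmentation Violation (SEGV).

The code can be reduced to the following (reading from a var rather than a file as in the OP's code):

    my $s = "\xa2\xf8";
    open F, "<:utf8", \$s;
    while (<F>) {
        s/[\000]+//g; # Causes a SEGV
    }

outputs:

    utf8 "\xA2" does not map to Unicode at /tmp/p3 line 6, <F> line 1.
    Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf8) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
    Segmentation fault

Presumably feeding in malformed utf8 is tripping something over. I haven't really got time at the moment to look into this further, so if anyone else wants to volunteer...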
The RT System itself - Status changed from 'new' to 'open'
On Mon, Dec 05, 2005 at 12:33:18PM +0000, Dave Mitchell wrote:
> The code can be reduced to the following (reading from a var rather than a file as in the OP's code):
>
>     my $s = "\xa2\xf8";
>     open F, "<:utf8", \$s;
>     while (<F>) {
>         s/[\000]+//g; # Causes a SEGV
>     }
>
> outputs:
>
>     utf8 "\xA2" does not map to Unicode at /tmp/p3 line 6, <F> line 1.
>     Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
>     Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
>     Malformed UTF-8 character (unexpected continuation byte 0xa2, with no preceding start byte) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
>     Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xf8) in substitution (s///) at /tmp/p3 line 7, <F> line 1.
>     Segmentation fault
>
> Presumably feeding in malformed utf8 is tripping something over. I haven't really got time at the moment to look into this further, so if anyone else wants to volunteer...
I can't get it to reliably crash with that minimal input on FreeBSD. With the original file, I get:

    Program received signal SIGBUS, Bus error.
    0x282a84ab in memmove () from /lib/libc.so.5
    (gdb) where
    #0  0x282a84ab in memmove () from /lib/libc.so.5
    #1  0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
    #2  0x080bab16 in Perl_runops_debug () at dump.c:1597
    #3  0x0806221e in S_run_body (oldscope=1) at perl.c:2308
    #4  0x08061db0 in perl_run (my_perl=0x8185030) at perl.c:2235
    #5  0x0805d747 in main (argc=4, argv=0xbfbfebdc, env=0xbfbfebf0) at perlmain.c:103
    (gdb) up
    #1  0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
    2184        Move(s, d, i+1, char); /* include the NUL */
    (gdb) p s
    $1 = 0x819f19b "\006Á×èñ5¸|\rwÉ·\201û\215\224´R"¢ÚFécwäk\022|\021\217¯ÉåkºK¾vî;*\2228ù\222\224>¾À·\205|"Úáy¶µ3\vy\bºyÚÍqÿ\202þî\212þú\005ýo¡ÄÄbÝeÈþ«ðûª\rìåx"
    (gdb) p d
    $2 = 0x819f0c7 "\004\024"
    (gdb) p i
    $3 = -2

That Move will expand to a call to memmove(d, s, i+1) - clearly a length of -1 is bogus.
Nicholas Clark
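To make the "length of -1 is bogus" point concrete, here is a minimal C sketch of what happens when a negative signed count reaches a size_t parameter such as memmove's. The helper names (as_memmove_len, guarded_move) are hypothetical and are not part of Perl's source; the guard only illustrates the kind of check that would have turned the crash into a refusal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Unchecked conversion: a negative signed count wraps modulo SIZE_MAX + 1,
 * so -1 becomes the largest value a size_t can hold. */
size_t as_memmove_len(ptrdiff_t count)
{
    return (size_t)count;
}

/* Hypothetical guarded move (illustrative only, not Perl's Move macro):
 * refuses a negative count instead of handing it to memmove().
 * Returns 1 if the bytes were moved, 0 if the count was rejected. */
int guarded_move(char *dst, const char *src, ptrdiff_t count)
{
    assert(dst != NULL && src != NULL);
    if (count < 0)
        return 0;
    memmove(dst, src, (size_t)count);
    return 1;
}
```

With i == -2 as in the gdb session, i+1 is -1, which memmove sees as SIZE_MAX bytes; that would explain a fault inside memmove itself rather than in the caller.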
On Mon, 5 Dec 2005 12:41:18 +0000, Nicholas Clark <nick@ccl4.org> wrote
> I can't get it to reliably crash with that minimal input on FreeBSD. With the original file, I get:
>
>     #1 0x080d8016 in Perl_pp_subst () at pp_hot.c:2184
>     2184 Move(s, d, i+1, char); /* include the NUL */
>
> That Move will expand to a call to memmove(d, s, i+1) - clearly a length of -1 is bogus.
utf8n_to_uvchr returns 0 for the malformed UTF-8. (This behavior is documented for utf8n_to_uvuni.)

Therefore regexec.c:S_reginclass (falsely?) matches [\000] against malformed UTF-8. (I have not looked into why that behavior of reginclass then made i negative (i = strend - s) in pp_subst...)

    #!perl
    no warnings 'utf8';
    $_ = pack('U0C2', 0xa2, 0xf8); # malformed UTF-8
    print s/[\0]+//g ? "match" : "not match", "\n"; # prints "match"

Perhaps it would be better for reginclass() to croak on malformed UTF-8. If so, I think UTF8_ALLOW_ANYUV is preferable to UTF8_ALLOW_ANY, since UTF8_ALLOW_ANY allows all kinds of malformed UTF-8.

Regards, SADAHIRO Tomoyuki

But the error message is not good, since "Malformed UTF-8 character" is marked with (W utf8) but not (F)...
Oops, I am replying to my own previous mail.

> utf8n_to_uvchr returns 0 for the malformed UTF-8. (This behavior is documented for utf8n_to_uvuni.)
>
> Therefore regexec.c:S_reginclass (falsely?) matches [\000] against malformed UTF-8.

    #!perl
    ###no warnings 'utf8';   # This line makes the example inappropriate.

Because UTF8_ALLOW_ANY allows all kinds of malformed UTF-8, utf8n_to_uvchr (etc.) does not always return 0 for malformed UTF-8; it may return some non-zero value.

    $_ = pack('U0C2', 0xa2, 0xf8); # malformed UTF-8
    print s/[\0]+//g ? "match" : "not match", "\n"; # prints "match"

Regards, SADAHIRO Tomoyuki
SADAHIRO Tomoyuki wrote:
> utf8n_to_uvchr returns 0 for the malformed utf-8. (this behavior is documented for utf8n_to_uvuni.)
>
> Therefore regexec.c:S_reginclass (falsely?) matches [\000] with malformed utf-8.
>
> Perhaps it may be better reginclass() croaks malformed utf-8. If so, I think UTF8_ALLOW_ANYUV is preferable to UTF8_ALLOW_ANY, since UTF8_ALLOW_ANY allows all kinds of malformed utf8.

OK, I've applied this as #26258, thanks.

> But the error message is not good, since "Malformed UTF-8 character" is marked with (W utf8) but not (F)...

It should even be (F) (S utf8) since it's enabled by default. I'll fix.
    --- regexec.c~ Wed Nov 30 23:24:19 2005
    +++ regexec.c Tue Dec 06 00:14:56 2005
    @@ -4710,9 +4710,13 @@
         STRLEN len = 0;
         STRLEN plen;
     
    -    if (do_utf8 && !UTF8_IS_INVARIANT(c))
    -        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
    -                           ckWARN(WARN_UTF8) ? 0 : UTF8_ALLOW_ANY);
    +    if (do_utf8 && !UTF8_IS_INVARIANT(c)) {
    +        c = utf8n_to_uvchr(p, UTF8_MAXBYTES, &len,
    +                           ckWARN(WARN_UTF8) ? UTF8_CHECK_ONLY :
    +                                   UTF8_ALLOW_ANYUV|UTF8_CHECK_ONLY);
    +        if (len == (STRLEN)-1)
    +            Perl_croak(aTHX_ "Malformed UTF-8 character (fatal)");
    +    }
     
         plen = lenp ? *lenp : UNISKIP(NATIVE_TO_UNI(c));
         if (do_utf8 || (flags & ANYOF_UNICODE)) {
    ### END OF PATCH
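The patch relies on a length out-parameter of (STRLEN)-1 to tell "malformed" apart from a genuine code point 0. The following self-contained C sketch illustrates that pattern with a deliberately simplified decoder. decode_utf8, DECODE_ERROR and the two class tests are hypothetical names, not Perl's API, and the decoder ignores overlong forms, surrogates and the 5- and 6-byte sequences that Perl's extended UTF-8 accepts.

```c
#include <assert.h>
#include <stddef.h>

#define DECODE_ERROR ((size_t)-1)

/* Simplified, hypothetical decoder (not utf8n_to_uvchr). Returns the code
 * point and stores the sequence length in *len. On malformed input it
 * returns 0 and sets *len to DECODE_ERROR, so the return value alone
 * cannot distinguish an error from U+0000. */
unsigned long decode_utf8(const unsigned char *p, size_t avail, size_t *len)
{
    unsigned long cp;
    size_t need, i;

    assert(p != NULL && len != NULL);
    if (avail == 0) { *len = DECODE_ERROR; return 0; }

    if (p[0] < 0x80) { *len = 1; return p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { need = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { need = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { need = 4; cp = p[0] & 0x07; }
    else { *len = DECODE_ERROR; return 0; }  /* stray continuation or unsupported lead byte */

    if (need > avail) { *len = DECODE_ERROR; return 0; }
    for (i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) { *len = DECODE_ERROR; return 0; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *len = need;
    return cp;
}

/* Naive test for the class [\0]: looks only at the return value, so
 * malformed input is indistinguishable from NUL. */
int naive_is_nul(const unsigned char *p, size_t avail)
{
    size_t len;
    return decode_utf8(p, avail, &len) == 0;
}

/* Checked test: consults the length sentinel first, so malformed input
 * never matches. */
int checked_is_nul(const unsigned char *p, size_t avail)
{
    size_t len;
    unsigned long cp = decode_utf8(p, avail, &len);
    return len != DECODE_ERROR && cp == 0;
}
```

What the caller does once it sees the sentinel is a separate policy choice: the patch croaks, while this sketch simply reports no match.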
@rgs - Status changed from 'open' to 'resolved'
> The code can be reduced to the following (reading from a var rather than a file as in the OP's code):
>
>     my $s = "\xa2\xf8";
>     open F, "<:utf8", \$s;
>     while (<F>) {
>         s/[\000]+//g; # Causes a SEGV
>     }
[perl #37836] has been resolved by change 26258. What follows is a further investigation.

In the above perl script, reginclass() was called on the string first from find_byclass() and then from regrepeat(). Each time, reginclass() matched the malformed UTF-8 against NUL.

UTF8SKIP at "\xa2" is 1 and UTF8SKIP at "\xf8" is 5 (see utf8.h).

In find_byclass(), "\xa2" passed the test (s + uskip <= strend). As reginclass() and regtry() returned true values, the goto statement exited the while loop.

In regrepeat(), which was called indirectly from regtry(), the pointer "scan" was advanced by UTF8SKIP twice and ended up 4 octets past the end, "loceol".
//// regexec.c#find_byclass

    case ANYOF:
        if (do_utf8) {
            while (s + (uskip = UTF8SKIP(s)) <= strend) {
                if ((ANYOF_FLAGS(c) & ANYOF_UNICODE) ||
                    !UTF8_IS_INVARIANT((U8)s[0]) ?
                    reginclass(c, (U8*)s, 0, do_utf8) :
                    REGINCLASS(c, (U8*)s)) {
                    if (tmp && (norun || regtry(prog, s)))
                        goto got_it;

//// regexec.c#regrepeat

    case ANYOF:
        if (do_utf8) {
            loceol = PL_regeol;
            while (hardcount < max && scan < loceol &&
                   reginclass(p, (U8*)scan, 0, do_utf8)) {
                scan += UTF8SKIP(scan);
                hardcount++;
            }

Finally, pp_subst() tried to Move() a huge chunk.

//// pp_hot.c#pp_subst

    /* can do inplace substitution? */
    if (c
        ....
        && (!doutf8 || SvUTF8(TARG))) {
        ....
        else { //// this else corresponds to "if (once)"
            ....
                s = rx->endp[0] + orig; //// * here rx->endp[0] == 6
            } while (CALLREGEXEC(aTHX_ rx, s, strend, orig, s == m,
                                 ....
            if (s != d) {
                i = strend - s; //// * here i == -4
                SvCUR_set(TARG, d - SvPVX_const(TARG) + i);
                Move(s, d, i+1, char); /* include the NUL */
            }
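The arithmetic behind rx->endp[0] == 6 and i == -4 can be reproduced in isolation. The sketch below assumes the subject string is exactly the two octets "\xa2\xf8" and uses a simplified stand-in for UTF8SKIP (skip_from_first_octet, a hypothetical name). Only the values 1 for 0xa2 and 5 for 0xf8 are taken from the message above; the remaining ranges follow the usual lead-byte layout and are an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by the first octet alone, in the spirit of
 * UTF8SKIP. Simplified stand-in, not Perl's lookup table. */
size_t skip_from_first_octet(unsigned char c)
{
    if (c < 0xC0) return 1;  /* ASCII and continuation bytes */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    return 6;                /* simplification for 0xFC..0xFF */
}

/* Advance up to `count` characters from offset 0, trusting only the first
 * octet of each, as the regrepeat loop does. The loop reads buf[off] only
 * while off < buflen, but the returned offset may exceed buflen. */
size_t unchecked_advance(const unsigned char *buf, size_t buflen, size_t count)
{
    size_t off = 0, n;

    assert(buf != NULL);
    for (n = 0; n < count && off < buflen; n++)
        off += skip_from_first_octet(buf[off]);
    return off;
}

/* Signed distance from the current offset to the end of the buffer,
 * mirroring i = strend - s in pp_subst. */
ptrdiff_t remaining(size_t buflen, size_t off)
{
    return (ptrdiff_t)buflen - (ptrdiff_t)off;
}
```

For the two-octet input, the first step moves to offset 1 and the second adds 5, giving offset 6 in a buffer of length 2, so the remaining length is 2 - 6 = -4, which matches the annotated values.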
Conclusions:

- UTF8SKIP is certainly fast (since it reads only the first octet) but harmful against malformed UTF-8.
- Just returning FALSE for malformed UTF-8, instead of croaking, could avoid the SEGV.

Since change 26258, reginclass() croaks on malformed UTF-8. If reginclass() just returned FALSE instead, the operation could continue, though I don't know whether operations on malformed UTF-8 should continue.

Regards, SADAHIRO Tomoyuki
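As an illustration of the "return FALSE instead of overrunning" alternative, here is a hedged C sketch of a scan loop that checks the announced sequence length against the bytes actually left before advancing. bounded_scan and announced_len are hypothetical names, not functions in regexec.c. The check guards only the length; it does not validate continuation bytes, so a lone 0xa2 still counts as one character here.

```c
#include <assert.h>
#include <stddef.h>

/* Sequence length implied by the first octet (simplified stand-in for
 * UTF8SKIP; the exact table is an assumption). */
static size_t announced_len(unsigned char c)
{
    if (c < 0xC0) return 1;
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    return 6;
}

/* Step over at most `max` characters without ever passing the end of the
 * buffer. A sequence that would run past the end is treated as "no match"
 * and stops the scan. Returns the number of characters consumed and stores
 * the final offset, which is always <= buflen, in *end_off. */
size_t bounded_scan(const unsigned char *buf, size_t buflen, size_t max,
                    size_t *end_off)
{
    size_t off = 0, n = 0;

    assert(buf != NULL && end_off != NULL);
    while (n < max && off < buflen) {
        size_t skip = announced_len(buf[off]);
        if (skip > buflen - off)  /* truncated sequence: stop here */
            break;
        off += skip;
        n++;
    }
    *end_off = off;
    return n;
}
```

On the two-octet input "\xa2\xf8" this consumes one character and stops at offset 1, so a later strend - s style subtraction cannot go negative. Whether silently stopping is preferable to croaking is the open question raised above.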
@khwilliamson - Status changed from 'resolved' to 'open'
This bug may be fixed, but the test for it in t/re/pat_rt_report.t is wrong, and works only because the single character ANYOF node is being optimized into an EXACT node. If you add any other character to the class, the optimization goes away and the test fails. It does not cause a segmentation fault, but you do get the correct error message that this is malformed utf8.
I'm not sure how to fix this without further research.
--Karl Williamson
The bug had been fixed, but the test was defective; that is now fixed by 76024109c948397787a27693df5382b9f8005058 -- Karl Williamson
@khwilliamson - Status changed from 'open' to 'pending release'
Thank you for filing this report. You have helped make Perl better.
With the release today of Perl 5.26.0, this and 210 other issues have been resolved.
Perl 5.26.0 may be downloaded via: https://metacpan.org/release/XSAWYERX/perl-5.26.0
If you find that the problem persists, feel free to reopen this ticket.
@khwilliamson - Status changed from 'pending release' to 'resolved'
Migrated from rt.perl.org#37836 (status was 'resolved')
Searchable as RT37836$