Closed p5pRT closed 9 years ago
I cloned the git repo on 01/02/2015 and built from source using the afl-gcc compiler:
CC=/path/to/afl-gcc ./Configure AFL_HARDEN=1 make
This is perl 5\, version 21\, subversion 8 (v5.21.8 (v5.21.7-209-g4e27940)) built for x86_64-linux
Besides the above information\, this version of Perl was compiled using all defaults (enter was pressed for every question).
While fuzzing the new perl binary\, I found a testcase that causes a segfault. I've attached the file for inclusion in the bug report. Please note that a file named = containing zero bytes was also created in a tmp directory around 1 minute 5 seconds before the testcase was written to disk. Not sure if this is related but I wanted to add the information just in case it was important. Please advise if this is a security issue and if you will handle the CVE request.
@@ VALGRIND OUTPUT @@
valgrind -q /home/geeknik/perl5/perl ./id\:000000\\,sig\:11\\,src\:002461+016504\\,op\:splice\\,rep\:32 ==54123== Use of uninitialised value of size 8 ==54123== at 0x651E15: S_regatom (regcomp.c:12632) ==54123== by 0x656B49: S_regpiece (regcomp.c:10821) ==54123== by 0x65AB0D: S_regbranch (regcomp.c:10747) ==54123== by 0x66C9A5: S_reg.constprop.9 (regcomp.c:10497) ==54123== by 0x690D3B: Perl_re_op_compile (regcomp.c:6697) ==54123== by 0x4C1222: Perl_pmruntime (op.c:5629) ==54123== by 0x5CBEBC: Perl_yyparse (perly.y:1000) ==54123== by 0x4F3114: perl_parse (perl.c:2273) ==54123== by 0x42A92B: main (perlmain.c:114) ==54123== ==54123== Invalid write of size 1 ==54123== at 0x64C878: S_regatom (regcomp.c:12484) ==54123== by 0x656B49: S_regpiece (regcomp.c:10821) ==54123== by 0x65AB0D: S_regbranch (regcomp.c:10747) ==54123== by 0x66C9A5: S_reg.constprop.9 (regcomp.c:10497) ==54123== by 0x693D3B: Perl_re_op_compile (regcomp.c:6895) ==54123== by 0x4C1222: Perl_pmruntime (op.c:5629) ==54123== by 0x5CBEBC: Perl_yyparse (perly.y:1000) ==54123== by 0x4F3114: perl_parse (perl.c:2273) ==54123== by 0x42A92B: main (perlmain.c:114) ==54123== Address 0x5eea96c is 0 bytes after a block of size 3\,500 alloc'd ==54123== at 0x4C28BED: malloc (vg_replace_malloc.c:263) ==54123== by 0x6CEFBC: Perl_safesysmalloc (util.c:144) ==54123== by 0x692C7A: Perl_re_op_compile (regcomp.c:6746) ==54123== by 0x4C1222: Perl_pmruntime (op.c:5629) ==54123== by 0x5CBEBC: Perl_yyparse (perly.y:1000) ==54123== by 0x4F3114: perl_parse (perl.c:2273) ==54123== by 0x42A92B: main (perlmain.c:114) ==54123== syntax error at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 11\, near "printr)" Unknown regexp modifier "/b" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/G" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/e" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/r" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line panic: reg_node overrun trying to emit 0\, 5eea97c>=5eea96c at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 20.
@@ GDB OUTPUT @@
gdb /home/geeknik/perl5/perl core GNU gdb (GDB) 7.4.1-debian Copyright (C) 2012 Free Software Foundation\, Inc. License GPLv3+: GNU GPL version 3 or later \<http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it. There is NO WARRANTY\, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions\, please see: \<http://www.gnu.org/software/gdb/bugs/>... Reading symbols from /home/geeknik/perl5/perl...done. [New LWP 12797]
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/geeknik/perl5/perl
./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32'.
Program terminated with signal 11\, Segmentation fault.
#0 0x00007f1223db11f0 in _int_free (av=0x7f12240c0e40\, p=0x238f260) at
malloc.c:5002
5002 malloc.c: No such file or directory.
(gdb) bt
#0 0x00007f1223db11f0 in _int_free (av=0x7f12240c0e40\, p=0x238f260) at
malloc.c:5002
#1 0x00007f1223db47bc in *__GI___libc_free (mem=\
geeknik@deb7fuzz:\~/perl5/utils$ ./perlbug -d
Flags: category=core severity=low
This perlbug was built using Perl 5.21.8 - Fri Jan 2 19:02:59 CST 2015 It is being executed now by Perl 5.21.7 - Thu Dec 18 14:34:01 CST 2014.
Site configuration information for perl 5.21.7:
Configured by geeknik at Thu Dec 18 14:34:01 CST 2014.
Summary of my perl5 (revision 5 version 21 subversion 7) configuration: Commit id: e9d2bd8a490981edfc4ddabb5889ca0e86f29e29 Platform: osname=linux\, osvers=3.2.0-4-amd64\, archname=x86_64-linux uname='linux deb7fuzz 3.2.0-4-amd64 #1 smp debian 3.2.63-2+deb7u2 x86_64 gnulinux ' config_args='' hint=recommended\, useposix=true\, d_sigaction=define useithreads=undef\, usemultiplicity=undef use64bitint=define\, use64bitall=define\, uselongdouble=undef usemymalloc=n\, bincompat5005=undef Compiler: cc='/home/geeknik/afl/afl-gcc'\, ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2'\, optimize='-O2'\, cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion=''\, gccversion='4.7.2'\, gccosandvers='' intsize=4\, longsize=8\, ptrsize=8\, doublesize=8\, byteorder=12345678\, doublekind=3 d_longlong=define\, longlongsize=8\, d_longdbl=define\, longdblsize=16\, longdblkind=3 ivtype='long'\, ivsize=8\, nvtype='double'\, nvsize=8\, Off_t='off_t'\, lseeksize=8 alignbytes=8\, prototype=define Linker and Libraries: ld='/home/geeknik/afl/afl-gcc'\, ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-linux-gnu/4.7/include-fixed /usr/include/x86_64-linux-gnu /usr/lib /lib/x86_64-linux-gnu /lib/../lib /usr/lib/x86_64-linux-gnu /usr/lib/../lib /lib libs=-lnsl -ldl -lm -lcrypt -lutil -lc -lpthread perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc -lpthread libc=libc-2.13.so\, so=so\, useshrplib=false\, libperl=libperl.a gnulibc_version='2.13' Dynamic Linking: dlsrc=dl_dlopen.xs\, dlext=so\, d_dlsymun=undef\, ccdlflags='-Wl\,-E' cccdlflags='-fPIC'\, lddlflags='-shared -O2 -L/usr/local/lib -fstack-protector'
@INC for perl 5.21.7: /usr/local/lib/perl5/site_perl/5.21.7/x86_64-linux /usr/local/lib/perl5/site_perl/5.21.7 /usr/local/lib/perl5/5.21.7/x86_64-linux /usr/local/lib/perl5/5.21.7 .
Environment for perl 5.21.7: HOME=/home/geeknik LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games PERL_BADLANG (unset) SHELL=/bin/bash
On Sat Jan 03 11:49:13 2015\, brian.carpenter@gmail.com wrote:
I cloned the git repo on 01/02/2015 and built from source using the afl-gcc compiler:
CC=/path/to/afl-gcc ./Configure AFL_HARDEN=1 make
Can you describe how one obtains the afl-gcc compiler?
This is perl 5\, version 21\, subversion 8 (v5.21.8 (v5.21.7-209-g4e27940)) built for x86_64-linux
Besides the above information\, this version of Perl was compiled using all defaults (enter was pressed for every question).
While fuzzing the new perl binary\, I found a testcase that causes a segfault. I've attached the file for inclusion in the bug report. Please note that a file named = containing zero bytes was also created in a tmp directory around 1 minute 5 seconds before the testcase was written to disk. Not sure if this is related but I wanted to add the information just in case it was important. Please advise if this is a security issue and if you will handle the CVE request.
I'm a bit puzzled about the test case\, since\, insofar as it is intended to be a perl program\, it fails to compile. It reports a syntax error near 'printr)'. Can you clarify?
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
The RT System itself - Status changed from 'new' to 'open'
Am 03.01.2015 um 22:54 schrieb James E Keenan via RT:
On Sat Jan 03 11:49:13 2015\, brian.carpenter@gmail.com wrote:
I cloned the git repo on 01/02/2015 and built from source using the afl-gcc compiler:
CC=/path/to/afl-gcc ./Configure AFL_HARDEN=1 make
Can you describe how one obtains the afl-gcc compiler?
http://lcamtuf.coredump.cx/afl/
This is perl 5\, version 21\, subversion 8 (v5.21.8 (v5.21.7-209-g4e27940)) built for x86_64-linux
Besides the above information\, this version of Perl was compiled using all defaults (enter was pressed for every question).
While fuzzing the new perl binary\, I found a testcase that causes a segfault. I've attached the file for inclusion in the bug report. Please note that a file named = containing zero bytes was also created in a tmp directory around 1 minute 5 seconds before the testcase was written to disk. Not sure if this is related but I wanted to add the information just in case it was important. Please advise if this is a security issue and if you will handle the CVE request.
I'm a bit puzzled about the test case\, since\, insofar as it is intended to be a perl program\, it fails to compile. It reports a syntax error near 'printr)'. Can you clarify?
As the original bug report says:
syntax error at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 11\, near "printr)" Unknown regexp modifier "/b" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/G" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/e" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line Unknown regexp modifier "/r" at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 17\, at end of line panic: reg_node overrun trying to emit 0\, 5eea97c>=5eea96c at ./id:000000\,sig:11\,src:002461+016504\,op:splice\,rep:32 line 20.
Perl panics/crashes while trying to handle a syntax error\, apparently.
-- Lukas Mai \plokinom@​gmail\.com
The original test case that I fed to AFL was this:
#!/usr/local/bin/perl print "Content-type: text/html\n\n"; print "Hello World.\n"; print "Heres the form info:\
\n";
my($buffer);
my(@pairs);
my($pair);
read(STDIN\,$buffer\,$ENV{'CONTENT_LENGTH'});
@pairs = split(/&/\, $buffer);
foreach $pair (@pairs)
{
print "$pair\
\n"
}
print "\
Note that further parsing is\n"; print "necessary to turn the plus signs\n"; print "into spaces and get rid of some\n"; print "other web encoding.\n";
After about 25 million iterations\, that script was turned into the test case that I submitted. Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
The afl-gcc is available as part of American Fuzzy Lop which you can obtain from here: http://lcamtuf.coredump.cx/afl/
On Sat Jan 03 14:50:18 2015\, brian.carpenter@gmail.com wrote:
Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
Yes\, it should. I donāt get a crash when I run the script. I do get a panic message\, though\, in bleadperl and in 5.20.1. 5.18.3 just shows ānormalā errors\, no panics. Iām going to run a bisect.
--
Father Chrysostomos
On Sat Jan 03 17:35:31 2015\, sprout wrote:
On Sat Jan 03 14:50:18 2015\, brian.carpenter@gmail.com wrote:
Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
Yes\, it should. I donāt get a crash when I run the script. I do get a panic message\, though\, in bleadperl and in 5.20.1. 5.18.3 just shows ānormalā errors\, no panics. Iām going to run a bisect.
31f05a37c4e9c37a7263491f2fc0237d836e1a80 is the first bad commit commit 31f05a37c4e9c37a7263491f2fc0237d836e1a80 Author: Karl Williamson \public@​khwilliamson\.com Date: Mon Jan 27 15:35:00 2014 -0700
Work properly under UTF-8 LC_CTYPE locales
This large (sorry\, I couldn't figure out how to meaningfully split it
up) commit causes Perl to fully support LC_CTYPE operations (case
changing\, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
--
Father Chrysostomos
On Sat Jan 03 18:00:41 2015\, sprout wrote:
On Sat Jan 03 17:35:31 2015\, sprout wrote:
On Sat Jan 03 14:50:18 2015\, brian.carpenter@gmail.com wrote:
Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
Yes\, it should. I donāt get a crash when I run the script. I do get a panic message\, though\, in bleadperl and in 5.20.1. 5.18.3 just shows ānormalā errors\, no panics. Iām going to run a bisect.
31f05a37c4e9c37a7263491f2fc0237d836e1a80 is the first bad commit commit 31f05a37c4e9c37a7263491f2fc0237d836e1a80 Author: Karl Williamson \public@​khwilliamson\.com Date: Mon Jan 27 15:35:00 2014 -0700
Work properly under UTF\-8 LC\_CTYPE locales This large \(sorry\, I couldn't figure out how to meaningfully split it up\) commit causes Perl to fully support LC\_CTYPE operations \(case changing\, character classification\) in UTF\-8 locales\. As a side effect it resolves \[perl \#56820\]\.
Reduced case. This appears to have nothing to do with the syntax error.
/TffffffffffffTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT5TT TTTTTTTTTTTTTTTTTTTTTTT3TTgTTTTTTTTTTTTTTTTTTTTT2TTTTTTTTTTTTTTTTTTTTTT THHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHiHHHHHH Hffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffffffffffffffffff fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffff&ffff/il
The output I get is:
panic: reg_node overrun trying to emit 0\, 7fcdf140cca8>=7fcdf140cca4 at /Users/sprout/Downloads/crash line 7.
With some variations of the above\, I got malloc errors.
--
Father Chrysostomos
On 01/03/2015 08:39 PM\, Father Chrysostomos via RT wrote:
On Sat Jan 03 18:00:41 2015\, sprout wrote:
On Sat Jan 03 17:35:31 2015\, sprout wrote:
On Sat Jan 03 14:50:18 2015\, brian.carpenter@gmail.com wrote:
Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
Yes\, it should. I donāt get a crash when I run the script. I do get a panic message\, though\, in bleadperl and in 5.20.1. 5.18.3 just shows ānormalā errors\, no panics. Iām going to run a bisect.
31f05a37c4e9c37a7263491f2fc0237d836e1a80 is the first bad commit commit 31f05a37c4e9c37a7263491f2fc0237d836e1a80 Author: Karl Williamson \public@​khwilliamson\.com Date: Mon Jan 27 15:35:00 2014 -0700
Work properly under UTF\-8 LC\_CTYPE locales This large \(sorry\, I couldn't figure out how to meaningfully split it up\) commit causes Perl to fully support LC\_CTYPE operations \(case changing\, character classification\) in UTF\-8 locales\. As a side effect it resolves \[perl \#56820\]\.
Reduced case. This appears to have nothing to do with the syntax error.
/TffffffffffffTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT5TT TTTTTTTTTTTTTTTTTTTTTTT3TTgTTTTTTTTTTTTTTTTTTTTT2TTTTTTTTTTTTTTTTTTTTTT THHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHiHHHHHH Hffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffffffffffffffffff fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffff&ffff/il
The output I get is:
panic: reg_node overrun trying to emit 0\, 7fcdf140cca8>=7fcdf140cca4 at /Users/sprout/Downloads/crash line 7.
With some variations of the above\, I got malloc errors.
I didn't get these errors\, but I did see some valgrind issues\, which the attached patch resolves. Please try it out to see if it fixes your problems.
On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote:
On 01/03/2015 08:39 PM\, Father Chrysostomos via RT wrote:
On Sat Jan 03 18:00:41 2015\, sprout wrote:
On Sat Jan 03 17:35:31 2015\, sprout wrote:
On Sat Jan 03 14:50:18 2015\, brian.carpenter@gmail.com wrote:
Shouldn't perl fail gracefully instead of just crashing because of invalid input/characters/etc?
Yes\, it should. I donāt get a crash when I run the script. I do get a panic message\, though\, in bleadperl and in 5.20.1. 5.18.3 just shows ānormalā errors\, no panics. Iām going to run a bisect.
31f05a37c4e9c37a7263491f2fc0237d836e1a80 is the first bad commit commit 31f05a37c4e9c37a7263491f2fc0237d836e1a80 Author: Karl Williamson \public@​khwilliamson\.com Date: Mon Jan 27 15:35:00 2014 -0700
Work properly under UTF-8 LC_CTYPE locales
This large (sorry\, I couldn't figure out how to meaningfully split it up) commit causes Perl to fully support LC_CTYPE operations (case changing\, character classification) in UTF-8 locales.
As a side effect it resolves [perl #56820].
Reduced case. This appears to have nothing to do with the syntax error.
/TffffffffffffTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT5TT TTTTTTTTTTTTTTTTTTTTTTT3TTgTTTTTTTTTTTTTTTTTTTTT2TTTTTTTTTTTTTTTTTTTTTT THHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHiHHHHHH Hffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffffffffffffffffffffffffffffffffffffffffffffffff fffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff ffffffffffff&ffff/il
The output I get is:
panic: reg_node overrun trying to emit 0\, 7fcdf140cca8>=7fcdf140cca4 at /Users/sprout/Downloads/crash line 7.
With some variations of the above\, I got malloc errors.
I didn't get these errors\, but I did see some valgrind issues\, which the attached patch resolves. Please try it out to see if it fixes your problems.
Yes\, it works for me\, both with the reduced case and with the original script from this ticket. I have no idea how it works\, though.
--
Father Chrysostomos
On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote:
I didn't get these errors\, but I did see some valgrind issues\, which the attached patch resolves. Please try it out to see if it fixes your problems.
The patch works for me as well. With it I get normal syntax errors as shown by the attachment.
Thank you very much.
-- James E Keenan (jkeenan@cpan.org)
Bareword found where operator expected at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 21\, near "HHHHHHHHHHHHHHHHHHHH" (Might be a runaway multi-line // string starting on line 17) (Missing semicolon on previous line?) syntax error at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 11\, near "printr)" Unknown regexp modifier "/b" at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 17\, at end of line Unknown regexp modifier "/G" at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 17\, at end of line Unknown regexp modifier "/e" at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 17\, at end of line Unknown regexp modifier "/r" at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 17\, at end of line syntax error at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 21\, near "HHHHHHHHHHHHHHHHHHHH" Unrecognized character \x83; marked by \<-- HERE after HHHHHHHHHH\<-- HERE near column 22 at /home/jkeenan/learn/perl/p5p/noshebang-123539-crash line 21.
"Father Chrysostomos via RT" \perlbug\-followup@​perl\.org wrote: :On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote: :> I didn't get these errors\, but I did see some valgrind issues\, which :> the :> attached patch resolves. Please try it out to see if it fixes your :> problems. : :Yes\, it works for me\, both with the reduced case and with the original script from this ticket. I have no idea how it works\, though.
I don't understand the fix either\, but I was able to come up with a somewhat shorter test case:
% ./perl -we '/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbAAbbb/il' panic: reg_node overrun trying to emit 0\, 1720b50>=1720b50 at -e line 1. %
That's the first hit (len=130\, pos1=125\, pos2=126) from this search:
#!./perl my $lc = 'b'; my $uc = 'A'; # any of [AFHIJSTWY] for my $len (2 .. 408) { for my $pos1 (0 .. $len - 2) { for my $pos2 ($pos1 + 1 .. $len - 1) { my $s = $lc x $len; substr($s\, $_\, 1) = $uc for ($pos1\, $pos2); eval qq{/$s/il}; die "$len: $pos1\, $pos2" if $@; } } }
I also don't know what's special about [AFHIJSTWY]\, but I assume they relate somehow to the PROBLEMATIC_LOCALE_FOLD characters.
With Karl's suggested patch\, the search completes without seeing any problems.
Hugo
On 4 January 2015 at 14:20\, \hv@​crypt\.org wrote:
"Father Chrysostomos via RT" \perlbug\-followup@​perl\.org wrote: :On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote: :> I didn't get these errors\, but I did see some valgrind issues\, which :> the :> attached patch resolves. Please try it out to see if it fixes your :> problems. : :Yes\, it works for me\, both with the reduced case and with the original script from this ticket. I have no idea how it works\, though.
I don't understand the fix either\,
Just to explain\, I am pretty sure the key part of the patch is:
- const STRLEN unilen = reguni(pRExC_state\, ender\, s); + const STRLEN unilen = (SIZE_ONLY && ! FOLD) + ? UNISKIP(ender) + : (uvchr_to_utf8((U8*)s\, ender) - (U8*)s);
The regex engine does two compile passes over the pattern (the first is sometimes repeated so sometimes it is actually three passes)\, the first pass is a sizing pass\, which calculates/estimates an upper limit on the size of the compiled regexp program that will result from the pattern. Once this has been calculated the caller allocates the buffer the program will be written into and then recompiles. During the first pass the SIZE_ONLY flag is set\, and this flag is used in all of the compiler logic to decide what to do.
Prior to Karl's patch it looks like we would try to write to the non-existent program buffer during the first path. With his patch it does the right thing on the first patch.
Hugo: bear in mind that the logic in study_chunk() fires AFTER the final write pass. IOW\, this is all happening well before the optimizer fires. And indeed the multi-pass nature is exactly what I want to change (someday).
cheers\, yves
demerphq \demerphq@​gmail\.com wrote: :On 4 January 2015 at 14:20\, \hv@​crypt\.org wrote: :> "Father Chrysostomos via RT" \perlbug\-followup@​perl\.org wrote: :> :On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote: :> :> I didn't get these errors\, but I did see some valgrind issues\, which :> :> the :> :> attached patch resolves. Please try it out to see if it fixes your :> :> problems. :> : :> :Yes\, it works for me\, both with the reduced case and with the original script from this ticket. I have no idea how it works\, though. :> :> I don't understand the fix either\, : :Just to explain\, I am pretty sure the key part of the patch is: : :- const STRLEN unilen = reguni(pRExC_state\, ender\, s); :+ const STRLEN unilen = (SIZE_ONLY && ! FOLD) :+ ? UNISKIP(ender) :+ : :(uvchr_to_utf8((U8*)s\, ender) - (U8*)s); : :The regex engine does two compile passes over the pattern (the first :is sometimes repeated so sometimes it is actually three passes)\, the :first pass is a sizing pass\, which calculates/estimates an upper limit :on the size of the compiled regexp program that will result from the :pattern. Once this has been calculated the caller allocates the buffer :the program will be written into and then recompiles. During the first :pass the SIZE_ONLY flag is set\, and this flag is used in all of the :compiler logic to decide what to do. : :Prior to Karl's patch it looks like we would try to write to the :non-existent program buffer during the first path. With his patch it :does the right thing on the first patch.
It's the other way round isn't it? reguni() will just return the size when SIZE_ONLY\, but write to s otherwise; the patch makes it also write to s when SIZE_ONLY && !FOLD.
Looking further I see this\, way further up: /* In pass1\, folded\, we use a temporary buffer instead of the * actual node\, as the node doesn't exist yet */ s = (SIZE_ONLY && FOLD) ? foldbuf : STRING(ret); .. at which point it starts to make a little more sense.
Hugo
On 5 January 2015 at 17:28\, \hv@​crypt\.org wrote:
demerphq \demerphq@​gmail\.com wrote: :On 4 January 2015 at 14:20\, \hv@​crypt\.org wrote: :> "Father Chrysostomos via RT" \perlbug\-followup@​perl\.org wrote: :> :On Sat Jan 03 21:33:22 2015\, public@khwilliamson.com wrote: :> :> I didn't get these errors\, but I did see some valgrind issues\, which :> :> the :> :> attached patch resolves. Please try it out to see if it fixes your :> :> problems. :> : :> :Yes\, it works for me\, both with the reduced case and with the original script from this ticket. I have no idea how it works\, though. :> :> I don't understand the fix either\, : :Just to explain\, I am pretty sure the key part of the patch is: : :- const STRLEN unilen = reguni(pRExC_state\, ender\, s); :+ const STRLEN unilen = (SIZE_ONLY && ! FOLD) :+ ? UNISKIP(ender) :+ : :(uvchr_to_utf8((U8*)s\, ender) - (U8*)s); : :The regex engine does two compile passes over the pattern (the first :is sometimes repeated so sometimes it is actually three passes)\, the :first pass is a sizing pass\, which calculates/estimates an upper limit :on the size of the compiled regexp program that will result from the :pattern. Once this has been calculated the caller allocates the buffer :the program will be written into and then recompiles. During the first :pass the SIZE_ONLY flag is set\, and this flag is used in all of the :compiler logic to decide what to do. : :Prior to Karl's patch it looks like we would try to write to the :non-existent program buffer during the first path. With his patch it :does the right thing on the first patch.
It's the other way round isn't it? reguni() will just return the size when SIZE_ONLY\, but write to s otherwise;
Yes\, I missed that.
the patch makes it also write to s when SIZE_ONLY && !FOLD.
Looking further I see this\, way further up: /* In pass1\, folded\, we use a temporary buffer instead of the * actual node\, as the node doesn't exist yet */ s = (SIZE_ONLY && FOLD) ? foldbuf : STRING(ret); .. at which point it starts to make a little more sense.
Indeed. Sorry for the misdirection.
Cheers\, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
Fixed in blead by 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873 -- Karl Williamson
@khwilliamson - Status changed from 'open' to 'pending release'
On Tue Jan 06 14:11:00 2015\, khw wrote:
Fixed in blead by 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873
which is suitable for a maintenance release.
Here's the text of that commit message:
This is a minimal patch suitable for a maintenance release. It extracts the guts of reguni and REGC without the conditional they have. The next commit will do some refactoring to avoid branching and to make things clearer.
This bug is due to the current two pass structure of the Perl regular expression compiler. The first pass attempts to do just enough work to figure out how much space to malloc for the compiled pattern; and the 2nd pass actually fills in the details. One problem with this design is that in many cases quite a bit of work is required to figure out the size\, and this work is thrown away and redone in the second pass. Another problem is that it is easy to forget to do enough work in the sizing pass\, and that is what happened with the blamed commit. I understand that there are plans (God speed) to change the compiler design.
When not under /i matching\, the size of a node that will match a sequence of characters is just the number of bytes those characters take up. We have an easy way to calculate the number of bytes any code point will occupy in UTF-8\, and it's just 1 byte per code point for non-UTF-8. So in the sizing pass\, we don't actually have to figure out the representation of the characters. However under /i matching\, we do. First of all\, matching of UTF-8 strings is done by replacing each character of each string by its fold-case (function fc()) and then comparing. This is required by the nature of full Unicode matching which is not 1-1. If we do that replacement for the pattern at compile time\, we avoid having to do it over-and-over as pattern matching backtracks at execution. And because fc(x) may not occupy the same number of bytes as x\, and there is no easy way to know that size without actually doing the fc()\, we have to do the fold in the sizing pass. Now\, there are relatively few folds where sizeof(fc(x)) != sizeof(x)\, so we could construct an exception table for those few cases where it is\, and look up through that.
But there is another reason that we have to fold in the sizing pass. And that is because of the potential for multi-character folds being split across regnodes. The regular expression compiler generates EXACTish regnodes for matching sequences of characters exactly or via /i. The limit for how many bytes in a sequence such a node can match is 255 because the length is stored in a U8. If the pattern has a sequence longer than that\, it is split into two or more EXACTish nodes in a row. (Actually\, the compiler splits at a size much lower than that; I'm not sure why\, but then two adjoining nodes whose total sum length is at most 255 get joined later in the third\, optimizing pass.) Now consider\, matching the character U+FB03 LATIN SMALL LIGATURE FFI. It matches the sequence of the three characters "f f i". Because of the design of the regex pattern matching code\, if these characters are such that the first one or two are at the end of one EXACTish node\, and the final two or one are in another EXACTish node\, then U+FB03 wrongly would not match them. Matches can't cross node boundaries. If the pattern were tweaked so all three characters were in either the first or second node\, then the match would succeed. And that is what the compiler does. When it reaches the node's size limit\, and the final character is one that is a non-terminal character in a multi-char fold\, what's in the node is backed-off until it ends with a character without this characteristic. This has to be done in the sizing pass\, as we are repacking the nodes\, which can affect the size of the pattern\, and we have to know what the folds are in order to determine all this.
(We don't fold non-UTF-8 patterns. This is for two reasons. One is that one character\, the U+00B5 MICRO SIGN\, folds to above-Latin1\, and if we folded it\, we would have to change the pattern into UTF-8\, and that would slow everything down. I've thought about adding a regnode type for the much more common case of a sequence that doesn't have this character in it\, and which could hence be folded at compile time. But I've not been able to justify this because of the 2nd reason\, which is folds in this range are simple enough to be handled by an array lookup\, so folding is fast at runtime.)
Then there is the complication of matching under locale rules. This bug manifested itself only under /l matching. We can't fold at pattern compile time\, because the folding rules won't be known until runtime. This isn't a problem for non-UTF-8 locales\, as all folds are 1-1\, and so there never will be a multi-char fold. But there could be such folds in a UTF-8 locale\, so the regnodes have to be packed to work for that eventuality. The blamed commit did not do that\, and because this issue doesn't arise unless there is a string long enough to trigger the problem\, this wasn't found until now. What is needed\, and what this commit does\, is for the unfolded characters to be accumulated in both passes. The code that looks for potential multi-char fold issues handles both folded and unfolded-inputs\, so will work.
On Tue Jan 06 14:11:00 2015\, khw wrote:
Fixed in blead by 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873
which is suitable for a maintenance release.
Here's the text of that commit message:
This is a minimal patch suitable for a maintenance release. It extracts the guts of reguni and REGC without the conditional they have. The next commit will do some refactoring to avoid branching and to make things clearer.
This bug is due to the current two pass structure of the Perl regular expression compiler. The first pass attempts to do just enough work to figure out how much space to malloc for the compiled pattern; and the 2nd pass actually fills in the details. One problem with this design is that in many cases quite a bit of work is required to figure out the size\, and this work is thrown away and redone in the second pass. Another problem is that it is easy to forget to do enough work in the sizing pass\, and that is what happened with the blamed commit. I understand that there are plans (God speed) to change the compiler design.
When not under /i matching\, the size of a node that will match a sequence of characters is just the number of bytes those characters take up. We have an easy way to calculate the number of bytes any code point will occupy in UTF-8\, and it's just 1 byte per code point for non-UTF-8. So in the sizing pass\, we don't actually have to figure out the representation of the characters. However under /i matching\, we do. First of all\, matching of UTF-8 strings is done by replacing each character of each string by its fold-case (function fc()) and then comparing. This is required by the nature of full Unicode matching which is not 1-1. If we do that replacement for the pattern at compile time\, we avoid having to do it over-and-over as pattern matching backtracks at execution. And because fc(x) may not occupy the same number of bytes as x\, and there is no easy way to know that size without actually doing the fc()\, we have to do the fold in the sizing pass. Now\, there are relatively few folds where sizeof(fc(x)) != sizeof(x)\, so we could construct an exception table for those few cases where it is\, and look up through that.
But there is another reason that we have to fold in the sizing pass. And that is because of the potential for multi-character folds being split across regnodes. The regular expression compiler generates EXACTish regnodes for matching sequences of characters exactly or via /i. The limit for how many bytes in a sequence such a node can match is 255 because the length is stored in a U8. If the pattern has a sequence longer than that\, it is split into two or more EXACTish nodes in a row. (Actually\, the compiler splits at a size much lower than that; I'm not sure why\, but then two adjoining nodes whose total sum length is at most 255 get joined later in the third\, optimizing pass.) Now consider\, matching the character U+FB03 LATIN SMALL LIGATURE FFI. It matches the sequence of the three characters "f f i". Because of the design of the regex pattern matching code\, if these characters are such that the first one or two are at the end of one EXACTish node\, and the final two or one are in another EXACTish node\, then U+FB03 wrongly would not match them. Matches can't cross node boundaries. If the pattern were tweaked so all three characters were in either the first or second node\, then the match would succeed. And that is what the compiler does. When it reaches the node's size limit\, and the final character is one that is a non-terminal character in a multi-char fold\, what's in the node is backed-off until it ends with a character without this characteristic. This has to be done in the sizing pass\, as we are repacking the nodes\, which can affect the size of the pattern\, and we have to know what the folds are in order to determine all this.
(We don't fold non-UTF-8 patterns. This is for two reasons. One is that one character\, the U+00B5 MICRO SIGN\, folds to above-Latin1\, and if we folded it\, we would have to change the pattern into UTF-8\, and that would slow everything down. I've thought about adding a regnode type for the much more common case of a sequence that doesn't have this character in it\, and which could hence be folded at compile time. But I've not been able to justify this because of the 2nd reason\, which is folds in this range are simple enough to be handled by an array lookup\, so folding is fast at runtime.)
Then there is the complication of matching under locale rules. This bug manifested itself only under /l matching. We can't fold at pattern compile time\, because the folding rules won't be known until runtime. This isn't a problem for non-UTF-8 locales\, as all folds are 1-1\, and so there never will be a multi-char fold. But there could be such folds in a UTF-8 locale\, so the regnodes have to be packed to work for that eventuality. The blamed commit did not do that\, and because this issue doesn't arise unless there is a string long enough to trigger the problem\, this wasn't found until now. What is needed\, and what this commit does\, is for the unfolded characters to be accumulated in both passes. The code that looks for potential multi-char fold issues handles both folded and unfolded-inputs\, so will work.
On 6 January 2015 at 23:16\, Karl Williamson via RT \perlbug\-comment@​perl\.org wrote:
On Tue Jan 06 14:11:00 2015\, khw wrote:
Fixed in blead by 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873
which is suitable for a maintenance release.
Here's the text of that commit message:
This is a minimal patch suitable for a maintenance release. It extracts the guts of reguni and REGC without the conditional they have. The next commit will do some refactoring to avoid branching and to make things clearer.
++
Nice explanation.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
"Karl Williamson via RT" \perlbug\-comment@​perl\.org wrote: :On Tue Jan 06 14:11:00 2015\, khw wrote: :> Fixed in blead by :> 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873 : :which is suitable for a maintenance release. : :Here's the text of that commit message: [snip]
Thanks Karl\, that's very clear.
So I assume we ended up splitting in pass 2 (or 3) resulting in a size growth that should have been anticipated in pass 1 but wasn't\, and this is why the symptom was a "panic: reg_node overrun". Is that right?
I think there'd be value in having a test for this that's more minimal than the original fuzz-generated example: I'm not sure if my shorter [1] case is shortest\, or best-suited\, but it'll be easier (particularly with some additional comments) for the next guy to make sense of if something in this area breaks again in the future.
Hugo
[1] % ./perl -we '/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbAAbbb/il' panic: reg_node overrun trying to emit 0\, 1720b50>=1720b50 at -e line 1. %
On 01/07/2015 02:04 AM\, hv@crypt.org wrote:
"Karl Williamson via RT" \perlbug\-comment@​perl\.org wrote: :On Tue Jan 06 14:11:00 2015\, khw wrote: :> Fixed in blead by :> 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873 : :which is suitable for a maintenance release. : :Here's the text of that commit message: [snip]
Thanks Karl\, that's very clear.
So I assume we ended up splitting in pass 2 (or 3) resulting in a size growth that should have been anticipated in pass 1 but wasn't\, and this is why the symptom was a "panic: reg_node overrun". Is that right?
The problem was that it was reading uninitialized memory in pass 1. That ended up giving a different result than when the memory was properly initialized in pass2\, so the nodes it was sized for were not correct. I don't known how this translated into the overrun.
I think there'd be value in having a test for this that's more minimal than the original fuzz-generated example: I'm not sure if my shorter [1] case is shortest\, or best-suited\, but it'll be easier (particularly with some additional comments) for the next guy to make sense of if something in this area breaks again in the future.
The reason I went with the test I did was that it had a bunch of f's in a row\, which I know off the top of my head are part of multi-char folds\, whereas yours didn't have any characters that I knew that about\, so I would have had to look it up. Probably the bug is triggered by just exceeding the node size limit\, depending on what's in memory at the time. But\, by being sure that it will have to execute the loop to back off all the f's put some extra stress on the code\, so I went with that. I thought the reference to the bug number in the .t was sufficient comment if someone needed to understand the test. I didn't want to really explain it there\, as it is rather complicated. (It took me a couple hours to write that commit message)
Patches welcome.
Hugo
[1] % ./perl -we '/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbAAbbb/il' panic: reg_node overrun trying to emit 0\, 1720b50>=1720b50 at -e line 1. %
Karl Williamson \public@​khwilliamson\.com wrote: :On 01/07/2015 02:04 AM\, hv@crypt.org wrote: :> "Karl Williamson via RT" \perlbug\-comment@​perl\.org wrote: :> :On Tue Jan 06 14:11:00 2015\, khw wrote: :> :> Fixed in blead by :> :> 405dffcb17b9cc9d0e5d7b41835b998ca7f1d873 :> : :> :which is suitable for a maintenance release. :> : :> :Here's the text of that commit message: :> [snip] :> :> Thanks Karl\, that's very clear. :> :> So I assume we ended up splitting in pass 2 (or 3) resulting in a size :> growth that should have been anticipated in pass 1 but wasn't\, and this :> is why the symptom was a "panic: reg_node overrun". Is that right? : :The problem was that it was reading uninitialized memory in pass 1.
Ah\, this is the bit I was missing; I understand now why reproducing it (and testing it) usefully is hard.
Hugo
Thanks for submitting this ticket
The issue should be resolved with the release today of Perl v5.22\, available at http://www.perl.org/get.html If you find that the problem persists\, feel free to reopen this ticket
-- Karl Williamson for the Perl 5 porters team
@khwilliamson - Status changed from 'pending release' to 'resolved'
Migrated from rt.perl.org#123539 (status was 'resolved')
Searchable as RT123539$