Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 554 forks source link

5.6.1 breaks something with /g and or \G regexps #3848

Closed p5pRT closed 20 years ago

p5pRT commented 23 years ago

Migrated from rt.perl.org#6796 (status was 'resolved')

Searchable as RT6796$

p5pRT commented 23 years ago

From rdw@imdb.com

This simple script demonstrates the problem :-

#! /usr/bin/perl -w

use strict;

my $input = "a{b}c{d}";

while ($input =~ /(.*?)\{(.*?)\}/csg) {   print "\$1 = '$1' \$2 = '$2'\n"; } print "Rest = '$1'\n" if $input =~ m/\G(.+)$/csg;

On perl 5.6.0 and perl 5.005_03 this produces this (correct) output...

================================================== $1 = 'a' $2 = 'b' $1 = 'c' $2 = 'd'

On perl 5.6.1 we get this...

================================================== $1 = 'a' $2 = 'b' Rest = 'c{d}'

Adding a '\G' to the start of the first regexp fixes the problem.

Here is a run of 'perl -Mre=debug ./test.pl'...

================================================== Freeing REx​: `\,' Compiling REx `(.*?)\{(.*?)\}' size 19 first at 5   1​: OPEN1(3)   3​: MINMOD(4)   4​: STAR(6)   5​: SANY(0)   6​: CLOSE1(8)   8​: EXACT \<{>(10)   10​: OPEN2(12)   12​: MINMOD(13)   13​: STAR(15)   14​: SANY(0)   15​: CLOSE2(17)   17​: EXACT \<}>(19)   19​: END(0) floating `{' at 0..2147483647 (checking floating) anchored(SBOL) implicit minlen 2 Compiling REx `\G(.+)$' size 9 first at 2   1​: GPOS(2)   2​: OPEN1(4)   4​: PLUS(6)   5​: SANY(0)   6​: CLOSE1(8)   8​: SEOL(9)   9​: END(0) floating `'$ at 1..2147483647 (checking floating) anchored(GPOS) GPOS minlen 1 Guessing start of match\, REx `(.*?)\{(.*?)\}' against `a{b}c{d}'... Found floating substr `{' at offset 1... Guessed​: match at offset 0 Matching REx `(.*?)\{(.*?)\}' against `a{b}c{d}'   Setting an EVAL scope\, savestack=11   0 \<> \<a{b}c{d}> | 1​: OPEN1   0 \<> \<a{b}c{d}> | 3​: MINMOD   0 \<> \<a{b}c{d}> | 4​: STAR   Setting an EVAL scope\, savestack=11   0 \<> \<a{b}c{d}> | 6​: CLOSE1   0 \<> \<a{b}c{d}> | 8​: EXACT \<{>   failed...   SANY can match 1 times out of 1...   1 \ \<{b}c{d}> | 6​: CLOSE1   1 \ \<{b}c{d}> | 8​: EXACT \<{>   2 \<a{> \<b}c{d}> | 10​: OPEN2   2 \<a{> \<b}c{d}> | 12​: MINMOD   2 \<a{> \<b}c{d}> | 13​: STAR   Setting an EVAL scope\, savestack=11   2 \<a{> \<b}c{d}> | 15​: CLOSE2   2 \<a{> \<b}c{d}> | 17​: EXACT \<}>   failed...   SANY can match 1 times out of 1...   3 \<a{b> \<}c{d}> | 15​: CLOSE2   3 \<a{b> \<}c{d}> | 17​: EXACT \<}>   4 \<a{b}> \<c{d}> | 19​: END Match successful! $1 = 'a' $2 = 'b' Guessing start of match\, REx `(.*?)\{(.*?)\}' against `c{d}'... Not at start... Match rejected by optimizer Matching REx `\G(.+)$' against `c{d}'   Setting an EVAL scope\, savestack=8   4 \<a{b}> \<c{d}> | 1​: GPOS   4 \<a{b}> \<c{d}> | 2​: OPEN1   4 \<a{b}> \<c{d}> | 4​: PLUS   SANY can match 4 times out of 32767...   Setting an EVAL scope\, savestack=8   8 \<a{b}c{d}> \<> | 6​: CLOSE1   8 \<a{b}c{d}> \<> | 8​: SEOL   8 \<a{b}c{d}> \<> | 9​: END Match successful! Rest = 'c{d}' Freeing REx​: `(.*?)\{(.*?)\}' Freeing REx​: `\G(.+)$'

Looks like a possible problem in the regexp optimizer?

Have fun\,

Rich Williams

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl v5.6.1: Configured by rdw at Mon Apr 9 14:43:34 BST 2001. Summary of my perl5 (revision 5.0 version 6 subversion 1) configuration: Platform: osname=linux, osvers=2.2.16-3smp, archname=i686-linux uname='linux rdw.imdb.com 2.2.16-3smp #1 smp mon jun 19 19:00:35 edt 2000 i686 unknown ' config_args='-de -Dprefix=/usr' hint=recommended, useposix=true, d_sigaction=define usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef useperlio=undef d_sfio=undef uselargefiles=define usesocks=undef use64bitint=undef use64bitall=undef uselongdouble=undef Compiler: cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-O2', cppflags='-fno-strict-aliasing -I/usr/local/include' ccversion='', gccversion='egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, usemymalloc=n, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt -lutil perllibs=-lnsl -ldl -lm -lc -lposix -lcrypt -lutil libc=/lib/libc-2.1.3.so, so=so, useshrplib=false, libperl=libperl.a Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic' cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib' Locally applied patches: @INC for perl v5.6.1: /usr/lib/perl5/5.6.1/i686-linux /usr/lib/perl5/5.6.1 /usr/lib/perl5/site_perl/5.6.1/i686-linux /usr/lib/perl5/site_perl/5.6.1 /usr/lib/perl5/site_perl/5.6.0 /usr/lib/perl5/site_perl/5.005 /usr/lib/perl5/site_perl . Environment for perl v5.6.1: HOME=/home/rdw LANG=C LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/rdw/bin:/usr/local/bin:/usr/kerberos/bin:/home/rdw/bin:/usr/local/bin:/bin:/usr/bin:/usr/bin/X11:/usr/local/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/sbin:/usr/X11R6/bin:/sbin:/usr/sbin:/usr/local/sbin PERL_BADLANG (unset) SHELL=/bin/tcsh ```
p5pRT commented 23 years ago

From @gsar

The bug seems to show up only when using /s in conjunction with /g and C\<.*?>\, so I'd say it's not a show-stopper.

The attached tests fail #1 and #3 with 5.6.1. Does 5.7.1 behave the same way?

I'd be grateful if someone with the tuits can build perl at various points between 5.6.0 and 5.6.1 that changed reg*.[hc] to see which patch introduced the problem. The patches are here​:

  http​://public.ActiveState.com/gsar/APC/5.6.1/diffs/

Thanks.

Sarathy gsar@​ActiveState.com

Inline Patch ```diff -----------------------------------8<----------------------------------- my($input, $i, $test); $i = 0; $input = "a{b}c{d}"; while ($input =~ /(.*?)\{(.*?)\}/csg) { print "# \$1 = '$1' \$2 = '$2'\n"; ++$i; } print "not " unless $i == 2; print "ok " . ++$test . "\n"; $i = 0; $input = "a{b}c{d}"; while ($input =~ /(.*?)\{(.*?)\}/cg) { print "# \$1 = '$1' \$2 = '$2'\n"; ++$i; } print "not " unless $i == 2; print "ok " . ++$test . "\n"; $i = 0; $input = "a{b}c{d}"; while ($input =~ /(.*?)\{(.*?)\}/sg) { print "# \$1 = '$1' \$2 = '$2'\n"; ++$i; } print "not " unless $i == 2; print "ok " . ++$test . "\n"; $i = 0; $input = "a{b}c{d}"; while ($input =~ /(.*?)\{(.*?)\}/g) { print "# \$1 = '$1' \$2 = '$2'\n"; ++$i; } print "not " unless $i == 2; print "ok " . ++$test . "\n"; $i = 0; $input = "a{b}c{d}"; while ($input =~ /(.+?)\{(.+?)\}/csg) { print "# \$1 = '$1' \$2 = '$2'\n"; ++$i; } print "not " unless $i == 2; print "ok " . ++$test . "\n";
p5pRT commented 23 years ago

From @gsar

Good work\, thanks.

And then lines 7950-7988 of 8156.gz (again\, just the patch to regexec.c\, and not the bit which deals with some unicode stuff)

Here's the actual change​:

  http​://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-09/msg01297.html

which got checked in as change#7115​:

  http​://public.ActiveState.com/gsar/APC/5.7.1/diffs/7115.gz

"Minor" optimization\, no doubt. ;-)

Sarathy gsar@​ActiveState.com

p5pRT commented 23 years ago

From @vanstyn

Here's a fix\, over 5.6.1. Not sure if this could be improved by having such patterns upgraded to ROPT_ANCH_GPOS | ROPT_IMPLICIT instead of ROPT_ANCH_SBOL | ROPT_IMPLICIT - I suspect the optimisation predates the existence of GPOS.

Hugo

Inline Patch ```diff --- regexec.c.old Thu Mar 22 05:05:02 2001 +++ regexec.c Wed Apr 11 11:23:11 2001 @@ -386,7 +386,8 @@ && !PL_multiline ) ); /* Check after \n? */ if (!ml_anch) { - if ( !(prog->reganch & ROPT_ANCH_GPOS) /* Checked by the caller */ + if ( !(prog->reganch & (ROPT_ANCH_GPOS /* Checked by the caller */ + | ROPT_IMPLICIT)) /* not a real BOL */ /* SvCUR is not set on references: SvRV and SvPVX overlap */ && sv && !SvROK(sv) && (strpos != strbeg)) { --- t/op/pat.t.old Thu Mar 15 15:25:20 2001 +++ t/op/pat.t Wed Apr 11 11:36:41 2001 @@ -4,7 +4,7 @@ # the format supported by op/regexp.t. If you want to add a test # that does fit that format, add it to op/re_tests, not here. -print "1..231\n"; +print "1..232\n"; BEGIN { chdir 't' if -d 't'; @@ -1125,6 +1125,15 @@ () = (my $str = "abc") =~ /(...)/; $str = "def"; print "not " if $1 ne "abc"; + print "ok $test\n"; + $test++; +} + +# bugid 20010410.006 - intuit_start failed with 'not at start' on second iter +{ + my @r; + push @r, $1, $2 while "a{b}c{d}" =~ /(.*?)\{(.*?)}/csg; + "a b c d" eq join " ", @r or print "not "; print "ok $test\n"; $test++; } ```
p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

I've managed to isolate this down to 2 patches\, although I think it's probably just the later one (but it won't patch unless the first patch is applied).

I took a clean 5.6.0 and just applied the lines 31-54 of 7262.gz (just the patch to regexec.c) ...

============================================================ Index​: perl/regexec.c

Inline Patch ```diff --- perl/regexec.c.~1~ Sat Dec 16 00:25:18 2000 +++ perl/regexec.c Sat Dec 16 00:25:18 2000 @@ -346,7 +346,9 @@ I32 slen; if ( !(prog->reganch & ROPT_ANCH_GPOS) /* Checked by the caller */ - && (sv && (strpos + SvCUR(sv) != strend)) ) { + /* SvCUR is not set on references: SvRV and SvPVX overlap */ + && sv && !SvROK(sv) + && (strpos + SvCUR(sv) != strend)) { DEBUG_r(PerlIO_printf(Perl_debug_log, "Not at start...\n")); goto fail; } @@ -638,7 +640,7 @@ try_at_start: /* Even in this situation we may use MBOL flag if strpos is offset wrt the start of the string. */ - if (ml_anch && sv + if (ml_anch && sv && !SvROK(sv) /* See prev comment on SvROK */ && (strpos + SvCUR(sv) != strend) && strpos[-1] != '\n' /* May be due to an implicit anchor of m{.*foo} */ && !(prog->reganch & ROPT_IMPLICIT)) ```

============================================================

And then lines 7950-7988 of 8156.gz (again\, just the patch to regexec.c\, and not the bit which deals with some unicode stuff)

============================================================ Index​: perl/regexec.c

Inline Patch ```diff --- perl/regexec.c.~1~ Sun Dec 17 14:53:38 2000 +++ perl/regexec.c Sun Dec 17 14:53:38 2000 @@ -357,17 +357,18 @@ || ( (prog->reganch & ROPT_ANCH_BOL) && !PL_multiline ) ); /* Check after \n? */ - if ((prog->check_offset_min == prog->check_offset_max) && !ml_anch) { + if (!ml_anch) { + if ( !(prog->reganch & ROPT_ANCH_GPOS) /* Checked by the caller */ + /* SvCUR is not set on references: SvRV and SvPVX overlap */ + && sv && !SvROK(sv) + && (strpos + SvCUR(sv) != strend)) { + DEBUG_r(PerlIO_printf(Perl_debug_log, "Not at start...\n")); + goto fail; + } + if (prog->check_offset_min == prog->check_offset_max) { /* Substring at constant offset from beg-of-str... */ I32 slen; - if ( !(prog->reganch & ROPT_ANCH_GPOS) /* Checked by the caller */ - /* SvCUR is not set on references: SvRV and SvPVX overlap */ - && sv && !SvROK(sv) - && (strpos + SvCUR(sv) != strend)) { - DEBUG_r(PerlIO_printf(Perl_debug_log, "Not at start...\n")); - goto fail; - } PL_regeol = strend; /* Used in HOP() */ s = HOPc(strpos, prog->check_offset_min); if (SvTAIL(check)) { @@ -393,6 +394,7 @@ && memNE(SvPVX(check), s, slen))) goto report_neq; goto success_at_start; + } } /* Match is anchored, but substr is not anchored wrt beg-of-str. */ s = strpos; ```

============================================================

After these two patches have been applied to 5.6.0\, then tests #1 and #3 in your improved test script fail.

Hope this helps.

Have fun\,

Rich

p5pRT commented 20 years ago

From @smpeters

This simple script demonstrates the problem :-

#! /usr/bin/perl -w

use strict;

my $input = "a{b}c{d}";

while ($input =~ /(.*?)\{(.*?)\}/csg) { print "\$1 = '$1' \$2 = '$2'\n"; } print "Rest = '$1'\n" if $input =~ m/\G(.+)$/csg;

On perl 5.6.0 and perl 5.005_03 this produces this (correct) output...

================================================== $1 = 'a' $2 = 'b' $1 = 'c' $2 = 'd'

On perl 5.6.1 we get this...

================================================== $1 = 'a' $2 = 'b' Rest = 'c{d}'

I tried this in Perl 5.8.2 and 5.8.5\, and it seems to be working as expected again.

p5pRT commented 20 years ago

@rspier - Status changed from 'open' to 'resolved'