Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.97k stars 555 forks source link

Regex too selfish #6664

Closed p5pRT closed 21 years ago

p5pRT commented 21 years ago

Migrated from rt.perl.org#23171 (status was 'resolved')

Searchable as RT23171$

p5pRT commented 21 years ago

From @abigail

Created by @abigail

While trying to help out on a problem in clpm\, I encountered a regexp bug. If I run the following program​:

  print "foobarbar" =~ /^(.{3\,3})(.+?)(\2+)$/ ? "Yes\n" : "No\n";   print "foobarbar" =~ /^(.{3\,4})(.+?)(\2+)$/ ? "Yes\n" : "No\n";   print "foobarbar" =~ /^(.{3\,4}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";   print "foobarbar" =~ /^(.{2\,3}?)(.+?)(\2+)$/ ? "Yes\n" : "No\n";

I get with 5.8.0 and 5.8.1-RC2 the output​:

  Yes   No   Yes   No

5.005\, 5.6.0 and 5.6.1 however give the expected​:

  Yes   Yes   Yes   Yes

Abigail

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl v5.8.0: Configured by camel at Mon Oct 28 01:28:45 CET 2002. Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration: Platform: osname=linux, osvers=2.4.18-bf2.4, archname=i686-linux-64int-ld uname='linux alexandra 2.4.18-bf2.4 #1 son apr 14 09:53:28 cest 2002 i686 unknown ' config_args='-des -Uversiononly -Dmydomain=.abigail.nl -Dcf_email=camel@abigail.nl -Dperladmin=camel@abigail.nl -Doptimize=-g -Dusemorebits -Dusedevel -Dusenm=false -Dprefix=/opt/perl -Dcc=gcc' hint=recommended, useposix=true, d_sigaction=define usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef useperlio=define d_sfio=undef uselargefiles=define usesocks=undef use64bitint=define use64bitall=undef uselongdouble=define usemymalloc=n, bincompat5005=undef Compiler: cc='gcc', ccflags ='-DDEBUGGING -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-g', cppflags='-DDEBUGGING -fno-strict-aliasing' ccversion='', gccversion='3.0.4', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=12345678 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long long', ivsize=8, nvtype='long double', nvsize=12, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='gcc', ldflags =' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -ldl -lm -lc -lcrypt -lutil perllibs=-lnsl -ldl -lm -lc -lcrypt -lutil libc=/lib/libc-2.2.5.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.2.5' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic' cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib' Locally applied patches: @INC for perl v5.8.0: /home/abigail/Perl /home/abigail/Sybase /opt/perl/lib/5.8.0/i686-linux-64int-ld /opt/perl/lib/5.8.0 /opt/perl/lib/site_perl/5.8.0/i686-linux-64int-ld /opt/perl/lib/site_perl/5.8.0 /opt/perl/lib/site_perl . Environment for perl v5.8.0: HOME=/home/abigail LANG=C LANGUAGE (unset) LD_LIBRARY_PATH=/home/abigail/Lib:/usr/local/lib:/usr/lib:/lib:/usr/X11R6/lib:/opt/gnome/lib LOGDIR (unset) PATH=/home/abigail/Bin:/opt/perl/bin:/usr/local/bin:/usr/local/X11/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/X11R6/bin:/usr/games:/opt/povray/bin:/opt/gnome/bin:/opt/opera/bin:/usr/share/texmf/bin:/opt/Acrobat/bin:/opt/java/blackdown/j2sdk1.3.1/bin:/usr/local/games/bin:/opt/gnuplot/bin:/opt/mysql/bin PERL5LIB=/home/abigail/Perl:/home/abigail/Sybase PERLDIR=/opt/perl PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 21 years ago

From @tux

On Wed 30 Jul 2003 15​:22\, "abigail@​abigail.nl (via RT)" \perlbug\-followup@​perl\.org wrote​:

While trying to help out on a problem in clpm\, I encountered a regexp bug. If I run the following program​:

print "foobarbar" =~ /^\(\.\{3\,3\}\)\(\.\+?\)\(\\2\+\)$/ ? "Yes\\n" : "No\\n";
print "foobarbar" =~ /^\(\.\{3\,4\}\)\(\.\+?\)\(\\2\+\)$/ ? "Yes\\n" : "No\\n";
print "foobarbar" =~ /^\(\.\{3\,4\}?\)\(\.\+?\)\(\\2\+\)$/ ? "Yes\\n" : "No\\n";
print "foobarbar" =~ /^\(\.\{2\,3\}?\)\(\.\+?\)\(\\2\+\)$/ ? "Yes\\n" : "No\\n";

I get with 5.8.0 and 5.8.1-RC2 the output​:

Yes
No
Yes
No

5.005\, 5.6.0 and 5.6.1 however give the expected​:

Yes
Yes
Yes
Yes

Confirmed​:

/pro/bin/perl5.00503 Yes Yes Yes Yes /pro/bin/perl5.6.1 Yes Yes Yes Yes /pro/bin/perl5.8.0 Yes No Yes No /pro/bin/perl5.9.0 Yes No Yes No

-- H.Merijn Brand Amsterdam Perl Mongers (http​://amsterdam.pm.org/) using perl-5.6.1\, 5.8.0 & 633 on HP-UX 10.20 & 11.00\, AIX 4.2\, AIX 4.3\,   WinNT 4\, Win2K pro & WinCE 2.11. Smoking perl CORE​: smokers@​perl.org http​://archives.develooper.com/daily-build@​perl.org/ perl-qa@​perl.org send smoke reports to​: smokers-reports@​perl.org\, QA​: http​://qa.perl.org

p5pRT commented 21 years ago

From @hvds

"abigail@​abigail.nl (via RT)" \perlbug\-followup@​perl\.org wrote​: [...] : print "foobarbar" =~ /^(.{3\,4})(.+?)(\2+)$/ ? "Yes\n" : "No\n";

I'm working with the slightly simpler​:   /^.{3\,4}(.+)\1\z/s

What happens is that the .{4} branch successively sets $1 to "arbar" through to "a"\, fails and backtracks. However it fails to mark $1 as unmatched; after the REGCP_UNWIND() at (blead) regexec.c​:3817 we have​: (gdb) p ((int*)PL_regstartp)[1] $31 = 4 (gdb) p ((int*)PL_regendp)[1] $32 = 5 (gdb) p *PL_reglastparen $33 = 1 (gdb)

This then causes an optimisation to kick in just after the 'repeat​:' label in S_regmatch()​:   if (PL_regkind[(U8)OP(text_node)] == REF) { yes\, it is   I32 n\, ln;   n = ARG(text_node); /* which paren pair */ pair 1   ln = PL_regstartp[n]; index 4   /* assume yes if we haven't seen CLOSEn */   if (   (I32)*PL_reglastparen \< n || wrong reglastparen\, so this fails   ln == -1 || open index still marked valid\, so this fails   ln == PL_regendp[n] not an empty match (we tried .+\, not .*)\, so this fails   ) {   c1 = c2 = -1000;   goto assume_ok_easy;   } so we assume we can optimise this   s = (U8*)PL_bostr + ln;   }

Andreas\, could you find out at what patchlevel this started failing? That should help point us to the best fix.

Hugo

p5pRT commented 21 years ago

From @andk

13002

----Program---- use strict; use warnings; $\="\n"; print "foobarbar" =~ /^.{3\,4}(.+)\1\z/s ? "ok" : "not ok"

----Output of .../pZH7kgr/perl-5.7.2@​13001/bin/perl---- ok

----EOF ($?='0')---- ----Output of .../pcU5e7z/perl-5.7.2@​13002/bin/perl---- not ok

----EOF ($?='0')----

-- andreas

p5pRT commented 21 years ago

From @hvds

Andreas J Koenig \andreas\.koenig@&#8203;anima\.de wrote​: :13002

Thank you very much. This is the patch that introduced the optimisation\, so here is a choice of patches to fix things.

There are three patches below. The first introduces 16 new tests\, aiming to exercise all the relevant code paths. Only one of the other two should be applied​: the first of those takes the simple approach\, simply removing this aspect of the optimisation entirely; the other attempts to fix the problem by ensuring that prog->lastparen is appropriately reset on backtracking through repeats.

My main worry about that last patch is that I don't understand why of the 6 obvious places to do that resetting\, precisely 5 are necessary to allow the new tests to pass\, and the 6th (commented out) causes a completely different test failure if left in. However all tests pass with the patch as given\, and perhaps that is enough.

Hugo

Inline Patch ```diff --- t/op/re_tests.old Mon Oct 21 03:28:24 2002 +++ t/op/re_tests Thu Jul 31 18:47:06 2003 @@ -924,3 +924,19 @@ (??{}) x y - - a(b)?? abc y <$1> <> # undef [perl #16773] (\d{1,3}\.){3,} 128.134.142.8 y <$1> <142.> # [perl #18019] +^.{3,4}(.+)\1\z foobarbar y $1 bar # 16 tests for [perl #23171] +^(?:f|o|b){3,4}(.+)\1\z foobarbar y $1 bar +^.{3,4}((?:b|a|r)+)\1\z foobarbar y $1 bar +^(?:f|o|b){3,4}((?:b|a|r)+)\1\z foobarbar y $1 bar +^.{3,4}(.+?)\1\z foobarbar y $1 bar +^(?:f|o|b){3,4}(.+?)\1\z foobarbar y $1 bar +^.{3,4}((?:b|a|r)+?)\1\z foobarbar y $1 bar +^(?:f|o|b){3,4}((?:b|a|r)+?)\1\z foobarbar y $1 bar +^.{2,3}?(.+)\1\z foobarbar y $1 bar +^(?:f|o|b){2,3}?(.+)\1\z foobarbar y $1 bar +^.{2,3}?((?:b|a|r)+)\1\z foobarbar y $1 bar +^(?:f|o|b){2,3}?((?:b|a|r)+)\1\z foobarbar y $1 bar +^.{2,3}?(.+?)\1\z foobarbar y $1 bar +^(?:f|o|b){2,3}?(.+?)\1\z foobarbar y $1 bar +^.{2,3}?((?:b|a|r)+?)\1\z foobarbar y $1 bar +^(?:f|o|b){2,3}?((?:b|a|r)+?)\1\z foobarbar y $1 bar --- regexec.c.old Wed Jun 18 21:37:39 2003 +++ regexec.c Thu Jul 31 18:00:34 2003 @@ -3390,19 +3390,8 @@ if (! HAS_TEXT(text_node)) c1 = c2 = -1000; else { if (PL_regkind[(U8)OP(text_node)] == REF) { - I32 n, ln; - n = ARG(text_node); /* which paren pair */ - ln = PL_regstartp[n]; - /* assume yes if we haven't seen CLOSEn */ - if ( - (I32)*PL_reglastparen < n || - ln == -1 || - ln == PL_regendp[n] - ) { - c1 = c2 = -1000; - goto assume_ok_MM; - } - c1 = *(PL_bostr + ln); + c1 = c2 = -1000; + goto assume_ok_MM; } else { c1 = (U8)*STRING(text_node); } if (OP(text_node) == EXACTF || OP(text_node) == REFF) @@ -3472,19 +3461,8 @@ if (! HAS_TEXT(text_node)) c1 = c2 = -1000; else { if (PL_regkind[(U8)OP(text_node)] == REF) { - I32 n, ln; - n = ARG(text_node); /* which paren pair */ - ln = PL_regstartp[n]; - /* assume yes if we haven't seen CLOSEn */ - if ( - (I32)*PL_reglastparen < n || - ln == -1 || - ln == PL_regendp[n] - ) { - c1 = c2 = -1000; - goto assume_ok_REG; - } - c1 = *(PL_bostr + ln); + c1 = c2 = -1000; + goto assume_ok_REG; } else { c1 = (U8)*STRING(text_node); } @@ -3581,19 +3559,8 @@ if (! HAS_TEXT(text_node)) c1 = c2 = -1000; else { if (PL_regkind[(U8)OP(text_node)] == REF) { - I32 n, ln; - n = ARG(text_node); /* which paren pair */ - ln = PL_regstartp[n]; - /* assume yes if we haven't seen CLOSEn */ - if ( - (I32)*PL_reglastparen < n || - ln == -1 || - ln == PL_regendp[n] - ) { - c1 = c2 = -1000; - goto assume_ok_easy; - } - s = (U8*)PL_bostr + ln; + c1 = c2 = -1000; + goto assume_ok_easy; } else { s = (U8*)STRING(text_node); } --- regexec.c.old Wed Jun 18 21:37:39 2003 +++ regexec.c Thu Jul 31 18:52:52 2003 @@ -3351,6 +3351,7 @@ { I32 l = 0; CHECKPOINT lastcp; + I32 lparen = *PL_reglastparen; /* We suppose that the next guy does not need backtracking: in particular, it is of constant length, @@ -3435,6 +3436,8 @@ } if (regmatch(next)) sayYES; + /* t/op/regexp.t test 885 fails if this is performed */ + /* *PL_reglastparen = lparen; */ REGCP_UNWIND(lastcp); } /* Couldn't or didn't -- move forward. */ @@ -3522,6 +3525,7 @@ } if (regmatch(next)) sayYES; + *PL_reglastparen = lparen; REGCP_UNWIND(lastcp); } /* Couldn't or didn't -- back up. */ @@ -3634,6 +3638,7 @@ PL_reginput = locinput; if (minmod) { CHECKPOINT lastcp; + I32 lparen = *PL_reglastparen; minmod = 0; if (ln && regrepeat(scan, ln) < ln) sayNO; @@ -3740,6 +3745,7 @@ if (c == (UV)c1 || c == (UV)c2) { TRYPAREN(paren, ln, PL_reginput); + *PL_reglastparen = lparen; REGCP_UNWIND(lastcp); } } @@ -3747,6 +3753,7 @@ else if (c1 == -1000) { TRYPAREN(paren, ln, PL_reginput); + *PL_reglastparen = lparen; REGCP_UNWIND(lastcp); } /* Couldn't or didn't -- move forward. */ @@ -3761,6 +3768,7 @@ } else { CHECKPOINT lastcp; + I32 lparen = *PL_reglastparen; n = regrepeat(scan, n); locinput = PL_reginput; if (ln < n && PL_regkind[(U8)OP(next)] == EOL && @@ -3791,6 +3799,7 @@ if (c1 == -1000 || c == (UV)c1 || c == (UV)c2) { TRYPAREN(paren, n, PL_reginput); + *PL_reglastparen = lparen; REGCP_UNWIND(lastcp); } /* Couldn't or didn't -- back up. */ @@ -3814,6 +3823,7 @@ if (c1 == -1000 || c == (UV)c1 || c == (UV)c2) { TRYPAREN(paren, n, PL_reginput); + *PL_reglastparen = lparen; REGCP_UNWIND(lastcp); } /* Couldn't or didn't -- back up. */ ```
p5pRT commented 21 years ago

From @hvds

Earlier I wrote​: :There are three patches below. The first introduces 16 new tests\, :aiming to exercise all the relevant code paths. Only one of the :other two should be applied​: the first of those takes the simple :approach\, simply removing this aspect of the optimisation entirely; :the other attempts to fix the problem by ensuring that prog->lastparen :is appropriately reset on backtracking through repeats.

On second thoughts\, I think the second patch should be preferred to the third - the latter (marginally) slows down the matching of every fixed-width quantifier [0] to benefit a relatively rare case\, that of a fixed-width quantifier followed by a backreference. I guess the commonest case that would benefit from the optimisation is the pattern to match simple quoted strings​:   /(["'])(.*?)\1/ but the more complex cases (eg that handle escaping) would usually require CURLYX rather than CURLY or CURLYM\, in which case the optimisation does not apply in any case.

It may be that at a later date we'll discover that other cases can benefit (in either speed or correctness) from relying on an accurate prog->lastparen\, in which case the alternative patch should be revisited.

Hugo [0] that is\, anything that matches a constant-width pattern a variable (or specified constant) number of times​: /.*/\, /[abc]?/ and /(ab|cd){3}/ all qualify\, but not /./ (too simple) nor /(a|bc)*/ (too complex).

p5pRT commented 21 years ago

@rgs - Status changed from 'new' to 'resolved'