Perl / perl5

đŸȘ The Perl programming language
https://dev.perl.org/perl5/
Other
1.85k stars 527 forks source link

/g failure with zero-width patterns #9831

Closed p5pRT closed 13 years ago

p5pRT commented 14 years ago

Migrated from rt.perl.org#68564 (status was 'resolved')

Searchable as RT68564$

p5pRT commented 14 years ago

From @ikegami

Created by @ikegami

A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc

I expect the 5.8 behaviour. 5.6.x and 5.8.x behave like 5.8.9. I've been told 5.10.1-RC1 behaves like 5.10.0. I don't know about blead.

Eric "ikegami" Brine

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.10.0: Configured by SYSTEM at Wed Sep 3 13:16:08 2008. Summary of my perl5 (revision 5 version 10 subversion 0) configuration: Platform: osname=MSWin32, osvers=5.00, archname=MSWin32-x86-multi-thread uname='' config_args='undef' hint=recommended, useposix=true, d_sigaction=undef useithreads=define, usemultiplicity=define useperlio=define, d_sfio=undef, uselargefiles=define, usesocks=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cl', ccflags ='-nologo -GF -W3 -MD -Zi -DNDEBUG -O1 -DWIN32 -D_CONSOLE -DNO_STRICT -DHAVE_DES_FCRYPT -DUSE_SITECUSTOMIZE -DPRIVLIB_LAST_IN_INC -DPERL_IMPLICIT_CONTEXT -DPERL_IMPLICIT_SYS -DUSE_PERLIO -DPERL_MSVCRT_READFIX', optimize='-MD -Zi -DNDEBUG -O1', cppflags='-DWIN32' ccversion='12.00.8804', gccversion='', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=undef, longlongsize=8, d_longdbl=define, longdblsize=10 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='__int64', lseeksize=8 alignbytes=8, prototype=define Linker and Libraries: ld='link', ldflags ='-nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\progs\perl5100\lib\CORE" -machine:x86' libpth=\lib libs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib perllibs= oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.lib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib msvcrt.lib libc=msvcrt.lib, so=dll, useshrplib=true, libperl=perl510.lib gnulibc_version='' Dynamic Linking: dlsrc=dl_win32.xs, dlext=dll, d_dlsymun=undef, ccdlflags=' ' cccdlflags=' ', lddlflags='-dll -nologo -nodefaultlib -debug -opt:ref,icf -libpath:"C:\progs\perl5100\lib\CORE" -machine:x86' Locally applied patches: ACTIVEPERL_LOCAL_PATCHES_ENTRY 33741 avoids segfaults invoking S_raise_signal() (on Linux) 33763 Win32 process ids can have more than 16 bits 32809 Load 'loadable object' with non-default file extension 32728 64-bit fix for Time::Local @INC for perl 5.10.0: c:/progs/perl5100/site/lib c:/progs/perl5100/lib . Environment for perl 5.10.0: HOME (unset) LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=c:\bin;c:\progs\perl5100\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\WBEM PERL_BADLANG (unset) SHELL (unset) ```
p5pRT commented 14 years ago

From @hvds

"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote​: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself​: : :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with​:   Match possible\, but length=0 is smaller than requested=1\, failing!   Contradicts stclass... [regexec_flags]   Match failed instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.

Hugo

p5pRT commented 14 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 14 years ago

From @demerphq

2009/8/16 \hv@​crypt\.org​:

"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote​: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself​: : :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with​:  Match possible\, but length=0 is smaller than requested=1\, failing!  Contradicts stclass... [regexec_flags]  Match failed instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 14 years ago

From @iabyn

On Mon\, Aug 17\, 2009 at 04​:36​:08PM +0200\, demerphq wrote​:

2009/8/16 \hv@​crypt\.org​:

"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote​: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself​: : :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".

The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with​:  Match possible\, but length=0 is smaller than requested=1\, failing!  Contradicts stclass... [regexec_flags]  Match failed instead of advancing to offset 1 and trying again.

I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well\, I'd expect a) a look-ahead assertion not to affect pos (so apart from causing additional possible match failures\, it should kind of be invisible b) which means that the first match should leave pos unaffected\, c) which following the 'if pos unchanged bump it by 1 to avoid infinite loops' rule\, means that I would expect the pattern to be tried against \\, \\, \\, \< > in turn (and fail on the fourth match).

-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century.   -- Things That Never Happen in "Star Trek" #14

p5pRT commented 14 years ago

From Eirik-Berg.Hanssen@allverden.no

demerphq \demerphq@&#8203;gmail\.com writes​:

2009/8/16 \hv@&#8203;crypt\.org​:

"Eric Brine" (via RT) \perlbug\-followup@&#8203;perl\.org wrote​: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself​: : :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

  Because\, since the match is zero-length\, we expect[1] pos() to be advanced by 1 before trying to match again. Instead\, it seems to be advanced by the length of the captures or something ...

sidhekin@​blackbox[18​:07​:50]~$ 588perl -le '$x=" abc xyz "; print pos $x\, " $1 "\, length($&) while $x =~ /(?=(\S+))/g' 1 abc 0 2 bc 0 3 c 0 5 xyz 0 6 yz 0 7 z 0 sidhekin@​blackbox[18​:07​:56]~$ 5100perl -le '$x=" abc xyz "; print pos $x\, " $1 "\, length($&) while $x =~ /(?=(\S+))/g' 1 abc 0 5 xyz 0 sidhekin@​blackbox[18​:07​:58]~$

[1]​: Disclaimer​: Our expectations may well be the problem. :-\

Eirik -- Machine. Unexpectedly\, I'd invented a time   - Alan Moore

p5pRT commented 14 years ago

From @ikegami

On Mon\, Aug 17\, 2009 at 10​:36 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

"Eric Brine" (via RT) \perlbug\-followup@&#8203;perl\.org wrote​: :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well something is broken. With 5.10.0 (debian lenny)​:

$ perl -wle'print for "abc " =~ /(?=(\S+))/g' abc

$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g' abc bc c

$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g' abc bc c

These should all return the same.

p5pRT commented 14 years ago

From @abigail

On Mon\, Aug 17\, 2009 at 12​:37​:45PM -0400\, Eric Brine wrote​:

On Mon\, Aug 17\, 2009 at 10​:36 AM\, demerphq \demerphq@&#8203;gmail\.com wrote​:

"Eric Brine" (via RT) \perlbug\-followup@&#8203;perl\.org wrote​: :>c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc

This rings a bell for me. Is this really a regression or is it a bug fix?

Why do we assume the first is correct?

Well something is broken. With 5.10.0 (debian lenny)​:

$ perl -wle'print for "abc " =~ /(?=(\S+))/g' abc

$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g' abc bc c

$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g' abc bc c

These should all return the same.

Something related (broken only in 5.10\, fixed in 5.10.1-RC)​:

  $ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'   abc   bc   c   $ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'   $   $ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g'   abc   bc   c

Not sure what blead does - the last pull I did didn't build.

Abigail

p5pRT commented 13 years ago

From @cpansprout

On Sun Aug 16 01​:15​:58 2009\, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc

This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code\, and would much appreciate any feedback as to whether this fix (or ‘fix’) will break anything else.

p5pRT commented 13 years ago

From @cpansprout

From​: Father Chrysostomos \sprout@&#8203;cpan\.org

[perl #68564] /g failure with zero-width patterns

To make "abc " =~ /(?=(\S+))/g work again (it should return "abc"\, "bc"\, "c" on each successive match\, respectively; it was returning just "abc")\, this commit disables the PREGf_SKIP optimisation on a PLUS if it is inside an IFMATCH (?=).

Inline Patch ```diff diff --git a/pod/perldelta.pod b/pod/perldelta.pod index b478273..bb1af44 100644 --- a/pod/perldelta.pod +++ b/pod/perldelta.pod @@ -698,6 +698,13 @@ Stringifying a scalar containing -0.0 no longer has the affect of turning false into true L<[perl #45133]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=45133>. +=item * + +Using a regular expression lookhead assertion containing a subpattern +quantified with a C<+> inside a global pattern (i.e., +C<"abc " =~ /(?=(\S+))/g>) now works again, as it did in 5.8.x +L<[perl #68564]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=68564>. + =back =head1 Known Problems diff --git a/regcomp.c b/regcomp.c index d6f3523..c3867e9 100644 --- a/regcomp.c +++ b/regcomp.c @@ -4595,6 +4595,7 @@ reStudy: I32 last_close = 0; /* pointed to by data */ regnode *first= scan; regnode *first_next= regnext(first); + bool lookahead = FALSE; /* * Skip introductions and multiplicators >= 1 @@ -4611,7 +4612,7 @@ reStudy: /* An OR of *one* alternative - should not happen now. */ (OP(first) == BRANCH && OP(first_next) != BRANCH) || /* for now we can't handle lookbehind IFMATCH*/ - (OP(first) == IFMATCH && !first->flags) || + (OP(first) == IFMATCH && !first->flags && (lookahead = 1)) || (OP(first) == PLUS) || (OP(first) == MINMOD) || /* An {n,m} with n>0 */ @@ -4699,7 +4700,8 @@ reStudy: goto again; } if (sawplus && (!sawopen || !RExC_sawback) - && !(RExC_seen & REG_SEEN_EVAL)) /* May examine pos and $& */ + && !(RExC_seen & REG_SEEN_EVAL) + && !lookahead) /* May examine pos and $& */ /* x+ must match at the 1st pos of run of x's */ r->intflags |= PREGf_SKIP; diff --git a/t/re/pat_rt_report.t b/t/re/pat_rt_report.t index 33b6f7c..cf176ec 100644 --- a/t/re/pat_rt_report.t +++ b/t/re/pat_rt_report.t @@ -21,7 +21,7 @@ BEGIN { } -plan tests => 2510; # Update this when adding/deleting tests. +plan tests => 2511; # Update this when adding/deleting tests. run_tests() unless caller; @@ -1181,6 +1181,14 @@ sub run_tests { iseq($first, $second); } + + { + local $BugId = 68564; # minimal CURLYM limited to 32767 matches + iseq + join("-", "abc " =~ /(?=(\S+))/g), + "abc-bc-b", + "[perl #68564] stclass optimisation does not break + inside (?=)"; + } } # End of sub run_tests 1; ```
p5pRT commented 13 years ago

From @nwc10

On Mon\, Aug 17\, 2009 at 10​:55​:09PM +0200\, Abigail wrote​:

Something related (broken only in 5.10\, fixed in 5.10.1-RC)​:

$ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c $ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' $ $ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c

Not sure what blead does - the last pull I did didn't build.

I bisected with

#!/bin/sh git clean -dxf touch .patchnum touch .sha1 touch unpushed.h # If you can use ccache\, add -Dcc=ccache\ gcc -Dld=gcc to the Configure line # if Encode is not needed for the test\, you can speed up the bisect by # excluding it from the runs with -Dnoextensions=Encode sh Configure -des -Dusedevel -Uusethreads -Doptimize="-g" -Dcc=ccache\ gcc -Dld=gcc -Dnoextensions=IPC/SysV\ Encode\ DB_File test -f config.sh || exit 125 # Correct makefile for newer GNU gcc perl -ni -we 'print unless /\<(?​:built-in|command)/' makefile x2p/makefile # if you just need miniperl\, replace test_prep with miniperl make -j5 miniperl [ -x ./miniperl ] || exit 125 ./miniperl -wle 'push @​a\, $_ for "abc" =~ /(?=)(?=(\S+))/g; exit !!@​a' ret=$? [ $ret -gt 127 ] && ret=127 git clean -dxf exit $ret

and found that this was fixed by

commit 89c6a13e141e02cc3af670ab47c1d41ac4e81ba0 Author​: Ævar ArnfjörĂ° Bjarmason \avar@&#8203;cpan\.org Date​: Thu Apr 10 00​:38​:52 2008 +0000

  Re​: [perl #52672] regexp failure​: (?=) turns into OPFAIL   From​: "Ævar ArnfjörĂ° Bjarmason" \avarab@&#8203;gmail\.com   Message-ID​: \51dd1af80804091738r15d37763lf900d59f8bcc5e81@&#8203;mail\.gmail\.com  
  p4raw-id​: //depot/perl@​33667

Inline Patch ```diff diff --git a/regcomp.c b/regcomp.c index 05b54ee..07d5535 100644 --- a/regcomp.c +++ b/regcomp.c @@ -5691,6 +5691,8 @@ S_reg(pTHX_ RExC_state_t *pRExC_state, I32 paren, I32 *flagp,U32 depth) RExC_seen |= REG_SEEN_LOOKBEHIND; RExC_parse++; case '=': /* (?=...) */ + RExC_seen_zerolen++; + break; case '!': /* (?!...) */ RExC_seen_zerolen++; if (*RExC_parse == ')') { diff --git a/t/op/re_tests b/t/op/re_tests index 3df3745..6894458 100644 --- a/t/op/re_tests +++ b/t/op/re_tests @@ -414,6 +414,7 @@ a[-]?c ac y $& ac '(abc)\1'i ABCABC y $1 ABC '([a-c]*)\1'i ABCABC y $1 ABC a(?!b). abad y $& ad +(?=)a a y $& a a(?=d). abad y $& ad a(?=c|d). abad y $& ad a(?:b|c|d)(.) ace y $1 e ```

I don't know if that test is equivalent to your code. Should your code be added as an additional test?

Nicholas Clark

p5pRT commented 13 years ago

From @khwilliamson

Father Chrysostomos via RT wrote​:

On Sun Aug 16 01​:15​:58 2009\, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc

This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code\, and would much appreciate any feedback as to whether this fix (or ‘fix’) will break anything else.

I'm afraid I don't know this code either.

p5pRT commented 13 years ago

From @ikegami

On Mon\, Oct 11\, 2010 at 10​:39 AM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

I don't know if that test is equivalent to your code. Should your code be added as an additional test?

I don't see why not. Especially since it still fails for blead.

Patch with test attached.

- Eric

p5pRT commented 13 years ago

From @ikegami

0001-Test-for-RT-68564-perl-wle-print-for-abc-S-g.patch ```diff From 4508e9547d2d335f019b9cdfb4b48881749a46be Mon Sep 17 00:00:00 2001 From: Eric Brine Date: Mon, 11 Oct 2010 17:57:03 -0700 Subject: [PATCH] Test for RT##68564: perl -wle"print for 'abc' =~ /(?=(\S+))/g" --- t/re/pat.t | 9 +++++++++ 1 files changed, 9 insertions(+), 0 deletions(-) diff --git a/t/re/pat.t b/t/re/pat.t index c007880..b1d43f7 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -202,6 +202,15 @@ sub run_tests { } { + local $Message = "Test zero-width match with /g"; + local $" = ":"; + $_ = "abc"; + my @words = 'abc' =~ /(?=(\S+))/g; + my $exp = "abc:cb:c"; + iseq "@words", $exp; + } + + { $_ = "abcdefghi"; my $pat1 = 'def'; -- 1.7.1.1 ```
p5pRT commented 13 years ago

From @ikegami

On Mon\, Oct 11\, 2010 at 9​:23 PM\, Eric Brine \ikegami@&#8203;adaelis\.com wrote​:

On Mon\, Oct 11\, 2010 at 10​:39 AM\, Nicholas Clark \nick@&#8203;ccl4\.org wrote​:

I don't know if that test is equivalent to your code. Should your code be added as an additional test?

I don't see why not. Especially since it still fails for blead.

Patch with test attached.

- Eric

Forgot to increment number of tests. Updated patch attached.

p5pRT commented 13 years ago

From @ikegami

From e81434cbf81a1bb93004b920bb35c502c51bf182 Mon Sep 17 00​:00​:00 2001 From​: Eric Brine \ikegami@&#8203;adaelis\.com Date​: Mon\, 11 Oct 2010 18​:43​:01 -0700 Subject​: [PATCH] Test for RT##68564​: perl -wle"print for 'abc' =~ /(?=(\S+))/g"


t/re/pat.t | 11 ++++++++++- 1 files changed\, 10 insertions(+)\, 1 deletions(-)

diff --git a/t/re/pat.t b/t/re/pat.t index c007880..c226b62 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @​@​ -23\,7 +23\,7 @​@​ BEGIN { }

-plan tests => 398; # Update this when adding/deleting tests. +plan tests => 399; # Update this when adding/deleting tests.

run_tests() unless caller;

@​@​ -202\,6 +202\,15 @​@​ sub run_tests {   }

  { + local $Message = "Test zero-width match with /g"; + local $" = "​:"; + $_ = "abc"; + my @​words = 'abc' =~ /(?=(\S+))/g; + my $exp = "abc​:cb​:c"; + iseq "@​words"\, $exp; + } + + {   $_ = "abcdefghi";

  my $pat1 = 'def'; -- 1.7.1.1

p5pRT commented 13 years ago

From @demerphq

On 10 October 2010 23​:27\, Father Chrysostomos via RT \perlbug\-followup@&#8203;perl\.org wrote​:

On Sun Aug 16 01​:15​:58 2009\, ikegami@​adaelis.com wrote​:

A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself​:

c​:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c

c​:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc

This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).

I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.

So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).

I really don’t understand this code\, and would much appreciate any feedback as to whether this fix (or ‘fix’) will break anything else.

I have looked into this more deeply and applied a modified version of your patch\, which IMO was more or less exactly correct.

Here is the commit message I wrote​:

commit e7f38d0fe17e7a846c0ed55e71ebb120a336b887 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Wed Nov 3 10​:23​:00 2010 +0100

  fix 68564​: /g failure with zero-width patterns

  This is based on a patch by Father Chrysostomos \sprout@&#8203;cpan\.org

  The start class optimisation has two modes\, "try every valid start   position" (doevery) and "flip flop mode" (!doevery) where it trys   only the first valid start position in a sequence.

  Consider /(\d+)X/ and the string "123456Y"\, now we know that if we fail   to match X after matching "123456" then we will also fail to match after   "23456" (assuming no evil tricks are in place\, which disable the   optimisation anyway)\, so we know we can skip forward until the check   /fails/ and only then start looking for a real match. This is flip-flop   mode.

  Now consider the case with zero-width lookahead under /g​: /(?=(\d+)X)/.   In this case we have an additional failure mode\, that is failure when   we match a zero-width string twice at the same pos(). So now\, the   "flip-flop" logic breaks as it /is/ possible that we could match at   "23456" when we couldn't match at "123456" because of the zero-length   twice at the same pos() rule. For instance​:

  print $1 for "123"=~/(?=(\d+))/g

  should first match "123". Since $& is zero length\, pos() is not   incremented. We then match again\, successfully\, except that the match   is rejected despite technical-success because its $& is also zero   length and pos() has not advanced. If the flip-flop mode is enabled   we wont retry until we find a failing character first.

  The point here is that it makes perfect sense to disable the   "flip-flop" mode optimisation when the start class is inside   a lookahead as it really doesnt apply.

IMO your patch was quite right\, although I had to dig fairly deep to understand why.

Thanks for the patch.

BTW\, I am a bit curious if there are any other flaws in the flip-flop logic. I tried reasonably hard to make it fail without the zero-width lookahead\, and was unable to find a failure case\, but I still kinda feel like there might be some interesting edge case

Cheers\, yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 13 years ago

@cpansprout - Status changed from 'open' to 'resolved'