Closed p5pRT closed 13 years ago
A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself:
c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c
c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc
I expect the 5.8 behaviour. 5.6.x and 5.8.x behave like 5.8.9. I've been told 5.10.1-RC1 behaves like 5.10.0. I don't know about blead.
Eric "ikegami" Brine
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself: : :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".
The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with: Match possible\, but length=0 is smaller than requested=1\, failing! Contradicts stclass... [regexec_flags] Match failed instead of advancing to offset 1 and trying again.
I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.
Hugo
The RT System itself - Status changed from 'new' to 'open'
2009/8/16 \hv@​crypt\.org:
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself: : :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".
The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with: Â Match possible\, but length=0 is smaller than requested=1\, failing! Â Contradicts stclass... [regexec_flags] Â Match failed instead of advancing to offset 1 and trying again.
I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.
This rings a bell for me. Is this really a regression or is it a bug fix?
Why do we assume the first is correct?
cheers\, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Mon\, Aug 17\, 2009 at 04:36:08PM +0200\, demerphq wrote:
2009/8/16 \hv@​crypt\.org:
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself: : :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
With -Dr the first difference between the two cases is that the failing case discovers optimizations "stclass NSPACE plus minlen 0" whereas the other finds only "minlen 0".
The failure case searches a second time for a match starting from offset 0\, finds the /\S+/ as before\, and then bails with: Â Match possible\, but length=0 is smaller than requested=1\, failing! Â Contradicts stclass... [regexec_flags] Â Match failed instead of advancing to offset 1 and trying again.
I'll look further to see if there's an opportunity to improve stclass support to cope with the case\, or whether we must instead revert to either not detecting or throwing away the stclass in this scenario.
This rings a bell for me. Is this really a regression or is it a bug fix?
Why do we assume the first is correct?
Well\, I'd expect
a) a look-ahead assertion not to affect pos (so apart from causing
additional possible match failures\, it should kind of be invisible
b) which means that the first match should leave pos unaffected\,
c) which following the 'if pos unchanged bump it by 1 to avoid infinite
loops' rule\, means that I would expect the pattern to be tried against
\
-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century. -- Things That Never Happen in "Star Trek" #14
demerphq \demerphq@​gmail\.com writes:
2009/8/16 \hv@​crypt\.org:
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :A regression was introduced into 5.10.0 concerning /g and zero-width :patterns. The demo speaks for itself: : :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
This rings a bell for me. Is this really a regression or is it a bug fix?
Why do we assume the first is correct?
Because\, since the match is zero-length\, we expect[1] pos() to be advanced by 1 before trying to match again. Instead\, it seems to be advanced by the length of the captures or something ...
sidhekin@blackbox[18:07:50]~$ 588perl -le '$x=" abc xyz "; print pos $x\, " $1 "\, length($&) while $x =~ /(?=(\S+))/g' 1 abc 0 2 bc 0 3 c 0 5 xyz 0 6 yz 0 7 z 0 sidhekin@blackbox[18:07:56]~$ 5100perl -le '$x=" abc xyz "; print pos $x\, " $1 "\, length($&) while $x =~ /(?=(\S+))/g' 1 abc 0 5 xyz 0 sidhekin@blackbox[18:07:58]~$
[1]: Disclaimer: Our expectations may well be the problem. :-\
Eirik -- Machine. Unexpectedly\, I'd invented a time - Alan Moore
On Mon\, Aug 17\, 2009 at 10:36 AM\, demerphq \demerphq@​gmail\.com wrote:
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
This rings a bell for me. Is this really a regression or is it a bug fix?
Why do we assume the first is correct?
Well something is broken. With 5.10.0 (debian lenny):
$ perl -wle'print for "abc " =~ /(?=(\S+))/g' abc
$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g' abc bc c
$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g' abc bc c
These should all return the same.
On Mon\, Aug 17\, 2009 at 12:37:45PM -0400\, Eric Brine wrote:
On Mon\, Aug 17\, 2009 at 10:36 AM\, demerphq \demerphq@​gmail\.com wrote:
"Eric Brine" (via RT) \perlbug\-followup@​perl\.org wrote: :>c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc :bc :c : :>c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" :abc
This rings a bell for me. Is this really a regression or is it a bug fix?
Why do we assume the first is correct?
Well something is broken. With 5.10.0 (debian lenny):
$ perl -wle'print for "abc " =~ /(?=(\S+))/g' abc
$ perl -wle'print for "abc " =~ /(?{})(?=(\S+))/g' abc bc c
$ perl -wle'print for "abc " =~ /x*(?=(\S+))/g' abc bc c
These should all return the same.
Something related (broken only in 5.10\, fixed in 5.10.1-RC):
$ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c $ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' $ $ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c
Not sure what blead does - the last pull I did didn't build.
Abigail
On Sun Aug 16 01:15:58 2009\, ikegami@adaelis.com wrote:
A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself:
c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c
c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc
This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).
I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.
So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).
I really donât understand this code\, and would much appreciate any feedback as to whether this fix (or âfixâ) will break anything else.
From: Father Chrysostomos \sprout@​cpan\.org
[perl #68564] /g failure with zero-width patterns
To make "abc " =~ /(?=(\S+))/g work again (it should return "abc"\, "bc"\, "c" on each successive match\, respectively; it was returning just "abc")\, this commit disables the PREGf_SKIP optimisation on a PLUS if it is inside an IFMATCH (?=).
On Mon\, Aug 17\, 2009 at 10:55:09PM +0200\, Abigail wrote:
Something related (broken only in 5.10\, fixed in 5.10.1-RC):
$ /opt/perl/5.8.8/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c $ /opt/perl/5.10.0/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' $ $ /opt/perl/5.10.1-RC1/bin/perl -wle 'print for "abc" =~ /(?=)(?=(\S+))/g' abc bc c
Not sure what blead does - the last pull I did didn't build.
I bisected with
#!/bin/sh git clean -dxf touch .patchnum touch .sha1 touch unpushed.h # If you can use ccache\, add -Dcc=ccache\ gcc -Dld=gcc to the Configure line # if Encode is not needed for the test\, you can speed up the bisect by # excluding it from the runs with -Dnoextensions=Encode sh Configure -des -Dusedevel -Uusethreads -Doptimize="-g" -Dcc=ccache\ gcc -Dld=gcc -Dnoextensions=IPC/SysV\ Encode\ DB_File test -f config.sh || exit 125 # Correct makefile for newer GNU gcc perl -ni -we 'print unless /\<(?:built-in|command)/' makefile x2p/makefile # if you just need miniperl\, replace test_prep with miniperl make -j5 miniperl [ -x ./miniperl ] || exit 125 ./miniperl -wle 'push @a\, $_ for "abc" =~ /(?=)(?=(\S+))/g; exit !!@a' ret=$? [ $ret -gt 127 ] && ret=127 git clean -dxf exit $ret
and found that this was fixed by
commit 89c6a13e141e02cc3af670ab47c1d41ac4e81ba0 Author: Ăvar ArnfjörĂ° Bjarmason \avar@​cpan\.org Date: Thu Apr 10 00:38:52 2008 +0000
Re: [perl #52672] regexp failure: (?=) turns into OPFAIL
From: "Ăvar ArnfjörĂ° Bjarmason" \avarab@​gmail\.com
Message-ID: \51dd1af80804091738r15d37763lf900d59f8bcc5e81@​mail\.gmail\.com
p4raw-id: //depot/perl@33667
I don't know if that test is equivalent to your code. Should your code be added as an additional test?
Nicholas Clark
Father Chrysostomos via RT wrote:
On Sun Aug 16 01:15:58 2009\, ikegami@adaelis.com wrote:
A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself:
c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c
c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc
This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).
I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.
So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).
I really donât understand this code\, and would much appreciate any feedback as to whether this fix (or âfixâ) will break anything else.
I'm afraid I don't know this code either.
On Mon\, Oct 11\, 2010 at 10:39 AM\, Nicholas Clark \nick@​ccl4\.org wrote:
I don't know if that test is equivalent to your code. Should your code be added as an additional test?
I don't see why not. Especially since it still fails for blead.
Patch with test attached.
- Eric
On Mon\, Oct 11\, 2010 at 9:23 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Mon\, Oct 11\, 2010 at 10:39 AM\, Nicholas Clark \nick@​ccl4\.org wrote:
I don't know if that test is equivalent to your code. Should your code be added as an additional test?
I don't see why not. Especially since it still fails for blead.
Patch with test attached.
- Eric
Forgot to increment number of tests. Updated patch attached.
From e81434cbf81a1bb93004b920bb35c502c51bf182 Mon Sep 17 00:00:00 2001 From: Eric Brine \ikegami@​adaelis\.com Date: Mon\, 11 Oct 2010 18:43:01 -0700 Subject: [PATCH] Test for RT##68564: perl -wle"print for 'abc' =~ /(?=(\S+))/g"
t/re/pat.t | 11 ++++++++++- 1 files changed\, 10 insertions(+)\, 1 deletions(-)
diff --git a/t/re/pat.t b/t/re/pat.t index c007880..c226b62 100644 --- a/t/re/pat.t +++ b/t/re/pat.t @@ -23\,7 +23\,7 @@ BEGIN { }
-plan tests => 398; # Update this when adding/deleting tests. +plan tests => 399; # Update this when adding/deleting tests.
run_tests() unless caller;
@@ -202\,6 +202\,15 @@ sub run_tests { }
{ + local $Message = "Test zero-width match with /g"; + local $" = ":"; + $_ = "abc"; + my @words = 'abc' =~ /(?=(\S+))/g; + my $exp = "abc:cb:c"; + iseq "@words"\, $exp; + } + + { $_ = "abcdefghi";
my $pat1 = 'def'; -- 1.7.1.1
On 10 October 2010 23:27\, Father Chrysostomos via RT \perlbug\-followup@​perl\.org wrote:
On Sun Aug 16 01:15:58 2009\, ikegami@adaelis.com wrote:
A regression was introduced into 5.10.0 concerning /g and zero-width patterns. The demo speaks for itself:
c:\progs\perl589\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc bc c
c:\progs\perl5100\bin\perl -wle"print for 'abc ' =~ /(?=(\S+))/g" abc
This was broken by commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3\, which extended stclass optimisations to (?=).
I tried following the code paths for + and for {1\,} (which are meant to be identical\, but only {1\,} was working). I noticed they diverged as a result of + having PREGf_SKIP set.
So I fixed it by not setting PREGf_SKIP if the + is inside a (?=).
I really donât understand this code\, and would much appreciate any feedback as to whether this fix (or âfixâ) will break anything else.
I have looked into this more deeply and applied a modified version of your patch\, which IMO was more or less exactly correct.
Here is the commit message I wrote:
commit e7f38d0fe17e7a846c0ed55e71ebb120a336b887 Author: Yves Orton \demerphq@​gmail\.com Date: Wed Nov 3 10:23:00 2010 +0100
fix 68564: /g failure with zero-width patterns
This is based on a patch by Father Chrysostomos \sprout@​cpan\.org
The start class optimisation has two modes\, "try every valid start position" (doevery) and "flip flop mode" (!doevery) where it trys only the first valid start position in a sequence.
Consider /(\d+)X/ and the string "123456Y"\, now we know that if we fail to match X after matching "123456" then we will also fail to match after "23456" (assuming no evil tricks are in place\, which disable the optimisation anyway)\, so we know we can skip forward until the check /fails/ and only then start looking for a real match. This is flip-flop mode.
Now consider the case with zero-width lookahead under /g: /(?=(\d+)X)/. In this case we have an additional failure mode\, that is failure when we match a zero-width string twice at the same pos(). So now\, the "flip-flop" logic breaks as it /is/ possible that we could match at "23456" when we couldn't match at "123456" because of the zero-length twice at the same pos() rule. For instance:
print $1 for "123"=~/(?=(\d+))/g
should first match "123". Since $& is zero length\, pos() is not incremented. We then match again\, successfully\, except that the match is rejected despite technical-success because its $& is also zero length and pos() has not advanced. If the flip-flop mode is enabled we wont retry until we find a failing character first.
The point here is that it makes perfect sense to disable the "flip-flop" mode optimisation when the start class is inside a lookahead as it really doesnt apply.
IMO your patch was quite right\, although I had to dig fairly deep to understand why.
Thanks for the patch.
BTW\, I am a bit curious if there are any other flaws in the flip-flop logic. I tried reasonably hard to make it fail without the zero-width lookahead\, and was unable to find a failure case\, but I still kinda feel like there might be some interesting edge case
Cheers\, yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
@cpansprout - Status changed from 'open' to 'resolved'
Migrated from rt.perl.org#68564 (status was 'resolved')
Searchable as RT68564$