Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 554 forks source link

Possible regex bug in 5.6.1 #4276

Closed p5pRT closed 20 years ago

p5pRT commented 23 years ago

Migrated from rt.perl.org#7438 (status was 'resolved')

Searchable as RT7438$

p5pRT commented 23 years ago

From ph10@cus.cam.ac.uk


print "Running Perl $]\n\n";

print "\"ABABAB\" =~ /(AB)*?\\1/\n";

if ("ABABAB" =~ /(AB)*?\1/)   {   print "Match\n";   print "\$1 is defined as \"$1\"\n" if (defined $1);   } else   {   print "No match\n";   }


produces this output​:


Running Perl 5.006001

"ABABAB" =~ /(AB)*?\1/ Match $1 is defined as ""


Is this right? The (AB)*? has matched zero times\, but the \1 is treating this as if the subpattern had matched an empty string\, but in fact it hasn't matched at all.

Consider the case of /(AB)|(CD)/ matching against "CD". In this case\, $1 is not defined after the successful match.

This is a subtle area\, but I'm not sure that these two cases should be different. Certainly it feels wrong that the contents of the parens (AB) are ever reported as having matched anything other than "AB"...

Regards\, Philip

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Is this right? The (AB)*? has matched zero times\, but the \1 is treating this as if the subpattern had matched an empty string\, but in fact it hasn't matched at all.

It *did* match. You told the regex to match 'AB' zero or more times\, preferring as few times as possible (the question mark). So it matched 'AB' zero times at the beginning of your string (that is\, the empty string that comes at the beginning)\, and that empty string (which matched /(AB)*?/) was indeed followed immediately by another copy of itself\, so it satisfied the match.

If you want to match AB at least one time\, say so​: /(AB)+?\1/ . Then it will match 'AB' (in fact\, the 'AB' immediately at the beginning of the string) into $1.

Moral​: beware of strings or subexpressions that can match the empty string :-)

Regards\, Philip

Cheers\, Philip

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

I am away from my office for a while\, and will not be back until Monday\, August 13. I hope be able to read email while I am away\, but may be slow in responding.

If you are having a problem with Exim\, try asking on exim-users@​exim.org. You need to subscribe to the list before posting - you can do this from the web site at http​://www.exim.org. There are some nice people out there who may well be able to help you.

Regards\, Philip

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

It *did* match. You told the regex to match 'AB' zero or more times\, preferring as few times as possible (the question mark). So it matched 'AB' zero times at the beginning of your string (that is\, the empty string that comes at the beginning)\, and that empty string (which matched /(AB)*?/) was indeed followed immediately by another copy of itself\, so it satisfied the match.

Thanks for your quick response\, but I'm afraid I must beg to differ. (I do know quite a lot about regexs\, as I'm the author of the PCRE library\, in case you didn't know.)

1. Here's a different approach​:

Suppose I match (A|B)*CD against the string ABCD. After the match\, the value of $1 is "B". This is what I expect. If I match the same pattern against the string "CD"\, after the match $1 is unset. This is also what I expect. Here's the test script​:

$x = "ABCD"; $x =~ /(A|B)*CD/; if (defined $1) { print "$1\n"; } else { print "undefined\n"; } $x = "CD"; $x =~ /(A|B)*CD/; if (defined $1) { print "$1\n"; } else { print "undefined\n"; }

It outputs

B undefined

So when the parens match zero times\, the value of the corresponding $ variable is undefined\, not the empty string. I do not understand why this is different to my previous example\, which was

"ABABAB" =~ /(AB)*?\1/

Why does matching zero times behave differently here? [More discussion of this case below.]

2. It seems to be related to the presence of "?". Look at this test​:

$x = "CD"; $x =~ /(AB)*?CD/; if (defined $1) { print "\"$1\"\n"; } else { print "undefined\n"; }

$x = "CD"; $x =~ /(AB)*CD/; if (defined $1) { print "\"$1\"\n"; } else { print "undefined\n"; }

The only different in the two cases is the "?" in the first one. In both cases the group must match zero times; but for the first\, $1 is set to ""\, whereas in the second case it is undefined. This just has to be a bug - they ought at least to give the same result.

Regards\, Philip


[More discussion of 1.] When matching (A|B)*CD against ABCD\, $1 is initially unset. After the first time through the group is becomes "A"; after the second time through\, it becomes "B". The third time through fails\, so the value is reset to what it was before\, namely "B". Similarly\, when matching the same pattern agains CD\, the first time fails\, so $1 is reset to what it was before\, namely unset. This is indeed what Perl does in this example. What I don't understand is why /(AB)*\1/ is different.

p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

$x = "CD"; $x =~ /(AB)*CD/; if (defined $1) { print "\"$1\"\n"; } else { print "undefined\n"; }

You know\, the first or second time I got caught by this\, I decided that it was due to my own bad programming\, and these inner patterns that match zero or more times can return "Z" for all I care (and I wouldn't even consider it a bug) because I'll be doing something like​:

$x =~ /((AB)*)CD/; if ($1) { print $2 }

To me it would be better if the regex engine were optimized to not worry about $2 if $1 matches the empty string. But isn't this all patched now to return undef for these sorts of matches in 5.7.x?

JHMO\, Douglas Wilson

p5pRT commented 23 years ago

From @vanstyn

Seems like a bug to me. I'll take a look.

I think this is the clearest test case I've come up with​:

crypt% perl -wle '"CD" =~ /(A|B)*CD/ and defined($1) and print "defined"' crypt% perl -wle '"CD" =~ /(A|B)*?CD/ and defined($1) and print "defined"' defined crypt%

.. which acts the same at least as far back as 5.005_03 and forward to 5.7.2.

Hugo

p5pRT commented 23 years ago

From @vanstyn

Attached patch over @​11692 fixes it\, as per the new tests\, and tries to improve the comments slightly - the actual bugfix is the second chunk. Should also apply to 5.6.*.

Hugo

Inline Patch ```diff --- regexec.c.old Wed Aug 15 14:32:44 2001 +++ regexec.c Thu Aug 16 16:18:03 2001 @@ -3033,12 +3033,15 @@ minmod = 0; if (ln && regrepeat_hard(scan, ln, &l) < ln) sayNO; - if (ln && l == 0 && n >= ln - /* In fact, this is tricky. If paren, then the - fact that we did/didnot match may influence - future execution. */ - && !(paren && ln == 0)) - ln = n; + /* if we matched something zero-length we don't need to + backtrack - capturing parens are already defined, so + the caveat in the maximal case doesn't apply + + XXXX if ln == 0, we can redo this check first time + through the following loop + */ + if (ln && l == 0) + n = ln; /* don't backtrack */ locinput = PL_reginput; if (PL_regkind[(U8)OP(next)] == EXACT) { c1 = (U8)*STRING(next); @@ -3060,7 +3063,7 @@ UCHARAT(PL_reginput) == c2) { if (paren) { - if (n) { + if (ln) { PL_regstartp[paren] = HOPc(PL_reginput, -l) - PL_bostr; PL_regendp[paren] = PL_reginput - PL_bostr; @@ -3084,12 +3087,13 @@ } else { n = regrepeat_hard(scan, n, &l); - if (n != 0 && l == 0 - /* In fact, this is tricky. If paren, then the - fact that we did/didnot match may influence - future execution. */ - && !(paren && ln == 0)) - ln = n; + /* if we matched something zero-length we don't need to + backtrack, unless the minimum count is zero and we + are capturing the result - in that case the capture + being defined or not may affect later execution + */ + if (n != 0 && l == 0 && !(paren && ln == 0)) + ln = n; /* don't backtrack */ locinput = PL_reginput; DEBUG_r( PerlIO_printf(Perl_debug_log, --- t/op/re_tests.old Wed Aug 15 14:32:44 2001 +++ t/op/re_tests Thu Aug 16 16:14:09 2001 @@ -792,3 +792,7 @@ ^(a(??{"(?!)"})|(a)(?{1}))b ab y $2 a # [ID 20010811.006] ab(?i)cd AbCd n - - # [ID 20010809.023] ab(?i)cd abCd y - - +(A|B)*(?(1)(CD)|(CD)) CD y $2-$3 -CD +(A|B)*(?(1)(CD)|(CD)) ABCD y $2-$3 CD- +(A|B)*?(?(1)(CD)|(CD)) CD y $2-$3 -CD # [ID 20010803.016] +(A|B)*?(?(1)(CD)|(CD)) ABCD y $2-$3 CD- ```
p5pRT commented 23 years ago

From [Unknown Contact. See original ticket]

Attached patch over @​11692 fixes it\, as per the new tests\, and tries to improve the comments slightly - the actual bugfix is the second chunk.

Thanks. I can confirm that it fixes it for my tests too.

Regards\, Philip