Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 554 forks source link

Regexp behavior in matching backreferences to optional captures #2271

Closed p5pRT closed 20 years ago

p5pRT commented 24 years ago

Migrated from rt.perl.org#3589 (status was 'open')

Searchable as RT3589$

p5pRT commented 24 years ago

From @vanstyn

Created by @vanstyn

crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad babadab bababdab bababdad /' bababdad​: -dad-d- crypt%

That's with 5.6.0; with bleadperl\, none of the strings match. I believe it should match 'babadad'.

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl v5.6.0: Configured by hv at Tue Jul 18 23:29:44 BST 2000. Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration: Platform: osname=linux, osvers=2.2.5-16, archname=i686-linux uname='linux crypt.compulink.co.uk 2.2.5-16 #1 sun may 30 23:00:18 bst 1999 i686 unknown ' config_args='-des -Doptimize=-g -O6 -Dprefix=/opt/perl-blead' hint=recommended, useposix=true, d_sigaction=define usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef useperlio=undef d_sfio=undef uselargefiles=define use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef Compiler: cc='cc', optimize='-g -O6', gccversion=egcs-2.91.66 19990314/Linux (egcs-1.1.2 release) cppflags='-DDEBUGGING -fno-strict-aliasing' ccflags ='-DDEBUGGING -fno-strict-aliasing -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64' stdchar='char', d_stdstdio=define, usevfork=false intsize=4, longsize=4, ptrsize=4, doublesize=8 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, usemymalloc=n, prototype=define Linker and Libraries: ld='cc', ldflags =' -L/usr/local/lib' libpth=/usr/local/lib /lib /usr/lib libs=-lnsl -lndbm -lgdbm -ldb -ldl -lm -lc -lposix -lcrypt libc=/lib/libc-2.1.1.so, so=so, useshrplib=false, libperl=libperl.a Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic' cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib' Locally applied patches: bleadperl @INC for perl v5.6.0: lib /opt/perl-blead/lib/5.6.0/i686-linux /opt/perl-blead/lib/5.6.0 /opt/perl-blead/lib/site_perl/5.6.0/i686-linux /opt/perl-blead/lib/site_perl/5.6.0 /opt/perl-blead/lib/site_perl . Environment for perl v5.6.0: HOME=/home/hv LANG (unset) LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/hv/bin:/usr/bin:/usr/local/bin:/bin:/usr/X11R6/bin PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 23 years ago

From @vanstyn

More​: crypt% ./perl -Dr -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /' 2>&1 | grep restoring   restoring \1 to 0(0)..3   restoring \2 to 0(1882390424)..1 crypt%

That odd number turns out to occur because PL_reg_start_tmp[paren] is \x78455220 = " REx".

Hugo

p5pRT commented 20 years ago

From @smpeters

crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad babadab bababdab bababdad /' bababdad​: -dad-d- crypt%

That's with 5.6.0; with bleadperl\, none of the strings match. I believe it should match 'babadad'.

It appears that there is a problem with your regexp.

/^((.)?a\2)+$/

The "^" is saying start at the beginning of the string\, and none of the strings start with "dad". Also\, the enclosed match must be three characters long. Finally\, the "+$" means can appear one or more times before the end. So\, this regexp is looking for a three character string that is of the form XaX (X is any character) that appears one or more times to the complete end of the string. That said\, the following one-liner demonstrates your regexp...

steve@​kirk sandbox $ perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ dad dadad daddad babadad babadab bababdab bababdad /' dad​: -dad-d- daddad​: -dad-d-

"dad" matches exactly as does "daddad". "dadad" does not match since the "dad" is not repeated completely throughout the string from beginning to end.

This problem does not appear to be a bug with Perl\, but with the regular expression.

p5pRT commented 20 years ago

@smpeters - Status changed from 'open' to 'rejected'

p5pRT commented 20 years ago

From @hvds

"Steve Peters via RT" \perlbug\-followup@​perl\.org wrote​: :> crypt% perl -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ :> babadad babadab bababdab bababdad /' :> bababdad​: -dad-d- :> crypt% :> :> That's with 5.6.0; with bleadperl\, none of the strings match. :> I believe it should match 'babadad'. :> : :It appears that there is a problem with your regexp. : :/^((.)?a\2)+$/ : :The "^" is saying start at the beginning of the string\, and none of the :strings start with "dad". Also\, the enclosed match must be three :characters long.

That depends on your definitions\, and I think the problem here is that the regexp explores behaviour that has never been explicitly defined.

If the '?' is moved inside the parens​:   /^((.?)a\2)+$/ then expected behaviour is clear - match any string comprised entirely of one or more consecutive occurrences of 'a' and/or 'XaX'.

As written though\, it isn't clear what the following '\2' should match when the (.)? does not capture - it could match an empty string (as it does when the '?' is inside)\, or it could match the value it captured the last time round (which is what it did in 5.6.0\, hence the match on "bababdad")\, or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when trying a path in which the (.)? does not match\, a subsequent attempt to match \2 fails. I'm not sure whether this is explicitly documented anywhere\, nor whether it is tested. It feels like reasonable behaviour though\, since the other sensible behaviour is catered for by (.?)\, and I'd class the original behaviour of 5.6.0 (retain the previous value of $2) as a clear bug.

It is however slightly unfortunate\, in that the alternative definition makes it possible to match against captures that would only be defined later​:   perl -wle 'print "triangular" if ("x" x shift(@​ARGV)) =~ /^(\1x)*$/' 10

Such forms can always be modified to give the desire behaviour though\, and usually with only a small change; in the above case\, for example\, you can use /^((^|\1)x)*$/.

Hugo

p5pRT commented 20 years ago

@smpeters - Status changed from 'rejected' to 'open'

p5pRT commented 20 years ago

From @ysth

On Sat\, Sep 18\, 2004 at 12​:16​:05AM +0100\, hv@​crypt.org wrote​:

As written though\, it isn't clear what the following '\2' should match when the (.)? does not capture - it could match an empty string (as it does when the '?' is inside)\, or it could match the value it captured the last time round (which is what it did in 5.6.0\, hence the match on "bababdad")\, or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when trying a path in which the (.)? does not match\, a subsequent attempt to match \2 fails. I'm not sure whether this is explicitly documented anywhere\, nor whether it is tested. It feels like reasonable behaviour though\, since the other sensible behaviour is catered for by (.?)\, and I'd class the original behaviour of 5.6.0 (retain the previous value of $2) as a clear bug.

I second the reasonableness of this behaviour.

p5pRT commented 20 years ago

From @smpeters

On Sunday 19 September 2004 02​:18 pm\, Yitzchak Scott-Thoennes wrote​:

On Sat\, Sep 18\, 2004 at 12​:16​:05AM +0100\, hv@​crypt.org wrote​:

As written though\, it isn't clear what the following '\2' should match when the (.)? does not capture - it could match an empty string (as it does when the '?' is inside)\, or it could match the value it captured the last time round (which is what it did in 5.6.0\, hence the match on "bababdad")\, or it could refuse to match anything at all.

Behaviour from 5.6.1 onwards appears to take the last option​: when trying a path in which the (.)? does not match\, a subsequent attempt to match \2 fails. I'm not sure whether this is explicitly documented anywhere\, nor whether it is tested. It feels like reasonable behaviour though\, since the other sensible behaviour is catered for by (.?)\, and I'd class the original behaviour of 5.6.0 (retain the previous value of $2) as a clear bug.

I second the reasonableness of this behaviour.

With all this being said\, is it agreed that the current behavior is correct\, but that we need test conditions and documentation to insure the continuation of this behavior going forward?

Steve Peters steve@​fisharerojo.org

p5pRT commented 20 years ago

From @hvds

Steve Peters \steve@​fisharerojo\.org wrote​: :With all this being said\, is it agreed that the current behavior is correct\, :but that we need test conditions and documentation to insure the continuation :of this behavior going forward?

I think so\, but I haven't actually checked the documentation and existing tests to verify that there is currently a gap.

(I've also only just noticed that the original bug report came from me\, but luckily my previous response is no different than it would have been with that knowledge. :)

I note that my subsequent comment is still true with latest bleadperl​:

zen% ./perl -Ilib -Dr -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /' 2>&1 | grep restoring   restoring \1 to 0(0)..3   restoring \2 to 0(1882132400)..1 zen%

.. and I assume my analysis from then is also still true​:

:That odd number turns out to occur because PL_reg_start_tmp[paren] :is \x78455220 = " REx".

Hugo

p5pRT commented 17 years ago

From @rurban

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

t/op/re_tests​: test new behaviour for bug #3589 to close it. Todo​: documentation. See http​://rt.perl.org/rt3/Public/Bug/Display.html?id=3589 -- Reini Urban

p5pRT commented 17 years ago

From @rurban

pl-bug3589.patch ```diff difforig perl-current/t/op/re_tests 2007-07-02 Reini Urban * t/op/re_tests: test correct new behaviour for bug #3589 to close it. Todo: documentation. See http://rt.perl.org/rt3/Public/Bug/Display.html?id=3589 diff -ub perl-current/t/op/re_tests.orig perl-current/t/op/re_tests --- perl-current/t/op/re_tests.orig 2007-06-18 15:14:31.000000000 +0000 +++ perl-current/t/op/re_tests 2007-07-02 21:50:56.430125000 +0000 @@ -261,6 +261,8 @@ ((\3|b)\2(a)x)+ aaxabxbaxbbx n - - ((\3|b)\2(a)x)+ aaaxabaxbaaxbbax y $&-$1-$2-$3 bbax-bbax-b-a ((\3|b)\2(a)){2,} bbaababbabaaaaabbaaaabba y $&-$1-$2-$3 bbaaaabba-bba-b-a +#Bug #3589 - up to perl-5.6.0 matches incorrectly, from 5.6.1 not anymore +^((.)?a\2)+$ babadad n - - (a)|(b) b y $-[0] 0 (a)|(b) b y $+[0] 1 (a)|(b) b y x$-[1] x ```
p5pRT commented 17 years ago

From @rgs

On 02/07/07\, Reini Urban via RT \perlbug\-followup@​perl\.org wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks\, applied as #31530.

p5pRT commented 16 years ago

From module@renee-baecker.de

On Mi. 04. Jul. 2007\, 06​:39​:37\, rafael wrote​:

On 02/07/07\, Reini Urban via RT \perlbug\-followup@​perl\.org wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks\, applied as #31530.

Attached is a patch for perlre.pod. If this patch is ok\, this ticket can be closed...

p5pRT commented 16 years ago

From module@renee-baecker.de

perlre_pod.patch ```diff --- old/perl-5.10.0/pod/perlre.pod 2007-12-18 11:47:08.000000000 +0100 +++ perl-5.10.0/pod/perlre.pod 2008-05-27 10:33:47.000000000 +0200 @@ -599,6 +599,11 @@ which makes it easier to write code that tests for a series of more specific cases and remembers the best match. +B: If an optional match (e.g. C<(.)?>) does not match, the +subsequent attempt to match the correspondent capture buffer fails. + + 'bababdad' =~ /^((.)?a\2)+$/ + B: Once Perl sees that you need one of C<$&>, C<$`>, or C<$'> anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl ```
p5pRT commented 16 years ago

From p5p@perl.wizbit.be

Quoting reneeb via RT \perlbug\-followup@&#8203;perl\.org​:

On Mi. 04. Jul. 2007\, 06​:39​:37\, rafael wrote​:

On 02/07/07\, Reini Urban via RT \perlbug\-followup@&#8203;perl\.org wrote​:

Trying to close old RT tickets...

Attached patch adds the missing test for this new behaviour.

Thanks\, applied as #31530.

Attached is a patch for perlre.pod. If this patch is ok\, this ticket can be closed...

I feel that the example should be clearer or it needs more explenation...

From the example I can't immediatly tell what is happening and/or
what it special about it that it deserves a \. (I'm sure that
reading the bug report will clarify this but that shouldn't be a
requirement)

Kind regards\,

Bram

p5pRT commented 14 years ago

From @gannett-ggreer

The remaining part of this ticket is clarifying in the documentation the behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does not match\, because the "\2" always fails (rather than allowing an empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does match\, because an empty result is valid in the capture.

-- George Greer

p5pRT commented 7 years ago

From @jkeenan

On Sun\, 25 Jul 2010 19​:21​:33 GMT\, greerga wrote​:

The remaining part of this ticket is clarifying in the documentation the behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does not match\, because the "\2" always fails (rather than allowing an empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does match\, because an empty result is valid in the capture.

I believe this ticket is closable because the additional documentation needed was\, in my reading\, added by Karl Williamson in commit d8b950dc in June 2010. 'pod/perlre.pod' has a section on Capture Groups which contains this​:

##### The bracketing construct "( ... )" creates capture groups (also referred to as capture buffers). To refer to the current contents of a group later on\, within the same pattern\, use "\g1" (or "\g{1}") for the first\, "\g2" (or "\g{2}") for the second\, and so on. This is called a *backreference*. There is no limit to the number of captured substrings that you may use. Groups are numbered with the leftmost open parenthesis being number 1\, etc. If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) #####

I'll put this ticket out of its misery within 7 days unless someone wants to carry on the discussion.

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 7 years ago

From @demerphq

On 28 December 2016 at 03​:15\, James E Keenan via RT \perlbug\-followup@&#8203;perl\.org wrote​:

On Sun\, 25 Jul 2010 19​:21​:33 GMT\, greerga wrote​:

The remaining part of this ticket is clarifying in the documentation the behavior of back-references to optional captures. For example​:

./perl -Ilib -wle '/^((.)?a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does not match\, because the "\2" always fails (rather than allowing an empty match) if the "." didn't match. However​:

./perl -Ilib -wle '/^((.?)a\2)+$/ and print "$_​: -$1-$2-" for qw/ babadad /'

does match\, because an empty result is valid in the capture.

I believe this ticket is closable because the additional documentation needed was\, in my reading\, added by Karl Williamson in commit d8b950dc in June 2010. 'pod/perlre.pod' has a section on Capture Groups which contains this​:

##### The bracketing construct "( ... )" creates capture groups (also referred to as capture buffers). To refer to the current contents of a group later on\, within the same pattern\, use "\g1" (or "\g{1}") for the first\, "\g2" (or "\g{2}") for the second\, and so on. This is called a *backreference*. There is no limit to the number of captured substrings that you may use. Groups are numbered with the leftmost open parenthesis being number 1\, etc. If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) #####

I think that should be beefed up a bit to say something like

Note that a quantified group which matches zero things counts as a "non match"\, so none of the following will match​:

"aba"=~/a(x)*b\1a/ "aba"=~/a(x)?b\1a/

However when the quantifier is on the inside of the group it will​:

"aba"=~/a(x*)b\1a/ "aba"=~/a(x?)b\1a/

Put another way\, perl's regex engine distinguishes between "captured the empty string" and "an optional capture did not match".

You can see this in the following code​:

$ perl -le'if ("aba"=~/a(x*)b/) { print "matched​: "\, defined($1) ? "\<$1>" : "undef" } else { print "rejected" }' matched​: \<>

$ perl -le'if ("aba"=~/a(x)*b/) { print "matched​: "\, defined($1) ? "\<$1>" : "undef" } else { print "rejected" }' matched​: undef

Both patterns match\, but leave the contents of the $1 capture group in a different state. Back references only match when the capture group they reference contained a defined value.

Yves -- perl -Mre=debug -e "/just|another|perl|hacker/"