Ancient Regex Regression

p5pRT commented 5 years ago

Migrated from rt.perl.org#133352 (status was 'open')

Searchable as RT133352$

p5pRT commented 5 years ago

From @deven

Created by @deven

I discovered a bug in Perl's regular expression engine a few months ago. I showed it to many people at The Perl Conference in Salt Lake City a couple weeks ago\, and everyone agreed that this was a bug in the regex engine in Perl itself\, including Abigail\, Tom Christiansen\, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug:

https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case\, which works with or without anchors:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match against "b" in "bar" on the second iteration\, but this branch of the alternation fails. The capture is still being used\, despite the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful. Perl 1.0 can't handle this regex (Larry said it was the regex engine from Gosling Emacs)\, but Perl 2.0 through Perl 5.0 alpha 9 all print "a" for the command above. Other regex implementations\, such as PCRE\, RE2\, GNU and others\, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository (commit a0d0e21ea6ea90a22318550944fe6cb09ae10cda) which exhibits the bug\, printing "b" instead of "a". I just built blead again today and confirmed that the bug is still there\, despite passing the full test suite. (Tom Christiansen pointed out that this bug is technically a regression\, since it used to work correctly.)

Even though I have never worked on the Perl core\, and I've been warned that the regex engine is particularly difficult\, I would still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this bug without breaking any of the regular expression tests in the test suite\, so I think I'm on the right track\, but I think there may be a few edge cases to consider\, so I'm not ready to submit the patch just yet.

Yves\, SawyerX thought you might be willing to mentor me on this? If so\, that would be great!

My solution involves saving the captures with regcppush() on BRANCH and TRIE nodes and restoring them with regcp_restore() at BRANCH_next_fail and TRIE_next_fail. Does that sound like the right approach\, give or take?

Perl Info

``` Flags: category=core severity=low Site configuration information for perl 5.29.1: Configured by deven at Sat Jul 7 22:07:54 EDT 2018. Summary of my perl5 (revision 5 version 29 subversion 1) configuration: Commit id: 71525f77826ad33944c007b06b68a1f14a085e7a Platform: osname=linux osvers=3.19.8-100.fc20.x86_64 archname=x86_64-linux-thread-multi uname='linux twist.ties.org 3.19.8-100.fc20.x86_64 #1 smp tue may 12 17:08:50 utc 2015 x86_64 x86_64 x86_64 gnulinux ' config_args='' hint=previous useposix=true d_sigaction=define useithreads=define usemultiplicity=define use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define bincompat5005=undef Compiler: cc='gcc' ccflags ='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64' optimize='-g' cppflags='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include' ccversion='' gccversion='4.8.3 20140911 (Red Hat 4.8.3-7)' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='gcc' ldflags =' -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64 /usr/local/lib /usr/lib libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.18.so so=so useshrplib=false libperl=libperl.a gnulibc_version='2.18' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E' cccdlflags='-fPIC' lddlflags='-shared -g -L/usr/local/lib -fstack-protector-strong' @INC for perl 5.29.1: lib /usr/local/lib/perl5/site_perl/5.29.1/x86_64-linux-thread-multi /usr/local/lib/perl5/site_perl/5.29.1 /usr/local/lib/perl5/5.29.1/x86_64-linux-thread-multi /usr/local/lib/perl5/5.29.1 Environment for perl 5.29.1: HOME=/home/deven LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/deven/bin:/home/deven/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/etc:/sbin:/home/deven/bin PERL_BADLANG (unset) SHELL=/bin/bash ```

p5pRT commented 5 years ago

From @jkeenan

On Mon\, 09 Jul 2018 04:04:32 GMT\, dcorzine wrote:

This is a bug report for perl from deven@ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

----------------------------------------------------------------- [Please describe your issue here]

I discovered a bug in Perl's regular expression engine a few months ago. I showed it to many people at The Perl Conference in Salt Lake City a couple weeks ago\, and everyone agreed that this was a bug in the regex engine in Perl itself\, including Abigail\, Tom Christiansen\, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug:

https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case\, which works with or without anchors:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match against "b" in "bar" on the second iteration\, but this branch of the alternation fails. The capture is still being used\, despite the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful. Perl 1.0 can't handle this regex (Larry said it was the regex engine from Gosling Emacs)\, but Perl 2.0 through Perl 5.0 alpha 9 all print "a" for the command above. Other regex implementations\, such as PCRE\, RE2\, GNU and others\, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository (commit a0d0e21ea6ea90a22318550944fe6cb09ae10cda) which exhibits the bug\, printing "b" instead of "a". I just built blead again today and confirmed that the bug is still there\, despite passing the full test suite. (Tom Christiansen pointed out that this bug is technically a regression\, since it used to work correctly.)

Even though I have never worked on the Perl core\, and I've been warned that the regex engine is particularly difficult\, I would still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this bug without breaking any of the regular expression tests in the test suite\, so I think I'm on the right track\, but I think there may be a few edge cases to consider\, so I'm not ready to submit the patch just yet.

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Yves\, SawyerX thought you might be willing to mentor me on this? If so\, that would be great!

My solution involves saving the captures with regcppush() on BRANCH and TRIE nodes and restoring them with regcp_restore() at BRANCH_next_fail and TRIE_next_fail. Does that sound like the right approach\, give or take?

Thank you very much.

-- James E Keenan (jkeenan@cpan.org)

p5pRT commented 5 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 10\, 2018 at 1:09 PM\, James E Keenan via RT \< perlbug-followup@perl.org> wrote:

On Mon\, 09 Jul 2018 04:04:32 GMT\, dcorzine wrote:

This is a bug report for perl from deven@ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Fair enough. I'll do that now.

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 10\, 2018 at 1:09 PM\, James E Keenan via RT \< perlbug-followup@perl.org> wrote:

On Mon\, 09 Jul 2018 04:04:32 GMT\, dcorzine wrote:

This is a bug report for perl from deven@ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Fair enough. I'll do that now.

Deven

p5pRT commented 5 years ago

From @hvds

On Sun\, 08 Jul 2018 21:04:32 -0700\, dcorzine wrote:

This was my test case\, which works with or without anchors:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

The fact that other examples don't match such an interpretation argues against it: % perl -wle 'use Data::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [ 'bar'\, 'foo'\, 'bar' ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

p5pRT commented 5 years ago

From @deven

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track.

regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

p5pRT commented 5 years ago

From @deven

perl-133136-test1.patch

```diff From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001 From: "Deven T. Corzine" Date: Sat, 23 Jun 2018 23:17:03 -0400 Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures This is a first pass at a patch to fix RT #133352, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch, as happens with the following test case, which has been broken ever since Perl 5.000 in 1994: perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;' Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while Perl 5.000 through today's blead all incorrectly print "b" instead. This patch appears to work (making the test case above print "a" again), and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet, for a number of reasons: * I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN(), I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch. * I haven't developed any new regression tests to exercise this bug and the patch to fix it, and I know that both BRANCH and TRIE variations need to be tested. * This patch is currently saving ALL captures, including captures that can't possibly change inside the alternation. This is an easier fix, but I believe the performance impact should be mitigated, since this would impact many everyday regular expressions which aren't affected by this bug in the first place. * I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix, but I haven't yet found a clean way to implement this. * Another possibility might be to create new regnode types to use only for alternations containing capture groups, maybe only for ones which also have a quantifier applied, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug, but it might be hard to tell if a quantifier is applied for qr// cases.) * Regardless, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed, 19 insertions(+), 12 deletions(-) diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate; + /* save state of captures at branch start */ + ST.cp = regcppush(rex, 0, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) case TRIE_next_fail: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch, - * e.g. - * /(?:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex, ST.lastcp, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n", @@ -8043,7 +8042,10 @@ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex, 0, maxopenparen); + REGCP_SET(ST.lastcp); /* Now go into the branch */ if (has_cutgroup) { @@ -8077,8 +8079,10 @@ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex, ST.lastcp, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @@ -754,6 +754,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; } branchlike; @@ -763,6 +764,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; regnode *next_branch; /* next branch node */ } branch; @@ -773,6 +775,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; U32 accepted; /* how many accepting states left */ bool longfold;/* saw a fold with a 1->n char mapping */ -- 2.7.4 ```

p5pRT commented 5 years ago

From @deven

On Tue\, 10 Jul 2018 12:19:30 -0700\, hv wrote:

On Sun\, 08 Jul 2018 21:04:32 -0700\, dcorzine wrote:

This was my test case\, which works with or without anchors:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef\, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration\, but the inner group does successfully match during the first iteration only\, so "a" is indeed the last successful match.

I see two different viewpoints here\, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration\, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively\, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group\, so $2 should be "a" since that's the last successful match for that group\, despite the confusion of matching $1 and $2 from different iterations.

Personally\, the $2="a" viewpoint seems like a stronger argument to me\, but I could be in the minority in thinking that. I like the $2=undef viewpoint too\, but we can't have both. In terms of how regular expressions are defined and documented\, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet\, if there is.

For what it's worth\, other regular expression engines like PCRE\, RE2\, GNU and others all return "a"\, which seems to suggest there may be some sort of consensus that "a" is the correct answer\, but maybe it's just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues against it: % perl -wle 'use Data::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [ 'bar'\, 'foo'\, 'bar' ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop iterations\, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at?

Deven

p5pRT commented 5 years ago

From @deven

[I originally entered this via RT\, but I didn't see it come though p5p\, so I'm resending directly\, hopefully it's not a duplicate!]

On Tue\, 10 Jul 2018 12:19:30 -0700\, hv wrote:

On Sun\, 08 Jul 2018 21:04:32 -0700\, dcorzine wrote:

This was my test case\, which works with or without anchors:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef\, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration\, but the inner group does successfully match during the first iteration only\, so "a" is indeed the last successful match.

I see two different viewpoints here\, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration\, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively\, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group\, so $2 should be "a" since that's the last successful match for that group\, despite the confusion of matching $1 and $2 from different iterations.

Personally\, the $2="a" viewpoint seems like a stronger argument to me\, but I could be in the minority in thinking that. I like the $2=undef viewpoint too\, but we can't have both. In terms of how regular expressions are defined and documented\, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet\, if there is.

For what it's worth\, other regular expression engines like PCRE\, RE2\, GNU and others all return "a"\, which seems to suggest there may be some sort of consensus that "a" is the correct answer\, but maybe it's just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues against it: % perl -wle 'use Data::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [ 'bar'\, 'foo'\, 'bar' ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop iterations\, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at?

Deven

p5pRT commented 5 years ago

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

Deven

On Tue\, Jul 10\, 2018 at 3:19 PM\, Deven T. Corzine via RT \perlbug\-followup@perl\.org wrote:

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994:
 perl \-e 'print "$2\\n" if "afoobar" =~ /^$\(\.$foo|bar\)\*$/;'
Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

--- via perlbug: queue: perl5 status: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001 From: "Deven T. Corzine" \deven@ties\.org Date: Sat\, 23 Jun 2018 23:17:03 -0400 Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994:
 perl \-e 'print "$2\\n" if "afoobar" =~ /^$\(\.$foo|bar\)\*$/;'
Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @@ -5920\,6 +5920\,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @@ -6067\,15 +6071\,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) case TRIE_next_fail: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\, @@ -8043\,7 +8042\,10 @@ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);
        /\* Now go into the branch \*/
        if $has\_cutgroup$ \{
@@ -8077\,8 +8079\,10 @@ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @@ -754\,6 +754\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
     \} branchlike;
@@ -763\,6 +764\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
        regnode \*next\_branch; /\* next branch node \*/
    \} branch;
@@ -773\,6 +775\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/
-- 2.7.4

p5pRT commented 5 years ago

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

Deven

On Tue\, Jul 10\, 2018 at 3:19 PM\, Deven T. Corzine via RT \perlbug\-followup@perl\.org wrote:

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994:
 perl \-e 'print "$2\\n" if "afoobar" =~ /^$\(\.$foo|bar\)\*$/;'
Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

--- via perlbug: queue: perl5 status: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001 From: "Deven T. Corzine" \deven@ties\.org Date: Sat\, 23 Jun 2018 23:17:03 -0400 Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994:
 perl \-e 'print "$2\\n" if "afoobar" =~ /^$\(\.$foo|bar\)\*$/;'
Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @@ -5920\,6 +5920\,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @@ -6067\,15 +6071\,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) case TRIE_next_fail: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\, @@ -8043\,7 +8042\,10 @@ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);
        /\* Now go into the branch \*/
        if $has\_cutgroup$ \{
@@ -8077\,8 +8079\,10 @@ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @@ -754\,6 +754\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
     \} branchlike;
@@ -763\,6 +764\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
        regnode \*next\_branch; /\* next branch node \*/
    \} branch;
@@ -773\,6 +775\,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;
        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/
-- 2.7.4

p5pRT commented 5 years ago

From @iabyn

On Sun\, Jul 15\, 2018 at 03:52:30PM -0400\, Deven T. Corzine wrote:

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

-- Art is anything that has a label (especially if the label is "untitled 1")

p5pRT commented 5 years ago

From @iabyn

On Sun\, Jul 15\, 2018 at 03:52:30PM -0400\, Deven T. Corzine wrote:

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

-- Art is anything that has a label (especially if the label is "untitled 1")

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 16\, 2018 at 4:13 AM\, Dave Mitchell \davem@iabyn\.com wrote:

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

I look forward to hearing your feedback\, thanks Dave! That's definitely complex code! It took quite a few hours of study to understand it well enough to make that patch\, though it's only a few lines long.

The patch still needs work -- right now it's saving and restoring ALL the captures\, which is overkill. There's no need to save and restore captures that are outside the alternation\, and that could affect the performance of many regexes which aren't affected by this bug in the first place.

My initial inclination was try setting a paren floor (which CURLYX does)\, but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated\, is the memory cleared or uninitialized?

Deven

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 16\, 2018 at 4:13 AM\, Dave Mitchell \davem@iabyn\.com wrote:

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

I look forward to hearing your feedback\, thanks Dave! That's definitely complex code! It took quite a few hours of study to understand it well enough to make that patch\, though it's only a few lines long.

The patch still needs work -- right now it's saving and restoring ALL the captures\, which is overkill. There's no need to save and restore captures that are outside the alternation\, and that could affect the performance of many regexes which aren't affected by this bug in the first place.

My initial inclination was try setting a paren floor (which CURLYX does)\, but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated\, is the memory cleared or uninitialized?

Deven

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 10\, 2018 at 02:14:03PM -0700\, Deven T. Corzine via RT wrote:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work:

print "1=[$1]\n" if "AA-ABB-" =~ /^ (?: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match:

print "matched\n" if "B" =~ /^ (?: A(.) | B ) \1 $/x; print "matched\n" if "B" =~ /^ (?: A | B | (.) ) \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century. -- Things That Never Happen in "Star Trek" #14

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 10\, 2018 at 02:14:03PM -0700\, Deven T. Corzine via RT wrote:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work:

print "1=[$1]\n" if "AA-ABB-" =~ /^ (?: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match:

print "matched\n" if "B" =~ /^ (?: A(.) | B ) \1 $/x; print "matched\n" if "B" =~ /^ (?: A | B | (.) ) \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century. -- Things That Never Happen in "Star Trek" #14

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 11:02 AM\, Dave Mitchell \davem@iabyn\.com wrote:

On Tue\, Jul 10\, 2018 at 02:14:03PM -0700\, Deven T. Corzine via RT wrote:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work:
print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ $?&#8203;: \\1? \(\\w$ \\1 \- \)\* $/x
which outputs 'B'.

On the first iteration: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference\, but it's sort of an inverted backreference. I don't think I've ever seen that done before\, but it seems to me that this ought to be valid\, unusual though it is.

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match:
print "matched\\n" if "B" =~ /^ $?&#8203;: A\(\.$ | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ $?&#8203;: A | B | \(\.$ \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand\, my patch does allow this example to match:

print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead:

print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:

"foobar" =~ /^ (?: (foo) | (bar) )* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

That's a good point. Still\, it's worth considering that restoring the original captures is what Perl used to do in Perl 2.0 through Perl 5.0 alpha 9\, and it's also what PCRE\, RE2 and other regex implementations do also. Javascript does seem to invalidate the capture instead\, but that's the only counter-example I've seen so far.

Strangely\, I can't find a single word in the documentation about what happens with captures when quantifiers are applied. We're used to the capture being replaced with the new one\, yet I can't find anything saying that it does that! All of it seems to be unspecified behavior. Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 11:02 AM\, Dave Mitchell \davem@iabyn\.com wrote:

On Tue\, Jul 10\, 2018 at 02:14:03PM -0700\, Deven T. Corzine via RT wrote:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work:
print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ $?&#8203;: \\1? \(\\w$ \\1 \- \)\* $/x
which outputs 'B'.

On the first iteration: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference\, but it's sort of an inverted backreference. I don't think I've ever seen that done before\, but it seems to me that this ought to be valid\, unusual though it is.

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match:
print "matched\\n" if "B" =~ /^ $?&#8203;: A\(\.$ | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ $?&#8203;: A | B | \(\.$ \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand\, my patch does allow this example to match:

print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead:

print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:

"foobar" =~ /^ (?: (foo) | (bar) )* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

That's a good point. Still\, it's worth considering that restoring the original captures is what Perl used to do in Perl 2.0 through Perl 5.0 alpha 9\, and it's also what PCRE\, RE2 and other regex implementations do also. Javascript does seem to invalidate the capture instead\, but that's the only counter-example I've seen so far.

Strangely\, I can't find a single word in the documentation about what happens with captures when quantifiers are applied. We're used to the capture being replaced with the new one\, yet I can't find anything saying that it does that! All of it seems to be unspecified behavior. Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

Deven

p5pRT commented 5 years ago

From @davidnicol

This is a good test case for the bug:

On the other hand\, my patch does allow this example to match:

print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead:

print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

This is a good test case for the bug:

On the other hand\, my patch does allow this example to match:

print "matched\n" if "ABCDA" =~ /^ (?: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead:

print "matched\n" if "ABCDC" =~ /^ (?: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 12:58 PM David Nicol \davidnicol@gmail\.com wrote:

This is a good test case for the bug:

On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 12:58 PM David Nicol \davidnicol@gmail\.com wrote:

This is a good test case for the bug:

On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

p5pRT commented 5 years ago

From @davidnicol

On Tue\, Jul 17\, 2018 at 12:56 PM Deven T. Corzine \deven@ties\.org wrote:

On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.
Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs. Honestly\, before today I expected regex constructions like

"abcdef" =~ /(?:(.))+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) * ( **"ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them. This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level. But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1 > $2 > $3"' C > A > F

will do the right thing\, whatever that is.

Thank you

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

On Tue\, Jul 17\, 2018 at 12:56 PM Deven T. Corzine \deven@ties\.org wrote:

On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.
Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs. Honestly\, before today I expected regex constructions like

"abcdef" =~ /(?:(.))+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) * ( **"ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them. This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level. But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1 > $2 > $3"' C > A > F

will do the right thing\, whatever that is.

Thank you

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 17\, 2018 at 11:42:11AM -0400\, Deven T. Corzine wrote:

On Tue\, Jul 17\, 2018 at 11:02 AM\, Dave Mitchell \davem@iabyn\.com wrote:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:
 "foobar" =~ /^ $?&#8203;: \(foo$ | $bar$ \)\* $/x;
Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted:

If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.)

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

-- Little fly\, thy summer's play my thoughtless hand has terminated with extreme prejudice. (with apologies to William Blake)

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 17\, 2018 at 11:42:11AM -0400\, Deven T. Corzine wrote:

On Tue\, Jul 17\, 2018 at 11:02 AM\, Dave Mitchell \davem@iabyn\.com wrote:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:
 "foobar" =~ /^ $?&#8203;: \(foo$ | $bar$ \)\* $/x;
Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted:

If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.)

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

-- Little fly\, thy summer's play my thoughtless hand has terminated with extreme prejudice. (with apologies to William Blake)

p5pRT commented 5 years ago

From @davidnicol

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:
 "foobar" =~ /^ $?&#8203;: \(foo$ | $bar$ \)\* $/x;
Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?
It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted:
If a group did not match\, the associated backreference won't match
either\. $This can happen if the group is optional\, or in a different
branch of an alternation\.$

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor bar $ perl -le '"foobargox" =~ /^ (?: ([fg]o)(o) | (bar) )* /x; print "1:$1;2:$2;3:$3"' 1:fo;2:o;3:bar

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?: ([fg]o)(o) | gox | (bar) )* /x; print "1:$1;2:$2;3:$3"' 1:go;2:o;3:bar

When exactly does the assignment happen?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:
 "foobar" =~ /^ $?&#8203;: \(foo$ | $bar$ \)\* $/x;
Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?
It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted:
If a group did not match\, the associated backreference won't match
either\. $This can happen if the group is optional\, or in a different
branch of an alternation\.$

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor bar $ perl -le '"foobargox" =~ /^ (?: ([fg]o)(o) | (bar) )* /x; print "1:$1;2:$2;3:$3"' 1:fo;2:o;3:bar

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?: ([fg]o)(o) | gox | (bar) )* /x; print "1:$1;2:$2;3:$3"' 1:go;2:o;3:bar

When exactly does the assignment happen?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 3:54 PM\, David Nicol \davidnicol@gmail\.com wrote:

On Tue\, Jul 17\, 2018 at 12:56 PM Deven T. Corzine \deven@ties\.org wrote:
On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.
Deven
Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

Exactly! Now you've got it.

For reference\, my original example was: "afoobar" =~ /((.)foo|bar)*/

On the second iteration\, (.) does successfully match the "b" in "bar"\, so it saves that capture in $2 before the "foo" attempts to match against the remaining "ar"\, which obviously fails. Since the branch fails as a whole\, the capture of "b" is invalid\, yet it gets returned anyhow. This is the bug.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

In my "afoobar" example\, the inner group doesn't match successfully against "bar"\, so it shouldn't return "b" as it does. But that group DID successfully match "a"\, so it seems to me that it ought to be returned from the successful match. If the match text were "barbar"\, then $2 should be undef because the inner group wouldn't match at all.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs.

I think you're right about this. I've searched the documentation pretty carefully and haven't found anything that describes what happens when capture groups are repeated. I'm surprised by this\, but I can't find it anywhere.

Honestly\, before today I expected regex constructions like
           "abcdef" =~ /$?&#8203;:\(\.$\)\+/
to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

It's not silent about this\, it's in the sentence immediately before the one you quoted! "Groups are numbered with the leftmost open parenthesis being number 1\, etc." This only applies to parens used for capturing groups\, not other parens (e.g. non-capturing groups\, lookahead matches\, etc.) -- just count the open parens to find the group number. (The branch-reset construct (?|...) is an exception to this rule\, reusing the same group numbers for each branch.)

Since your example above only has a single capturing group (.)\, it's always $1 and there is no $2. When your non-capturing group repeats\, it just keeps replacing $1 with each letter\, returning "f" at the end.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) ( "ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

In the example above\, "C" is clearly the wrong result for $1\, since there is no "CB" in "ABCD". There should be no version check to whitewash the results -- this is a regression test and those older versions should FAIL this test because they DO contain this bug!

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch

I don't know how I ended up typing the wrong ticket number into that filename! Of course\, it should have been 133352\, not 133136...

I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them.

I'm sure it's possible\, and I've been wondering if it would be better or worse.

This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level.

Yes\, it would require handling the possibility of multiple levels of provisional captures.

But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1

$2 > $3"' C > A > F

will do the right thing\, whatever that is.

The correct answer is A > A > F\, which is what it returns with my patch. The bug doesn't affect the other groups in this example.

I get where you're going with this. Here's an example that makes the point more strongly:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print Dumper("ABCDEFGHIJKLMHO" =~ /^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O)+$/);'

Perl currently returns the wrong values for ALL of these captures:

$1 = 'H'; $2 = 'D'; $3 = 'O'; $4 = 'L';

With my patch\, the correct values are returned:

$1 = 'A'; $2 = 'B'; $3 = 'I'; $4 = 'J';

With a bit more mangling\, I came up with this example as well:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print Dumper("ABCDEHIJKLMHO" =~ /^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O|.*?)+$/);'

Currently\, Perl only gets $2 correct for this example\, and the other captures are wrong:

$1 = 'H'; $2 = undef; $3 = 'O'; $4 = 'L';

Again\, with my patch\, the correct values are returned:

$1 = undef; $2 = undef; $3 = 'I'; $4 = 'J';

Isn't this fun? :)

Thank you

Thanks for participating in the discussion!

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 3:54 PM\, David Nicol \davidnicol@gmail\.com wrote:

On Tue\, Jul 17\, 2018 at 12:56 PM Deven T. Corzine \deven@ties\.org wrote:
On the other hand\, my patch does allow this example to match:
 print "matched\\n" if "ABCDA" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
Without my patch\, this matches instead:
 print "matched\\n" if "ABCDC" =~ /^ $?&#8203;: \(\.$B | CD \)\* \\1 $/x;
That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.
Deven
Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

Exactly! Now you've got it.

For reference\, my original example was: "afoobar" =~ /((.)foo|bar)*/

On the second iteration\, (.) does successfully match the "b" in "bar"\, so it saves that capture in $2 before the "foo" attempts to match against the remaining "ar"\, which obviously fails. Since the branch fails as a whole\, the capture of "b" is invalid\, yet it gets returned anyhow. This is the bug.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

In my "afoobar" example\, the inner group doesn't match successfully against "bar"\, so it shouldn't return "b" as it does. But that group DID successfully match "a"\, so it seems to me that it ought to be returned from the successful match. If the match text were "barbar"\, then $2 should be undef because the inner group wouldn't match at all.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs.

I think you're right about this. I've searched the documentation pretty carefully and haven't found anything that describes what happens when capture groups are repeated. I'm surprised by this\, but I can't find it anywhere.

Honestly\, before today I expected regex constructions like
           "abcdef" =~ /$?&#8203;:\(\.$\)\+/
to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

It's not silent about this\, it's in the sentence immediately before the one you quoted! "Groups are numbered with the leftmost open parenthesis being number 1\, etc." This only applies to parens used for capturing groups\, not other parens (e.g. non-capturing groups\, lookahead matches\, etc.) -- just count the open parens to find the group number. (The branch-reset construct (?|...) is an exception to this rule\, reusing the same group numbers for each branch.)

Since your example above only has a single capturing group (.)\, it's always $1 and there is no $2. When your non-capturing group repeats\, it just keeps replacing $1 with each letter\, returning "f" at the end.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) ( "ABCD" =~ /^(?:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

In the example above\, "C" is clearly the wrong result for $1\, since there is no "CB" in "ABCD". There should be no version check to whitewash the results -- this is a regression test and those older versions should FAIL this test because they DO contain this bug!

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch

I don't know how I ended up typing the wrong ticket number into that filename! Of course\, it should have been 133352\, not 133136...

I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them.

I'm sure it's possible\, and I've been wondering if it would be better or worse.

This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level.

Yes\, it would require handling the possibility of multiple levels of provisional captures.

But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?:(?:(.)B|CD)+|(?:(.)D|A(.))*)+/ and print "$1

$2 > $3"' C > A > F

will do the right thing\, whatever that is.

The correct answer is A > A > F\, which is what it returns with my patch. The bug doesn't affect the other groups in this example.

I get where you're going with this. Here's an example that makes the point more strongly:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print Dumper("ABCDEFGHIJKLMHO" =~ /^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O)+$/);'

Perl currently returns the wrong values for ALL of these captures:

$1 = 'H'; $2 = 'D'; $3 = 'O'; $4 = 'L';

With my patch\, the correct values are returned:

$1 = 'A'; $2 = 'B'; $3 = 'I'; $4 = 'J';

With a bit more mangling\, I came up with this example as well:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print Dumper("ABCDEHIJKLMHO" =~ /^(?:(.)(?:(.)C|DE)+FG|H(.)(?:(.)K|LM)+|.O|.*?)+$/);'

Currently\, Perl only gets $2 correct for this example\, and the other captures are wrong:

$1 = 'H'; $2 = undef; $3 = 'O'; $4 = 'L';

Again\, with my patch\, the correct values are returned:

$1 = undef; $2 = undef; $3 = 'I'; $4 = 'J';

Isn't this fun? :)

Thank you

Thanks for participating in the discussion!

Deven

p5pRT commented 5 years ago

From @deven

On Wed\, Jul 18\, 2018 at 4:23 AM\, Dave Mitchell \davem@iabyn\.com wrote:

On Tue\, Jul 17\, 2018 at 11:42:11AM -0400\, Deven T. Corzine wrote:
On Tue\, Jul 17\, 2018 at 11:02 AM\, Dave Mitchell \davem@iabyn\.com wrote:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example:
 "foobar" =~ /^ $?&#8203;: \(foo$ | $bar$ \)\* $/x;
Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?
It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted:
If a group did not match\, the associated backreference won't match
either\. $This can happen if the group is optional\, or in a different
branch of an alternation\.$

This says "if a group DID NOT match" -- but in the example above\, the "(foo)" group DID match against the "foo" in "foobar"\, and the "(bar)" group DID match against the "bar" in "foobar"\, so how does this quote apply here?

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

I can totally see that. I have several ideas on possible ways to solve this; do you have any thoughts on the best approach?

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

Maybe I'm misunderstanding\, but isn't lastopen going to point at the group inside the branch after the first time it gets captured?

My current thinking is to cache the initial value of maxopenparen the first time a BRANCH or TRIE is executed\, and use that as the paren floor for future iterations.

Do you have any suggestions on how to cache that value? I haven't thought of a good solution yet.

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Invalidating the captures would have less performance impact\, but it would be a significant change to the regex semantics\, since there could be a lot of code out there relying on the captures to remain valid. This bug escaped notice for 24 years\, but invalidating the captures could break working code that would be unaffected by saving and restoring the captures\, as my patch does.

Consider this arbitrary example:

perl -MData::Dumper -e '$Data::Dumper::Varname=""; print Dumper("135813885" =~ /^(?:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(0)|.)*$/);'

This has separate captures for each ASCII digit which can be used to find out whether or not each digit exists in the string:

$1 = '1'; $2 = undef; $3 = '3'; $4 = undef; $5 = '5'; $6 = undef; $7 = undef; $8 = '8'; $9 = undef; $10 = undef;

If you invalidate the captures every time the branch iterates\, only the capture for the last digit ($5) would be set here\, and all the rest would be undef. On the other hand\, saving and restoring the captures doesn't change the behavior of this example at all.

I realize that digits are a little simplistic\, but a similar approach could be used to search for specific typographical errors in a block of text: /^(?:(\bTHe\b)|(\bteh\b)|.*?)*$/

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

Perl 6 returns nested Match objects for nested captures and arrays for quantified captures -- it never clobbers one capture with another the way most regex engines do.

For my "afoobar" =~ /((.)foo|bar)*/ example\, during the second iteration\, Perl 5 re-captures "bar" to $1\, clobbering the original value of "afoo" from the first iteration. Perl 6 returns both captures instead:

"afoobar" ~~ /((.)foo|bar)*/ ｢afoobar｣ 0 => ｢afoo｣ 0 => ｢a｣ 0 => ｢bar｣

Here is the actual object structure returned in Perl 6:

("afoobar" ~~ /((.)foo|bar)*/).perl Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ([Match.new(made => Any\, pos => 4\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => (Match.new(made => Any\, pos => 1\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ())\,))\, Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 4\, list => ())]\,))

This could be approximated with basic Perl 5 data types like this perhaps:

# Top-level array corresponding to top-level captures. [ # Possibly include the overall regex match ($&) here? "afoobar"\,

# Array for the outer group ($1)\, because it is quantified. [ # Arrays for each iteration because there are also nested captures...

# Array for captures from the first iteration of the outer group ($1). [ # Capture of outer group ($1) from first iteration. "afoo"\,

# Capture of inner group ($2) from first iteration. "a"\, ]\,

# Array for captures from the second iteration of the outer group ($1). [ # Capture of outer group ($1) from second iteration. "bar"\,

# Capture of inner group ($2) from second iteration\, or omit this perhaps? undef\, ]\,

# No more iterations of the outer group ($1) match successfully. ]\,

# No more top-level captures in the regex. ]

This example is for illustrative purposes; such a data structure is subject to bikeshedding\, of course. If there are named captures\, they should probably be in a hash somewhere\, for example. That's not important at the moment.

My basic question was whether or not Perl 5 ought to have this capability to retain the "afoo" match instead of always clobbering quantified captures -- perhaps only when enabled with a flag on the regex match\, for efficiency?

What do you think?