Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.85k stars 527 forks source link

Ancient Regex Regression #16616

Open p5pRT opened 5 years ago

p5pRT commented 5 years ago

Migrated from rt.perl.org#133352 (status was 'open')

Searchable as RT133352$

p5pRT commented 5 years ago

From @deven

Created by @deven

I discovered a bug in Perl's regular expression engine a few months ago. I showed it to many people at The Perl Conference in Salt Lake City a couple weeks ago\, and everyone agreed that this was a bug in the regex engine in Perl itself\, including Abigail\, Tom Christiansen\, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug​:

  https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case\, which works with or without anchors​:

  "afoobar" =~ /((.)foo|bar)*/   "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command​:

  perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match against "b" in "bar" on the second iteration\, but this branch of the alternation fails. The capture is still being used\, despite the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful. Perl 1.0 can't handle this regex (Larry said it was the regex engine from Gosling Emacs)\, but Perl 2.0 through Perl 5.0 alpha 9 all print "a" for the command above. Other regex implementations\, such as PCRE\, RE2\, GNU and others\, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository (commit a0d0e21ea6ea90a22318550944fe6cb09ae10cda) which exhibits the bug\, printing "b" instead of "a". I just built blead again today and confirmed that the bug is still there\, despite passing the full test suite. (Tom Christiansen pointed out that this bug is technically a regression\, since it used to work correctly.)

Even though I have never worked on the Perl core\, and I've been warned that the regex engine is particularly difficult\, I would still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this bug without breaking any of the regular expression tests in the test suite\, so I think I'm on the right track\, but I think there may be a few edge cases to consider\, so I'm not ready to submit the patch just yet.

Yves\, SawyerX thought you might be willing to mentor me on this? If so\, that would be great!

My solution involves saving the captures with regcppush() on BRANCH and TRIE nodes and restoring them with regcp_restore() at BRANCH_next_fail and TRIE_next_fail. Does that sound like the right approach\, give or take?

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.29.1: Configured by deven at Sat Jul 7 22:07:54 EDT 2018. Summary of my perl5 (revision 5 version 29 subversion 1) configuration: Commit id: 71525f77826ad33944c007b06b68a1f14a085e7a Platform: osname=linux osvers=3.19.8-100.fc20.x86_64 archname=x86_64-linux-thread-multi uname='linux twist.ties.org 3.19.8-100.fc20.x86_64 #1 smp tue may 12 17:08:50 utc 2015 x86_64 x86_64 x86_64 gnulinux ' config_args='' hint=previous useposix=true d_sigaction=define useithreads=define usemultiplicity=define use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define bincompat5005=undef Compiler: cc='gcc' ccflags ='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64' optimize='-g' cppflags='-DDEBUGGING -D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include' ccversion='' gccversion='4.8.3 20140911 (Red Hat 4.8.3-7)' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='gcc' ldflags =' -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib /usr/lib /lib/../lib64 /usr/lib/../lib64 /lib /lib64 /usr/lib64 /usr/local/lib64 /usr/local/lib /usr/lib libs=-lpthread -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.18.so so=so useshrplib=false libperl=libperl.a gnulibc_version='2.18' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E' cccdlflags='-fPIC' lddlflags='-shared -g -L/usr/local/lib -fstack-protector-strong' @INC for perl 5.29.1: lib /usr/local/lib/perl5/site_perl/5.29.1/x86_64-linux-thread-multi /usr/local/lib/perl5/site_perl/5.29.1 /usr/local/lib/perl5/5.29.1/x86_64-linux-thread-multi /usr/local/lib/perl5/5.29.1 Environment for perl 5.29.1: HOME=/home/deven LANG=en_US.UTF-8 LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/deven/bin:/home/deven/scripts:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/etc:/sbin:/home/deven/bin PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 5 years ago

From @jkeenan

On Mon\, 09 Jul 2018 04​:04​:32 GMT\, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

----------------------------------------------------------------- [Please describe your issue here]

I discovered a bug in Perl's regular expression engine a few months ago. I showed it to many people at The Perl Conference in Salt Lake City a couple weeks ago\, and everyone agreed that this was a bug in the regex engine in Perl itself\, including Abigail\, Tom Christiansen\, Karl Williamson and Larry Wall.

I even ended up doing a lightning talk about the bug​:

https://www.youtube.com/watch?v=U-JhPIECkPY

This was my test case\, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"!

I understand why this is happening -- the inner group does match against "b" in "bar" on the second iteration\, but this branch of the alternation fails. The capture is still being used\, despite the fact that it came from a failed branch of the alternation.

The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful. Perl 1.0 can't handle this regex (Larry said it was the regex engine from Gosling Emacs)\, but Perl 2.0 through Perl 5.0 alpha 9 all print "a" for the command above. Other regex implementations\, such as PCRE\, RE2\, GNU and others\, also return "a" for the inner group.

Perl 5.000 (from 1994) is the first commit in the git repository (commit a0d0e21ea6ea90a22318550944fe6cb09ae10cda) which exhibits the bug\, printing "b" instead of "a". I just built blead again today and confirmed that the bug is still there\, despite passing the full test suite. (Tom Christiansen pointed out that this bug is technically a regression\, since it used to work correctly.)

Even though I have never worked on the Perl core\, and I've been warned that the regex engine is particularly difficult\, I would still like to attempt to develop a patch for this bug myself.

I've already managed to create a working patch that fixes this bug without breaking any of the regular expression tests in the test suite\, so I think I'm on the right track\, but I think there may be a few edge cases to consider\, so I'm not ready to submit the patch just yet.

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Yves\, SawyerX thought you might be willing to mentor me on this? If so\, that would be great!

My solution involves saving the captures with regcppush() on BRANCH and TRIE nodes and restoring them with regcp_restore() at BRANCH_next_fail and TRIE_next_fail. Does that sound like the right approach\, give or take?

Thank you very much.

-- James E Keenan (jkeenan@​cpan.org)

p5pRT commented 5 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 10\, 2018 at 1​:09 PM\, James E Keenan via RT \< perlbug-followup@​perl.org> wrote​:

On Mon\, 09 Jul 2018 04​:04​:32 GMT\, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Fair enough. I'll do that now.

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 10\, 2018 at 1​:09 PM\, James E Keenan via RT \< perlbug-followup@​perl.org> wrote​:

On Mon\, 09 Jul 2018 04​:04​:32 GMT\, dcorzine wrote​:

This is a bug report for perl from deven@​ties.org\, generated with the help of perlbug 1.41 running under perl 5.29.1.

[...]

Please submit the patch as an attachment. Even if it's not complete\, it gives us something to run through the test suite and a starting point for discussion.

Fair enough. I'll do that now.

Deven

p5pRT commented 5 years ago

From @hvds

On Sun\, 08 Jul 2018 21​:04​:32 -0700\, dcorzine wrote​:

This was my test case\, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

The fact that other examples don't match such an interpretation argues against it​: % perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [   'bar'\,   'foo'\,   'bar'   ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

p5pRT commented 5 years ago

From @deven

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994​:

  perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that   I may have overlooked with this patch. Since TRIE_next_fail included   a test for ST.jump which would call UNWIND_PAREN()\, I want to better   understand how that was working and determine whether ST.jump being   set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and   the patch to fix it\, and I know that both BRANCH and TRIE variations   need to be tested.

* This patch is currently saving ALL captures\, including captures that   can't possibly change inside the alternation. This is an easier fix\,   but I believe the performance impact should be mitigated\, since this   would impact many everyday regular expressions which aren't affected   by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely   mitigate most (but not all) of the performance impact of this fix\,   but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only   for alternations containing capture groups\, maybe only for ones which   also have a quantifier applied\, since I believe both are required to   trigger this bug. (This might have to be treated as an optimization   in regcomp.c to use the old logic for alternations that can't trigger   the bug\, but it might be hard to tell if a quantifier is applied for   qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the   regular expression engine (maybe Yves?) about this patch and whether   the approach I've taken is on the right track.


regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

p5pRT commented 5 years ago

From @deven

perl-133136-test1.patch ```diff From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00:00:00 2001 From: "Deven T. Corzine" Date: Sat, 23 Jun 2018 23:17:03 -0400 Subject: [PATCH] Save and restore captures for BRANCH/TRIE failures This is a first pass at a patch to fix RT #133352, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch, as happens with the following test case, which has been broken ever since Perl 5.000 in 1994: perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;' Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command, while Perl 5.000 through today's blead all incorrectly print "b" instead. This patch appears to work (making the test case above print "a" again), and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet, for a number of reasons: * I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN(), I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch. * I haven't developed any new regression tests to exercise this bug and the patch to fix it, and I know that both BRANCH and TRIE variations need to be tested. * This patch is currently saving ALL captures, including captures that can't possibly change inside the alternation. This is an easier fix, but I believe the performance impact should be mitigated, since this would impact many everyday regular expressions which aren't affected by this bug in the first place. * I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix, but I haven't yet found a clean way to implement this. * Another possibility might be to create new regnode types to use only for alternations containing capture groups, maybe only for ones which also have a quantifier applied, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug, but it might be hard to tell if a quantifier is applied for qr// cases.) * Regardless, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed, 19 insertions(+), 12 deletions(-) diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @@ -5920,6 +5920,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate; + /* save state of captures at branch start */ + ST.cp = regcppush(rex, 0, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @@ -6067,15 +6071,10 @@ S_regmatch(pTHX_ regmatch_info *reginfo, char *startpos, regnode *prog) case TRIE_next_fail: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch, - * e.g. - * /(?:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex, ST.lastcp, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n", @@ -8043,7 +8042,10 @@ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex, 0, maxopenparen); + REGCP_SET(ST.lastcp); /* Now go into the branch */ if (has_cutgroup) { @@ -8077,8 +8079,10 @@ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex, ST.lastcp, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @@ -754,6 +754,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; } branchlike; @@ -763,6 +764,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; regnode *next_branch; /* next branch node */ } branch; @@ -773,6 +775,7 @@ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp; U32 accepted; /* how many accepting states left */ bool longfold;/* saw a fold with a 1->n char mapping */ -- 2.7.4 ```
p5pRT commented 5 years ago

From @deven

On Tue\, 10 Jul 2018 12​:19​:30 -0700\, hv wrote​:

On Sun\, 08 Jul 2018 21​:04​:32 -0700\, dcorzine wrote​:

This was my test case\, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef\, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration\, but the inner group does successfully match during the first iteration only\, so "a" is indeed the last successful match.

I see two different viewpoints here\, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration\, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively\, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group\, so $2 should be "a" since that's the last successful match for that group\, despite the confusion of matching $1 and $2 from different iterations.

Personally\, the $2="a" viewpoint seems like a stronger argument to me\, but I could be in the minority in thinking that. I like the $2=undef viewpoint too\, but we can't have both. In terms of how regular expressions are defined and documented\, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet\, if there is.

For what it's worth\, other regular expression engines like PCRE\, RE2\, GNU and others all return "a"\, which seems to suggest there may be some sort of consensus that "a" is the correct answer\, but maybe it's just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues against it​: % perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [ 'bar'\, 'foo'\, 'bar' ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop iterations\, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at?

Deven

p5pRT commented 5 years ago

From @deven

[I originally entered this via RT\, but I didn't see it come though p5p\, so I'm resending directly\, hopefully it's not a duplicate!]

On Tue\, 10 Jul 2018 12​:19​:30 -0700\, hv wrote​:

On Sun\, 08 Jul 2018 21​:04​:32 -0700\, dcorzine wrote​:

This was my test case\, which works with or without anchors​:

"afoobar" =~ /((.)foo|bar)*/ "afoobar" =~ /^((.)foo|bar)*$/

Or\, as a standalone command​:

perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'

This prints "b"\, even though "bfoo" never appears in "afoobar"! [...] The correct answer seems to be "a"\, since that's the last match of the inner group and the overall match is successful.

It's by no means clear to me that that must be the correct answer; the other candidate worth considering is that\, since in /(...)*/ we return the match for the last iteration of the group\, you'd expect further captures embedded within there also to deliver the version from the last iteration of the group.

Under that interpretation\, the correct answer would be undef\, and the earlier releases that returned 'a' were merely more subtly wrong than the current release.

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef\, and it's certainly counter-intuitive to some degree to have $1="bar" from the second iteration and $2="a" from the first iteration\, but the inner group does successfully match during the first iteration only\, so "a" is indeed the last successful match.

I see two different viewpoints here\, and both are quite reasonable. One viewpoint is that these are nested groups and both should be returning results from the same iteration\, and therefore $2 should be undef because the nested match doesn't match anything on that iteration. Intuitively\, this feels right. The other viewpoint is that each group can match multiple times and we only get to keep one capture per group\, so $2 should be "a" since that's the last successful match for that group\, despite the confusion of matching $1 and $2 from different iterations.

Personally\, the $2="a" viewpoint seems like a stronger argument to me\, but I could be in the minority in thinking that. I like the $2=undef viewpoint too\, but we can't have both. In terms of how regular expressions are defined and documented\, I'm having trouble rationalizing $2=undef even though it sounds good.

Is there anything definitive in the documentation that would resolve the question without ambiguity? I haven't found it yet\, if there is.

For what it's worth\, other regular expression engines like PCRE\, RE2\, GNU and others all return "a"\, which seems to suggest there may be some sort of consensus that "a" is the correct answer\, but maybe it's just a gray area with no definite right answer?

The fact that other examples don't match such an interpretation argues against it​: % perl -wle 'use Data​::Dumper; print Dumper([ "foobar" =~ /((foo)|(bar))*/ ])' $VAR1 = [ 'bar'\, 'foo'\, 'bar' ]; % .. but I don't recall seeing docs to justify such results. I'll have to have another wade through them.

Hugo

That example is certainly returning $2 and $3 from different loop iterations\, and it's less clear in this case that returning $2=undef would somehow be preferable or more intuitive for this example. Is that what you were getting at?

Deven

p5pRT commented 5 years ago

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

Deven

On Tue\, Jul 10\, 2018 at 3​:19 PM\, Deven T. Corzine via RT \perlbug\-followup@&#8203;perl\.org wrote​:

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

--- via perlbug​: queue​: perl5 status​: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00​:00​:00 2001 From​: "Deven T. Corzine" \deven@&#8203;ties\.org Date​: Sat\, 23 Jun 2018 23​:17​:03 -0400 Subject​: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @​@​ -5920\,6 +5920\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @​@​ -6067\,15 +6071\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) case TRIE_next_fail​: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?​:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\, @​@​ -8043\,7 +8042\,10 @​@​ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);

        /\* Now go into the branch \*/
        if \(has\_cutgroup\) \{

@​@​ -8077\,8 +8079\,10 @​@​ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @​@​ -754\,6 +754\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

     \} branchlike;

@​@​ -763\,6 +764\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

        regnode \*next\_branch; /\* next branch node \*/
    \} branch;

@​@​ -773\,6 +775\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/

-- 2.7.4

p5pRT commented 5 years ago

From @deven

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

Deven

On Tue\, Jul 10\, 2018 at 3​:19 PM\, Deven T. Corzine via RT \perlbug\-followup@&#8203;perl\.org wrote​:

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

--- via perlbug​: queue​: perl5 status​: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=133352

From 1a53f4dc4ab5032ea9cd0bf254adc0902695daa4 Mon Sep 17 00​:00​:00 2001 From​: "Deven T. Corzine" \deven@&#8203;ties\.org Date​: Sat\, 23 Jun 2018 23​:17​:03 -0400 Subject​: [PATCH] Save and restore captures for BRANCH/TRIE failures

This is a first pass at a patch to fix RT #133352\, to save and restore all captures during alternations (BRANCH and TRIE) to avoid incorrectly returning a capture from a failed branch\, as happens with the following test case\, which has been broken ever since Perl 5.000 in 1994​:

 perl \-e 'print "$2\\n" if "afoobar" =~ /^\(\(\.\)foo|bar\)\*$/;'

Perl 2.0 through Perl 5.0 alpha 9 all print "a" for this command\, while Perl 5.000 through today's blead all incorrectly print "b" instead.

This patch appears to work (making the test case above print "a" again)\, and I am submitting it as a preliminary patch for discussion purposes; I don't consider it ready to apply yet\, for a number of reasons​:

* I haven't finished analyzing the code to check for edge cases that I may have overlooked with this patch. Since TRIE_next_fail included a test for ST.jump which would call UNWIND_PAREN()\, I want to better understand how that was working and determine whether ST.jump being set might have any implications for this patch.

* I haven't developed any new regression tests to exercise this bug and the patch to fix it\, and I know that both BRANCH and TRIE variations need to be tested.

* This patch is currently saving ALL captures\, including captures that can't possibly change inside the alternation. This is an easier fix\, but I believe the performance impact should be mitigated\, since this would impact many everyday regular expressions which aren't affected by this bug in the first place.

* I believe that setting a paren floor (which CURLYX does) would likely mitigate most (but not all) of the performance impact of this fix\, but I haven't yet found a clean way to implement this.

* Another possibility might be to create new regnode types to use only for alternations containing capture groups\, maybe only for ones which also have a quantifier applied\, since I believe both are required to trigger this bug. (This might have to be treated as an optimization in regcomp.c to use the old logic for alternations that can't trigger the bug\, but it might be hard to tell if a quantifier is applied for qr// cases.)

* Regardless\, I would like feedback from an expert on this part of the regular expression engine (maybe Yves?) about this patch and whether the approach I've taken is on the right track. --- regexec.c | 28 ++++++++++++++++------------ regexp.h | 3 +++ 2 files changed\, 19 insertions(+)\, 12 deletions(-)

diff --git a/regexec.c b/regexec.c index ba52ae9..1d42b0d 100644 --- a/regexec.c +++ b/regexec.c @​@​ -5920\,6 +5920\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @​@​ -6067\,15 +6071\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) case TRIE_next_fail​: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?​:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\, @​@​ -8043\,7 +8042\,10 @​@​ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);

        /\* Now go into the branch \*/
        if \(has\_cutgroup\) \{

@​@​ -8077\,8 +8079\,10 @​@​ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) { diff --git a/regexp.h b/regexp.h index 44409f0..aa61c97 100644 --- a/regexp.h +++ b/regexp.h @​@​ -754\,6 +754\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

     \} branchlike;

@​@​ -763\,6 +764\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

        regnode \*next\_branch; /\* next branch node \*/
    \} branch;

@​@​ -773\,6 +775\,7 @​@​ typedef struct regmatch_state { U32 lastparen; U32 lastcloseparen; CHECKPOINT cp; + CHECKPOINT lastcp;

        U32         accepted; /\* how many accepting states left \*/
        bool        longfold;/\* saw a fold with a 1\->n char mapping \*/

-- 2.7.4

p5pRT commented 5 years ago

From @iabyn

On Sun\, Jul 15\, 2018 at 03​:52​:30PM -0400\, Deven T. Corzine wrote​:

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

-- Art is anything that has a label (especially if the label is "untitled 1")

p5pRT commented 5 years ago

From @iabyn

On Sun\, Jul 15\, 2018 at 03​:52​:30PM -0400\, Deven T. Corzine wrote​:

Has anyone taken a look at this patch yet? I'd like to hear what others think so far\, thanks!

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

-- Art is anything that has a label (especially if the label is "untitled 1")

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 16\, 2018 at 4​:13 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

I look forward to hearing your feedback\, thanks Dave! That's definitely complex code! It took quite a few hours of study to understand it well enough to make that patch\, though it's only a few lines long.

The patch still needs work -- right now it's saving and restoring ALL the captures\, which is overkill. There's no need to save and restore captures that are outside the alternation\, and that could affect the performance of many regexes which aren't affected by this bug in the first place.

My initial inclination was try setting a paren floor (which CURLYX does)\, but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated\, is the memory cleared or uninitialized?

Deven

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 16\, 2018 at 4​:13 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

I plan to look at it at some point\, but I hate and fear the capture save and restore code in regmatch()\, which has kept me away from looking at your patch yet. Maybe in a few days.

I look forward to hearing your feedback\, thanks Dave! That's definitely complex code! It took quite a few hours of study to understand it well enough to make that patch\, though it's only a few lines long.

The patch still needs work -- right now it's saving and restoring ALL the captures\, which is overkill. There's no need to save and restore captures that are outside the alternation\, and that could affect the performance of many regexes which aren't affected by this bug in the first place.

My initial inclination was try setting a paren floor (which CURLYX does)\, but I haven't figured out a clean way to do that yet.

When a regmatch_state is allocated\, is the memory cleared or uninitialized?

Deven

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 10\, 2018 at 02​:14​:03PM -0700\, Deven T. Corzine via RT wrote​:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work​:

  print "1=[$1]\n" if "AA-ABB-" =~ /^ (?​: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration​:   the first \1 fails to match anything\, and is skipped via the '?';   the second \1 matches 'A'

On the second iteration​:   the first \1 matches 'A' - the value from the last iteration   the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match​:

  print "matched\n" if "B" =~ /^ (?​: A(.) | B ) \1 $/x;   print "matched\n" if "B" =~ /^ (?​: A | B | (.) ) \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century.   -- Things That Never Happen in "Star Trek" #14

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 10\, 2018 at 02​:14​:03PM -0700\, Deven T. Corzine via RT wrote​:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work​:

  print "1=[$1]\n" if "AA-ABB-" =~ /^ (?​: \1? (\w) \1 - )* $/x

which outputs 'B'.

On the first iteration​:   the first \1 fails to match anything\, and is skipped via the '?';   the second \1 matches 'A'

On the second iteration​:   the first \1 matches 'A' - the value from the last iteration   the second \1 matches 'B' - the value from the current iteration

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match​:

  print "matched\n" if "B" =~ /^ (?​: A(.) | B ) \1 $/x;   print "matched\n" if "B" =~ /^ (?​: A | B | (.) ) \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

-- The Enterprise is involved in a bizarre time-warp experience which is in some way unconnected with the Late 20th Century.   -- Things That Never Happen in "Star Trek" #14

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Jul 10\, 2018 at 02​:14​:03PM -0700\, Deven T. Corzine via RT wrote​:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work​:

print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ \(?&#8203;: \\1? \(\\w\) \\1 \- \)\* $/x

which outputs 'B'.

On the first iteration​: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration​: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference\, but it's sort of an inverted backreference. I don't think I've ever seen that done before\, but it seems to me that this ought to be valid\, unusual though it is.

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match​:

print "matched\\n" if "B" =~ /^ \(?&#8203;: A\(\.\) | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ \(?&#8203;: A | B | \(\.\) \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand\, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

  "foobar" =~ /^ (?​: (foo) | (bar) )* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

That's a good point. Still\, it's worth considering that restoring the original captures is what Perl used to do in Perl 2.0 through Perl 5.0 alpha 9\, and it's also what PCRE\, RE2 and other regex implementations do also. Javascript does seem to invalidate the capture instead\, but that's the only counter-example I've seen so far.

Strangely\, I can't find a single word in the documentation about what happens with captures when quantifiers are applied. We're used to the capture being replaced with the new one\, yet I can't find anything saying that it does that! All of it seems to be unspecified behavior. Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Jul 10\, 2018 at 02​:14​:03PM -0700\, Deven T. Corzine via RT wrote​:

Yeah\, that's why I said the correct "seems" to be "a". There's a decent argument for returning undef

My take on it is based on two (possibly conflicting) ideals.

The first is that when starting the (N+1)th iteration of a '*' expression or similar\, the captures from the Nth iteration are still valid\, until over-written by the (N+1)th iteration. This allows patterns of this form to work​:

print "1=\[$1\]\\n" if "AA\-ABB\-" =~ /^ \(?&#8203;: \\1? \(\\w\) \\1 \- \)\* $/x

which outputs 'B'.

On the first iteration​: the first \1 fails to match anything\, and is skipped via the '?'; the second \1 matches 'A'

On the second iteration​: the first \1 matches 'A' - the value from the last iteration the second \1 matches 'B' - the value from the current iteration

Interesting. It's not a forward reference\, but it's sort of an inverted backreference. I don't think I've ever seen that done before\, but it seems to me that this ought to be valid\, unusual though it is.

The second principle is that following an alternation\, the captures from branches that failed or weren't tried\, are invalid.

Neither of these match​:

print "matched\\n" if "B" =~ /^ \(?&#8203;: A\(\.\) | B \) \\1 $/x;
print "matched\\n" if "B" =~ /^ \(?&#8203;: A | B | \(\.\) \) \\1 $/x;

My patch doesn't change the behavior of those examples.

On the other hand\, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

  "foobar" =~ /^ (?​: (foo) | (bar) )* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

I think you could argue it either way. However\, since this bug has been around since forever\, with no-one apparently noticing it before\, I think can fix it how we like - so we should pick whichever is easiest to implement.

That's a good point. Still\, it's worth considering that restoring the original captures is what Perl used to do in Perl 2.0 through Perl 5.0 alpha 9\, and it's also what PCRE\, RE2 and other regex implementations do also. Javascript does seem to invalidate the capture instead\, but that's the only counter-example I've seen so far.

Strangely\, I can't find a single word in the documentation about what happens with captures when quantifiers are applied. We're used to the capture being replaced with the new one\, yet I can't find anything saying that it does that! All of it seems to be unspecified behavior. Is it somewhere I'm missing?

Invalidating captures set by a failing branch involves just knowing the max index at the start and end of the branch execution\, and invalidating everything in between; restoring previous values involves saving a whole set of capture indices and restoring them on failure (which is what I think your patch does). The latter sounds a whole lot more expensive\, and would potentially slow down all alterations.

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

Deven

p5pRT commented 5 years ago

From @davidnicol

This is a good test case for the bug​:

On the other hand\, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

This is a good test case for the bug​:

On the other hand\, my patch does allow this example to match​:

  print "matched\n" if "ABCDA" =~ /^ (?​: (.)B | CD )* \1 $/x;

Without my patch\, this matches instead​:

  print "matched\n" if "ABCDC" =~ /^ (?​: (.)B | CD )* \1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 12​:58 PM David Nicol \davidnicol@&#8203;gmail\.com wrote​:

This is a good test case for the bug​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 12​:58 PM David Nicol \davidnicol@&#8203;gmail\.com wrote​:

This is a good test case for the bug​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

The bug appears to be the result of an optimization of describing the capture buffer with an offset -- essentially a dynamic substring expression -- rather than copying the captured string into it. Were the capture buffer to be copied into\, it would get 'A' (the character before the B) rather than 'C ' (the first character in the match) and the behavior would be the same as what the other regex engines do.

Is that what the patch changes?

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

p5pRT commented 5 years ago

From @davidnicol

On Tue\, Jul 17\, 2018 at 12​:56 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs. Honestly\, before today I expected regex constructions like

  "abcdef" =~ /(?​:(.))+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

As an opinionated person\, I'm in favor of fixing the regression and including

  # we don't clobber capture groups with data from failed alternate branches (although we used to) * ( **"ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them. This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level. But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print "$1 > $2 > $3"' C > A > F

will do the right thing\, whatever that is.

Thank you

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

On Tue\, Jul 17\, 2018 at 12​:56 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs. Honestly\, before today I expected regex constructions like

  "abcdef" =~ /(?​:(.))+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

As an opinionated person\, I'm in favor of fixing the regression and including

  # we don't clobber capture groups with data from failed alternate branches (although we used to) * ( **"ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq **( $] ge '5.027' ? 'A' : 'C' ))*

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them. This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level. But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print "$1 > $2 > $3"' C > A > F

will do the right thing\, whatever that is.

Thank you

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 17\, 2018 at 11​:42​:11AM -0400\, Deven T. Corzine wrote​:

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

  If a group did not match\, the associated backreference won't match   either. (This can happen if the group is optional\, or in a different   branch of an alternation.)

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

-- Little fly\, thy summer's play my thoughtless hand has terminated with extreme prejudice.   (with apologies to William Blake)

p5pRT commented 5 years ago

From @iabyn

On Tue\, Jul 17\, 2018 at 11​:42​:11AM -0400\, Deven T. Corzine wrote​:

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

  If a group did not match\, the associated backreference won't match   either. (This can happen if the group is optional\, or in a different   branch of an alternation.)

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

-- Little fly\, thy summer's play my thoughtless hand has terminated with extreme prejudice.   (with apologies to William Blake)

p5pRT commented 5 years ago

From @davidnicol

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor bar $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:fo;2​:o;3​:bar

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:go;2​:o;3​:bar

When exactly does the assignment happen?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

"go" matches $1 but does not clobber $1 because "gox" is neither [fg]oo nor bar $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:fo;2​:o;3​:bar

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:go;2​:o;3​:bar

When exactly does the assignment happen?

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 3​:54 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

On Tue\, Jul 17\, 2018 at 12​:56 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

Exactly! Now you've got it.

For reference\, my original example was​: "afoobar" =~ /((.)foo|bar)*/

On the second iteration\, (.) does successfully match the "b" in "bar"\, so it saves that capture in $2 before the "foo" attempts to match against the remaining "ar"\, which obviously fails. Since the branch fails as a whole\, the capture of "b" is invalid\, yet it gets returned anyhow. This is the bug.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

In my "afoobar" example\, the inner group doesn't match successfully against "bar"\, so it shouldn't return "b" as it does. But that group DID successfully match "a"\, so it seems to me that it ought to be returned from the successful match. If the match text were "barbar"\, then $2 should be undef because the inner group wouldn't match at all.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs.

I think you're right about this. I've searched the documentation pretty carefully and haven't found anything that describes what happens when capture groups are repeated. I'm surprised by this\, but I can't find it anywhere.

Honestly\, before today I expected regex constructions like

           "abcdef" =~ /\(?&#8203;:\(\.\)\)\+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

It's not silent about this\, it's in the sentence immediately before the one you quoted! "Groups are numbered with the leftmost open parenthesis being number 1\, etc." This only applies to parens used for capturing groups\, not other parens (e.g. non-capturing groups\, lookahead matches\, etc.) -- just count the open parens to find the group number. (The branch-reset construct (?|...) is an exception to this rule\, reusing the same group numbers for each branch.)

Since your example above only has a single capturing group (.)\, it's always $1 and there is no $2. When your non-capturing group repeats\, it just keeps replacing $1 with each letter\, returning "f" at the end.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) ( "ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

In the example above\, "C" is clearly the wrong result for $1\, since there is no "CB" in "ABCD". There should be no version check to whitewash the results -- this is a regression test and those older versions should FAIL this test because they DO contain this bug!

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch

I don't know how I ended up typing the wrong ticket number into that filename! Of course\, it should have been 133352\, not 133136...

I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them.

I'm sure it's possible\, and I've been wondering if it would be better or worse.

This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level.

Yes\, it would require handling the possibility of multiple levels of provisional captures.

But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print "$1

$2 > $3"' C > A > F

will do the right thing\, whatever that is.

The correct answer is A > A > F\, which is what it returns with my patch. The bug doesn't affect the other groups in this example.

I get where you're going with this. Here's an example that makes the point more strongly​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("ABCDEFGHIJKLMHO" =~ /^(?​:(.)(?​:(.)C|DE)+FG|H(.)(?​:(.)K|LM)+|.O)+$/);'

Perl currently returns the wrong values for ALL of these captures​:

$1 = 'H'; $2 = 'D'; $3 = 'O'; $4 = 'L';

With my patch\, the correct values are returned​:

$1 = 'A'; $2 = 'B'; $3 = 'I'; $4 = 'J';

With a bit more mangling\, I came up with this example as well​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("ABCDEHIJKLMHO" =~ /^(?​:(.)(?​:(.)C|DE)+FG|H(.)(?​:(.)K|LM)+|.O|.*?)+$/);'

Currently\, Perl only gets $2 correct for this example\, and the other captures are wrong​:

$1 = 'H'; $2 = undef; $3 = 'O'; $4 = 'L';

Again\, with my patch\, the correct values are returned​:

$1 = undef; $2 = undef; $3 = 'I'; $4 = 'J';

Isn't this fun? :)

Thank you

Thanks for participating in the discussion!

Deven

p5pRT commented 5 years ago

From @deven

On Tue\, Jul 17\, 2018 at 3​:54 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

On Tue\, Jul 17\, 2018 at 12​:56 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On the other hand\, my patch does allow this example to match​:

 print "matched\\n" if "ABCDA" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

Without my patch\, this matches instead​:

 print "matched\\n" if "ABCDC" =~ /^ \(?&#8203;: \(\.\)B | CD \)\* \\1 $/x;

That optimization doesn't cause the bug\, it's the attempt to match the (.) again against "CD" that causes it -- the (.) matches\, but the "D" doesn't\, and it doesn't restore the original capture.

Deven

Thank you\, I misunderstood. So in the original demonstration\, the "b" got into $2 before the branch failed because the b was not followed by "foo"\, not due to $2 being internally tracked as an offset\, and as that branch had succeeded\, the capture was assignable.

Exactly! Now you've got it.

For reference\, my original example was​: "afoobar" =~ /((.)foo|bar)*/

On the second iteration\, (.) does successfully match the "b" in "bar"\, so it saves that capture in $2 before the "foo" attempts to match against the remaining "ar"\, which obviously fails. Since the branch fails as a whole\, the capture of "b" is invalid\, yet it gets returned anyhow. This is the bug.

As the current documentation (the section on "Capture Groups" in https://perldoc.perl.org/perlre.html, accessed just now) states "If a group did not match\, the associated backreference won't match either. (This can happen if the group is optional\, or in a different branch of an alternation.) " there is clearly a bug somewhere. On the other hand\, as CDCDC fails to match the test while CBCDC (unsurprisingly\, but for surprising reason) does\, so there is some kind of "did this match" knowledge happening\, otherwise CDCDC would set \1 to C before failing to match the B\, and the implementation could be interpreted as conformant with the documentation's "if a group did not match" but it takes a lot of squinting.

In my "afoobar" example\, the inner group doesn't match successfully against "bar"\, so it shouldn't return "b" as it does. But that group DID successfully match "a"\, so it seems to me that it ought to be returned from the successful match. If the match text were "barbar"\, then $2 should be undef because the inner group wouldn't match at all.

The current documentation (that section) contains no guidance concerning capture groups within repeating constructs.

I think you're right about this. I've searched the documentation pretty carefully and haven't found anything that describes what happens when capture groups are repeated. I'm surprised by this\, but I can't find it anywhere.

Honestly\, before today I expected regex constructions like

           "abcdef" =~ /\(?&#8203;:\(\.\)\)\+/

to magically create $1 through $6 and load them all. This was erroneous! That's not how it works! The documentation is silent on the matter.

It's not silent about this\, it's in the sentence immediately before the one you quoted! "Groups are numbered with the leftmost open parenthesis being number 1\, etc." This only applies to parens used for capturing groups\, not other parens (e.g. non-capturing groups\, lookahead matches\, etc.) -- just count the open parens to find the group number. (The branch-reset construct (?|...) is an exception to this rule\, reusing the same group numbers for each branch.)

Since your example above only has a single capturing group (.)\, it's always $1 and there is no $2. When your non-capturing group repeats\, it just keeps replacing $1 with each letter\, returning "f" at the end.

As an opinionated person\, I'm in favor of fixing the regression and including

# we don't clobber capture groups with data from failed alternate branches (although we used to) ( "ABCD" =~ /^(?​:(.)B|CD)*$/ and $1 eq ( $] ge '5.027' ? 'A' : 'C' ))

into the test suite and documenting how captures into buffers in alternations that passed in earlier iterations but not the most recent one used to work\, in perldoc perlre.

In the example above\, "C" is clearly the wrong result for $1\, since there is no "CB" in "ABCD". There should be no version check to whitewash the results -- this is a regression test and those older versions should FAIL this test because they DO contain this bug!

... After looking at https://rt-archive.perl.org/perl5/Ticket/Attachment/1566563/824618/perl-133136-test1.patch

I don't know how I ended up typing the wrong ticket number into that filename! Of course\, it should have been 133352\, not 133136...

I wonder if it might be possible to defer the assignment into the capture buffers until after branches have succeeded\, rather than resetting them.

I'm sure it's possible\, and I've been wondering if it would be better or worse.

This approach might require making a set of provisional capture buffers at every juncture that could become a descent into an iterating subregex containing captures\, but wouldn't be vulnerable to only operating correctly at the first level.

Yes\, it would require handling the possibility of multiple levels of provisional captures.

But maybe the engine already stacks these things so with the patch

$ perl -e ' "ABCDAFCDAD" =~ /(?​:(?​:(.)B|CD)+|(?​:(.)D|A(.))*)+/ and print "$1

$2 > $3"' C > A > F

will do the right thing\, whatever that is.

The correct answer is A > A > F\, which is what it returns with my patch. The bug doesn't affect the other groups in this example.

I get where you're going with this. Here's an example that makes the point more strongly​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("ABCDEFGHIJKLMHO" =~ /^(?​:(.)(?​:(.)C|DE)+FG|H(.)(?​:(.)K|LM)+|.O)+$/);'

Perl currently returns the wrong values for ALL of these captures​:

$1 = 'H'; $2 = 'D'; $3 = 'O'; $4 = 'L';

With my patch\, the correct values are returned​:

$1 = 'A'; $2 = 'B'; $3 = 'I'; $4 = 'J';

With a bit more mangling\, I came up with this example as well​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("ABCDEHIJKLMHO" =~ /^(?​:(.)(?​:(.)C|DE)+FG|H(.)(?​:(.)K|LM)+|.O|.*?)+$/);'

Currently\, Perl only gets $2 correct for this example\, and the other captures are wrong​:

$1 = 'H'; $2 = undef; $3 = 'O'; $4 = 'L';

Again\, with my patch\, the correct values are returned​:

$1 = undef; $2 = undef; $3 = 'I'; $4 = 'J';

Isn't this fun? :)

Thank you

Thanks for participating in the discussion!

Deven

p5pRT commented 5 years ago

From @deven

On Wed\, Jul 18\, 2018 at 4​:23 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Jul 17\, 2018 at 11​:42​:11AM -0400\, Deven T. Corzine wrote​:

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

This says "if a group DID NOT match" -- but in the example above\, the "(foo)" group DID match against the "foo" in "foobar"\, and the "(bar)" group DID match against the "bar" in "foobar"\, so how does this quote apply here?

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

I can totally see that. I have several ideas on possible ways to solve this; do you have any thoughts on the best approach?

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

Maybe I'm misunderstanding\, but isn't lastopen going to point at the group inside the branch after the first time it gets captured?

My current thinking is to cache the initial value of maxopenparen the first time a BRANCH or TRIE is executed\, and use that as the paren floor for future iterations.

Do you have any suggestions on how to cache that value? I haven't thought of a good solution yet.

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Invalidating the captures would have less performance impact\, but it would be a significant change to the regex semantics\, since there could be a lot of code out there relying on the captures to remain valid. This bug escaped notice for 24 years\, but invalidating the captures could break working code that would be unaffected by saving and restoring the captures\, as my patch does.

Consider this arbitrary example​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("135813885" =~ /^(?​:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(0)|.)*$/);'

This has separate captures for each ASCII digit which can be used to find out whether or not each digit exists in the string​:

$1 = '1'; $2 = undef; $3 = '3'; $4 = undef; $5 = '5'; $6 = undef; $7 = undef; $8 = '8'; $9 = undef; $10 = undef;

If you invalidate the captures every time the branch iterates\, only the capture for the last digit ($5) would be set here\, and all the rest would be undef. On the other hand\, saving and restoring the captures doesn't change the behavior of this example at all.

I realize that digits are a little simplistic\, but a similar approach could be used to search for specific typographical errors in a block of text​: /^(?​:(\bTHe\b)|(\bteh\b)|.*?)*$/

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

Perl 6 returns nested Match objects for nested captures and arrays for quantified captures -- it never clobbers one capture with another the way most regex engines do.

For my "afoobar" =~ /((.)foo|bar)*/ example\, during the second iteration\, Perl 5 re-captures "bar" to $1\, clobbering the original value of "afoo" from the first iteration. Perl 6 returns both captures instead​:

"afoobar" ~~ /((.)foo|bar)*/ 「afoobar」 0 => 「afoo」   0 => 「a」 0 => 「bar」

Here is the actual object structure returned in Perl 6​:

("afoobar" ~~ /((.)foo|bar)*/).perl Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ([Match.new(made => Any\, pos => 4\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => (Match.new(made => Any\, pos => 1\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ())\,))\, Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 4\, list => ())]\,))

This could be approximated with basic Perl 5 data types like this perhaps​:

# Top-level array corresponding to top-level captures. [   # Possibly include the overall regex match ($&) here?   "afoobar"\,

  # Array for the outer group ($1)\, because it is quantified.   [   # Arrays for each iteration because there are also nested captures...

  # Array for captures from the first iteration of the outer group ($1).   [   # Capture of outer group ($1) from first iteration.   "afoo"\,

  # Capture of inner group ($2) from first iteration.   "a"\,   ]\,

  # Array for captures from the second iteration of the outer group ($1).   [   # Capture of outer group ($1) from second iteration.   "bar"\,

  # Capture of inner group ($2) from second iteration\, or omit this perhaps?   undef\,   ]\,

  # No more iterations of the outer group ($1) match successfully.   ]\,

  # No more top-level captures in the regex. ]

This example is for illustrative purposes; such a data structure is subject to bikeshedding\, of course. If there are named captures\, they should probably be in a hash somewhere\, for example. That's not important at the moment.

My basic question was whether or not Perl 5 ought to have this capability to retain the "afoo" match instead of always clobbering quantified captures -- perhaps only when enabled with a flag on the regex match\, for efficiency?

What do you think?

Deven

p5pRT commented 5 years ago

From @deven

On Wed\, Jul 18\, 2018 at 4​:23 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Jul 17\, 2018 at 11​:42​:11AM -0400\, Deven T. Corzine wrote​:

On Tue\, Jul 17\, 2018 at 11​:02 AM\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

So the real question is\, at the end of an alternation\, should any 'unused' captures within the alternation be flagged as invalid\, or should they preserved - retaining the values they had at the start of the alternation (which may be real values if this is a second+ iteration of an enclosing '*' etc).

Indeed\, and there is certainly room for debate here\, as both arguments are reasonable.

Personally\, invalidating a successful capture that was part of a successful match feels wrong to me. Consider this example​:

 "foobar" =~ /^ \(?&#8203;: \(foo\) | \(bar\) \)\* $/x;

Both with and without my patch\, this leaves $1="foo" and $2="bar". If the "unused" capture were to be invalidated\, this would leave $1 undefined instead. Would this be desirable?

It would match the (admittedly ambiguous) bit in perlre that David Nicol quoted​:

If a group did not match\, the associated backreference won't match
either\. \(This can happen if the group is optional\, or in a different
branch of an alternation\.\)

This says "if a group DID NOT match" -- but in the example above\, the "(foo)" group DID match against the "foo" in "foobar"\, and the "(bar)" group DID match against the "bar" in "foobar"\, so how does this quote apply here?

Yes\, that's what my current patch does\, and there IS a performance issue here\, since I'm currently saving and restoring ALL captures\, whether or not they're in the alternation. This is unnecessary\, but I haven't figured out yet how to set the paren floor correctly to only save the necessary ones. I think that would mitigate most of the performance impact\, though not all of it.

There is considerable performance overhead in saving even just one set of capture indices - the marginal cost of saving more than one is less. So saving fewer is good\, saving none is a *lot* better.

I can totally see that. I have several ideas on possible ways to solve this; do you have any thoughts on the best approach?

The only ones needing saving or invalidating are the ones with indices lying between lastopen+1 .. maxopenparen (I think\, based on a quick look).

Maybe I'm misunderstanding\, but isn't lastopen going to point at the group inside the branch after the first time it gets captured?

My current thinking is to cache the initial value of maxopenparen the first time a BRANCH or TRIE is executed\, and use that as the paren floor for future iterations.

Do you have any suggestions on how to cache that value? I haven't thought of a good solution yet.

It might be worth writing an alternative patch which does just the invalidation\, rather than saving/restoring\, and see what\, if any\, tests fail. Those failures may give more insight into original intent\, and whether saving is worth it.

I suspect that writing the invalidation code might be quite tricky (in the sense that the eventual code will be simple\, but working out what that code should be exactly may be hard).

Invalidating the captures would have less performance impact\, but it would be a significant change to the regex semantics\, since there could be a lot of code out there relying on the captures to remain valid. This bug escaped notice for 24 years\, but invalidating the captures could break working code that would be unaffected by saving and restoring the captures\, as my patch does.

Consider this arbitrary example​:

perl -MData​::Dumper -e '$Data​::Dumper​::Varname=""; print Dumper("135813885" =~ /^(?​:(1)|(2)|(3)|(4)|(5)|(6)|(7)|(8)|(9)|(0)|.)*$/);'

This has separate captures for each ASCII digit which can be used to find out whether or not each digit exists in the string​:

$1 = '1'; $2 = undef; $3 = '3'; $4 = undef; $5 = '5'; $6 = undef; $7 = undef; $8 = '8'; $9 = undef; $10 = undef;

If you invalidate the captures every time the branch iterates\, only the capture for the last digit ($5) would be set here\, and all the rest would be undef. On the other hand\, saving and restoring the captures doesn't change the behavior of this example at all.

I realize that digits are a little simplistic\, but a similar approach could be used to search for specific typographical errors in a block of text​: /^(?​:(\bTHe\b)|(\bteh\b)|.*?)*$/

Would it be better to remember the captures some other way? Perl 6 returns ALL the captures; should Perl 5 have that capability too?

What do you mean exactly?

Perl 6 returns nested Match objects for nested captures and arrays for quantified captures -- it never clobbers one capture with another the way most regex engines do.

For my "afoobar" =~ /((.)foo|bar)*/ example\, during the second iteration\, Perl 5 re-captures "bar" to $1\, clobbering the original value of "afoo" from the first iteration. Perl 6 returns both captures instead​:

"afoobar" ~~ /((.)foo|bar)*/ 「afoobar」 0 => 「afoo」   0 => 「a」 0 => 「bar」

Here is the actual object structure returned in Perl 6​:

("afoobar" ~~ /((.)foo|bar)*/).perl Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ([Match.new(made => Any\, pos => 4\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => (Match.new(made => Any\, pos => 1\, orig => "afoobar"\, hash => Map.new(())\, from => 0\, list => ())\,))\, Match.new(made => Any\, pos => 7\, orig => "afoobar"\, hash => Map.new(())\, from => 4\, list => ())]\,))

This could be approximated with basic Perl 5 data types like this perhaps​:

# Top-level array corresponding to top-level captures. [   # Possibly include the overall regex match ($&) here?   "afoobar"\,

  # Array for the outer group ($1)\, because it is quantified.   [   # Arrays for each iteration because there are also nested captures...

  # Array for captures from the first iteration of the outer group ($1).   [   # Capture of outer group ($1) from first iteration.   "afoo"\,

  # Capture of inner group ($2) from first iteration.   "a"\,   ]\,

  # Array for captures from the second iteration of the outer group ($1).   [   # Capture of outer group ($1) from second iteration.   "bar"\,

  # Capture of inner group ($2) from second iteration\, or omit this perhaps?   undef\,   ]\,

  # No more iterations of the outer group ($1) match successfully.   ]\,

  # No more top-level captures in the regex. ]

This example is for illustrative purposes; such a data structure is subject to bikeshedding\, of course. If there are named captures\, they should probably be in a hash somewhere\, for example. That's not important at the moment.

My basic question was whether or not Perl 5 ought to have this capability to retain the "afoo" match instead of always clobbering quantified captures -- perhaps only when enabled with a flag on the regex match\, for efficiency?

What do you think?

Deven

p5pRT commented 5 years ago

From @deven

On Wed\, Jul 18\, 2018 at 2​:00 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

Yes\, that's one possible way to fix it\, but it might be tricky to handle backtracking with nested alternations.

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:go;2​:o;3​:bar

This is another variation of the same bug\, $1 should be "fo".

When exactly does the assignment happen?

It happens immediately and should be rolled back when backtracking. Right now\, that's not happening when it should here.

Deven

p5pRT commented 5 years ago

From @deven

On Wed\, Jul 18\, 2018 at 2​:00 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

the expression in the capture matching isn't enough; the capture has to match and then the outer expression has to match too. I'd like to think that this can be fixed by rearranging the order of things so the assignment to the capture group is deferred until the branch the capture group is in matches.

Yes\, that's one possible way to fix it\, but it might be tricky to handle backtracking with nested alternations.

"go" clobbers $1 even though "gox" is not [fg]oo\, as a later alternative within this iteration matches $ perl -le '"foobargox" =~ /^ (?​: ([fg]o)(o) | gox | (bar) )* /x; print "1​:$1;2​:$2;3​:$3"' 1​:go;2​:o;3​:bar

This is another variation of the same bug\, $1 should be "fo".

When exactly does the assignment happen?

It happens immediately and should be rolled back when backtracking. Right now\, that's not happening when it should here.

Deven

p5pRT commented 5 years ago

From @davidnicol

On Sun\, Jul 22\, 2018 at 1​:59 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On Wed\, Jul 18\, 2018 at 2​:00 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

When exactly does the assignment happen?

It happens immediately and should be rolled back when backtracking. Right now\, that's not happening when it should here.

how much performance impact would using a linked-list for every capture have\, to facilitate the rollback? Or\, to stay within perl data\, use an array structure​: instead of clobbering\, unshift; roll back by shifting; at the very end assign from [0] elements of all internal capture arrays. Captures syntactically impossible to roll back (are there any?) would be exempt\, backreferences would refer to the current [0] elt of each capture array\, or the head of each capture linked-list.

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @davidnicol

On Sun\, Jul 22\, 2018 at 1​:59 PM Deven T. Corzine \deven@&#8203;ties\.org wrote​:

On Wed\, Jul 18\, 2018 at 2​:00 PM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

When exactly does the assignment happen?

It happens immediately and should be rolled back when backtracking. Right now\, that's not happening when it should here.

how much performance impact would using a linked-list for every capture have\, to facilitate the rollback? Or\, to stay within perl data\, use an array structure​: instead of clobbering\, unshift; roll back by shifting; at the very end assign from [0] elements of all internal capture arrays. Captures syntactically impossible to roll back (are there any?) would be exempt\, backreferences would refer to the current [0] elt of each capture array\, or the head of each capture linked-list.

-- "At this point\, given the limited available data\, certainty about only a very small number of things can be achieved." -- Plato\, and others

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 23\, 2018 at 11​:40 AM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

how much performance impact would using a linked-list for every capture have\, to facilitate the rollback? Or\, to stay within perl data\, use an array structure​: instead of clobbering\, unshift; roll back by shifting; at the very end assign from [0] elements of all internal capture arrays. Captures syntactically impossible to roll back (are there any?) would be exempt\, backreferences would refer to the current [0] elt of each capture array\, or the head of each capture linked-list.

Perl's high-level data structures would be too slow to make sense here. Perl already uses slab allocation for other data structures\, including for the regex engine\, and if this couldn't be done with the existing ones\, a new one could be made. Slab allocation is much faster than a straightforward inked list where you malloc() each node separately.

Regardless\, I've also been wondering if saving captures a different way like this would be better. I'm not sure offhand\, but I believe it's worth looking into...

Deven

p5pRT commented 5 years ago

From @deven

On Mon\, Jul 23\, 2018 at 11​:40 AM\, David Nicol \davidnicol@&#8203;gmail\.com wrote​:

how much performance impact would using a linked-list for every capture have\, to facilitate the rollback? Or\, to stay within perl data\, use an array structure​: instead of clobbering\, unshift; roll back by shifting; at the very end assign from [0] elements of all internal capture arrays. Captures syntactically impossible to roll back (are there any?) would be exempt\, backreferences would refer to the current [0] elt of each capture array\, or the head of each capture linked-list.

Perl's high-level data structures would be too slow to make sense here. Perl already uses slab allocation for other data structures\, including for the regex engine\, and if this couldn't be done with the existing ones\, a new one could be made. Slab allocation is much faster than a straightforward inked list where you malloc() each node separately.

Regardless\, I've also been wondering if saving captures a different way like this would be better. I'm not sure offhand\, but I believe it's worth looking into...

Deven

p5pRT commented 5 years ago

From @deven

It's been quiet for a couple weeks. Has anyone had a chance to take a look at the patch yet?

Deven

p5pRT commented 5 years ago

From @deven

It's been quiet for a couple weeks. Has anyone had a chance to take a look at the patch yet?

Deven

p5pRT commented 5 years ago

From @iabyn

On Sun\, Aug 12\, 2018 at 09​:39​:16PM -0400\, Deven T. Corzine wrote​:

It's been quiet for a couple weeks. Has anyone had a chance to take a look at the patch yet?

(Yves\, I'e CCed you as there's a question for you further down this email.)

Sorry for the late reply. I took the opportunity to try to properly understand this issue (you may remember me mentioning at the beginning how "I hate and fear the capture save and restore code in regmatch()").

I think I now have a much better handle on this. I've just pushed a branch which contains a proposed fix\, along with lots of tests and benchmark entries. It avoids saving and restoring capture indices unless the alternation is wrapped in a repeat\, and only saves indices back to the start of the repeat. This is less inefficient than your original suggestion\, but it still makes alternations with captures within a repeat run at about 50% of their former speed.

I know a way to make it much faster\, but it involves\, for every BRANCH/BRANCHJ/TRIE/TRIEC node\, knowing the current capture index - i.e. in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how this can stored.

Yves\, is there type of node - or a way to extend the current nodes - such that a 32-bit capture index can stored as part of each of these nodes at compile time? Without breaking everything?

Anyway\, I've pushed my current proposed fix as smoke-me/davem/captures\, and here's the commit message​:

commit d0f9ad5d5ba5300a04bba7be2ea8cded41e95666 Author​: David Mitchell \davem@&#8203;iabyn\.com AuthorDate​: Tue Aug 28 09​:15​:04 2018 +0100 Commit​: David Mitchell \davem@&#8203;iabyn\.com CommitDate​: Tue Aug 28 15​:32​:46 2018 +0100

  regex​: restore failed branch capture within repeat  
  RT #133352  
  There are two competing behaviours for what to do with any captures   within a failed alternation branch.  
  Normally the regex engine invalidates any captures created during   execution of the failed branch. For example in something like  
  /(A) (?​: (B)(C) | (D)(E) ) /x  
  perl remembers the current highest capture index at the start of each   branch attempt (1 in this example)\, and if the branch fails it   invalidates all captures above this level (i.e. 2..3 or 2..5 depending   on which branch failed) before trying the next branch (if any).  
  However for repeats\, such as (...)*\, on each new iteration any captures   inside the repeat initially maintain their value from the previous   iteration\, until updated within the current iteration. For example in  
  "ABCD" =~ /(?​: (.)(.) )*/x  
  At the start of the first iteration\, \1 and \2 are invalid.   At the start of the second iteration\, \1 and \2 are 'A'\, 'B'.   During the course of the second iteration\, \1 changes to 'C'\, then   \2 changes to 'D'.  
  This causes a problem when there's an alternation within a repeat.   Taking the first example above and adding a repeat​:  
  /(A) (?​: (B)(C) | (D)(E) )* /x  
  Now on the second iteration\, on entry to the branch the current highest   capture index is no longer 1 (i.e. A)\, but is instead 3 or 5 depending   on what was captured on the previous iteration. Thus on branch failure\,   the 'disable everything above the previous floor' technique no longer   works​: rather than disabling (1+1)..5 say\, it disables (5+1)..5 - i.e.   failed captures are left as-is. This is the essence of the bug in this   ticket.  
  It is not clear what the correct behaviour should be - arguably failed   branch captures could be either invalidated or restored to their value   at the start of the branch. Invalidating is probably cheaper. However\,   there are a couple of tests in t/te/re_tests added many years ago by   Ilya which assume the latter behaviour (they pass because the pattern of   branches and backtracking happen to leave $1 with the correct value;   a slight alteration to the tests and they would have the wrong value).   But in any case the tests aren't expecting $1 to be undefined.  
  This commit fixes the bug by\, at the start of any branch\, saving any   captures beyond the start of the innermost enclosing repeat\, but only if   it *is* within a repeat\, and if there are any valid captures beyond the   start of that repeat. These are restored on branch failure. Otherwise   it just invalidates back to the highest capture index at the start of   the branch\, as before.  
  So for example this is unchanged\, just invalidating indices 2+ on each   branch failure​:  
  /(A) (?​: (B)(C) | (D)(E) ) /x  
  While this​:  
  "ABCD" =~ /(?​: ([AC]) ([BD]) )*/x  
  does invalidation on the first iteration\, and full saving/restore of   capture indices 2+ on subsequent iterations\, while this​:  
  "-AB-CD" =~ /(?​: (-) ([AC]) ([BD]) )*/x  
  saves and restores on *every* iteration\, since the (-) is above the   floor of the curlyx\, so at the start of each branch\, lastparen is always   above cur_curlyx->lastparen.  
  In terms of performance\, the benchmarks included with this commit show   that a plain branch is unaffected; a plain branch with captures is a few   % slower\, and a branch within a repeat is now about half the speed.  
  This fix could be made a lot more efficient if the current paren index   was stored in every BRANCH/BRANCHJ/TRIE/TRIEC node\, and if a pointer to   the current WHILEM node was maintained. Since every iteration of a   CURLYX/WHILEM saves all paren indices above the curlyx floor anyway\, it   wouldn't be necessary to save them again for each branch; just on branch   failure\, restore all saved indices from that last WHILEM which are above   the branch floor\,

-- The warp engines start playing up a bit\, but seem to sort themselves out after a while without any intervention from boy genius Wesley Crusher.   -- Things That Never Happen in "Star Trek" #17

p5pRT commented 5 years ago

From @iabyn

On Sun\, Aug 12\, 2018 at 09​:39​:16PM -0400\, Deven T. Corzine wrote​:

It's been quiet for a couple weeks. Has anyone had a chance to take a look at the patch yet?

(Yves\, I'e CCed you as there's a question for you further down this email.)

Sorry for the late reply. I took the opportunity to try to properly understand this issue (you may remember me mentioning at the beginning how "I hate and fear the capture save and restore code in regmatch()").

I think I now have a much better handle on this. I've just pushed a branch which contains a proposed fix\, along with lots of tests and benchmark entries. It avoids saving and restoring capture indices unless the alternation is wrapped in a repeat\, and only saves indices back to the start of the repeat. This is less inefficient than your original suggestion\, but it still makes alternations with captures within a repeat run at about 50% of their former speed.

I know a way to make it much faster\, but it involves\, for every BRANCH/BRANCHJ/TRIE/TRIEC node\, knowing the current capture index - i.e. in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how this can stored.

Yves\, is there type of node - or a way to extend the current nodes - such that a 32-bit capture index can stored as part of each of these nodes at compile time? Without breaking everything?

Anyway\, I've pushed my current proposed fix as smoke-me/davem/captures\, and here's the commit message​:

commit d0f9ad5d5ba5300a04bba7be2ea8cded41e95666 Author​: David Mitchell \davem@&#8203;iabyn\.com AuthorDate​: Tue Aug 28 09​:15​:04 2018 +0100 Commit​: David Mitchell \davem@&#8203;iabyn\.com CommitDate​: Tue Aug 28 15​:32​:46 2018 +0100

  regex​: restore failed branch capture within repeat  
  RT #133352  
  There are two competing behaviours for what to do with any captures   within a failed alternation branch.  
  Normally the regex engine invalidates any captures created during   execution of the failed branch. For example in something like  
  /(A) (?​: (B)(C) | (D)(E) ) /x  
  perl remembers the current highest capture index at the start of each   branch attempt (1 in this example)\, and if the branch fails it   invalidates all captures above this level (i.e. 2..3 or 2..5 depending   on which branch failed) before trying the next branch (if any).  
  However for repeats\, such as (...)*\, on each new iteration any captures   inside the repeat initially maintain their value from the previous   iteration\, until updated within the current iteration. For example in  
  "ABCD" =~ /(?​: (.)(.) )*/x  
  At the start of the first iteration\, \1 and \2 are invalid.   At the start of the second iteration\, \1 and \2 are 'A'\, 'B'.   During the course of the second iteration\, \1 changes to 'C'\, then   \2 changes to 'D'.  
  This causes a problem when there's an alternation within a repeat.   Taking the first example above and adding a repeat​:  
  /(A) (?​: (B)(C) | (D)(E) )* /x  
  Now on the second iteration\, on entry to the branch the current highest   capture index is no longer 1 (i.e. A)\, but is instead 3 or 5 depending   on what was captured on the previous iteration. Thus on branch failure\,   the 'disable everything above the previous floor' technique no longer   works​: rather than disabling (1+1)..5 say\, it disables (5+1)..5 - i.e.   failed captures are left as-is. This is the essence of the bug in this   ticket.  
  It is not clear what the correct behaviour should be - arguably failed   branch captures could be either invalidated or restored to their value   at the start of the branch. Invalidating is probably cheaper. However\,   there are a couple of tests in t/te/re_tests added many years ago by   Ilya which assume the latter behaviour (they pass because the pattern of   branches and backtracking happen to leave $1 with the correct value;   a slight alteration to the tests and they would have the wrong value).   But in any case the tests aren't expecting $1 to be undefined.  
  This commit fixes the bug by\, at the start of any branch\, saving any   captures beyond the start of the innermost enclosing repeat\, but only if   it *is* within a repeat\, and if there are any valid captures beyond the   start of that repeat. These are restored on branch failure. Otherwise   it just invalidates back to the highest capture index at the start of   the branch\, as before.  
  So for example this is unchanged\, just invalidating indices 2+ on each   branch failure​:  
  /(A) (?​: (B)(C) | (D)(E) ) /x  
  While this​:  
  "ABCD" =~ /(?​: ([AC]) ([BD]) )*/x  
  does invalidation on the first iteration\, and full saving/restore of   capture indices 2+ on subsequent iterations\, while this​:  
  "-AB-CD" =~ /(?​: (-) ([AC]) ([BD]) )*/x  
  saves and restores on *every* iteration\, since the (-) is above the   floor of the curlyx\, so at the start of each branch\, lastparen is always   above cur_curlyx->lastparen.  
  In terms of performance\, the benchmarks included with this commit show   that a plain branch is unaffected; a plain branch with captures is a few   % slower\, and a branch within a repeat is now about half the speed.  
  This fix could be made a lot more efficient if the current paren index   was stored in every BRANCH/BRANCHJ/TRIE/TRIEC node\, and if a pointer to   the current WHILEM node was maintained. Since every iteration of a   CURLYX/WHILEM saves all paren indices above the curlyx floor anyway\, it   wouldn't be necessary to save them again for each branch; just on branch   failure\, restore all saved indices from that last WHILEM which are above   the branch floor\,

-- The warp engines start playing up a bit\, but seem to sort themselves out after a while without any intervention from boy genius Wesley Crusher.   -- Things That Never Happen in "Star Trek" #17

p5pRT commented 5 years ago

From @deven

On Tue\, Aug 28\, 2018 at 10​:47 AM Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Sun\, Aug 12\, 2018 at 09​:39​:16PM -0400\, Deven T. Corzine wrote​:

It's been quiet for a couple weeks. Has anyone had a chance to take a look at the patch yet?

(Yves\, I'e CCed you as there's a question for you further down this email.)

Sorry for the late reply. I took the opportunity to try to properly understand this issue (you may remember me mentioning at the beginning how "I hate and fear the capture save and restore code in regmatch()").

Sorry\, now I'm the one being slow to reply! Yes\, I remember that comment\, which certainly illustrates why so many people told me I was crazy to attempt a patch myself!

I think I now have a much better handle on this. I've just pushed a branch which contains a proposed fix\, along with lots of tests and benchmark entries. It avoids saving and restoring capture indices unless the alternation is wrapped in a repeat\, and only saves indices back to the start of the repeat. This is less inefficient than your original suggestion\, but it still makes alternations with captures within a repeat run at about 50% of their former speed.

I submitted my initial patch at James Keenan's request\, solely as a starting point for discussion. I stated in the accompanying notes that I didn't consider the patch ready to apply yet\, for reasons including the unacceptable performance impact on regexes that aren't even affected by the bug in the first place. I was hoping to find a clean way to set a paren floor (as you've done with your patch)\, to mitigate the performance impact that would otherwise affect every regex with captures inside branches. I'd hate to have a 50% performance impact on common regexes just to fix a bug that's so rarely encountered that it went unnoticed for decades!

My initial patch was only focused on correctness\, not performance. Stipulating that the performance was unacceptably slow\, could you tell me if my original patch was clean and valid from a correctness perspective\, or was I doing something wrong/ugly/unvalid in the patch?

I know a way to make it much faster\, but it involves\, for every

BRANCH/BRANCHJ/TRIE/TRIEC node\, knowing the current capture index - i.e. in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how this can stored.

Yves\, is there type of node - or a way to extend the current nodes - such that a 32-bit capture index can stored as part of each of these nodes at compile time? Without breaking everything?

I know there are only 8 bits to distinguish them\, but perhaps this is a situation where adding new regnode types would be worthwhile? Am I right in thinking that a 32-bit capture index and the WHILEM pointer could be stored that way?

Alternatively\, if "ARG(scan)" or another mechanism could be used to save even an 8-bit or 16-bit index value\, it would probably cover most use cases\, and the max value (255 or 65535) could be used by default if the index doesn't fit. I'm sure the vast majority of regexes have 255 or fewer captures in the first place\, and in the rare case where a regex has thousands of captures and the proper index won't fit\, this would only cause an extra unnecessary performance for those rare cases\, and the correctness of the result should not be affected.

Anyway\, I've pushed my current proposed fix as smoke-me/davem/captures\,

and here's the commit message​:

I'll definitely check out your proposed fix; hopefully I'll get to it sooner than later!

Thanks for looking into this!

Deven

p5pRT commented 5 years ago

From @iabyn

On Tue\, Oct 16\, 2018 at 03​:18​:07PM -0400\, Deven T. Corzine wrote​:

My initial patch was only focused on correctness\, not performance. Stipulating that the performance was unacceptably slow\, could you tell me if my original patch was clean and valid from a correctness perspective\, or was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this\, the details have ebbed from my mind so I can't remember whether your approach was merely slow or had issues.

I know a way to make it much faster\, but it involves\, for every

BRANCH/BRANCHJ/TRIE/TRIEC node\, knowing the current capture index - i.e. in the same way that OPEN nodes have 'n = ARG(scan);' but I don't know how this can stored.

Yves\, is there type of node - or a way to extend the current nodes - such that a 32-bit capture index can stored as part of each of these nodes at compile time? Without breaking everything?

I know there are only 8 bits to distinguish them\, but perhaps this is a situation where adding new regnode types would be worthwhile? Am I right in thinking that a 32-bit capture index and the WHILEM pointer could be stored that way?

The node just needs to store the capture index\, while the runtime code in S_regmatch needs to keep track of the last seen WHILEM node\, presumably in a local var (that may need saving and restoring).

Alternatively\, if "ARG(scan)" or another mechanism could be used to save even an 8-bit or 16-bit index value\, it would probably cover most use cases\, and the max value (255 or 65535) could be used by default if the index doesn't fit. I'm sure the vast majority of regexes have 255 or fewer captures in the first place\, and in the rare case where a regex has thousands of captures and the proper index won't fit\, this would only cause an extra unnecessary performance for those rare cases\, and the correctness of the result should not be affected.

With the current node types used by the various branch ops\, there is no spare space to store even an 8-bit index for the TRIE/TRIEC nodes\, since they make use of the 8-but flags field. (That field appears to be unused in BRANCH/BRANCHJ).

Which is why I asked Yves about the best way to extend such nodes. The naive way would be to use regnode_1 for BRANCH which adds an extra 32-bit field\, and 'upgrade' BRANCHJ from regnode_1 to regnode_2L; But the TRIE nodes are more complex than that and I don't think I could safely add an extra 32-bit field without breaking things.

-- "Strange women lying in ponds distributing swords is no basis for a system of government. Supreme executive power derives from a mandate from the masses\, not from some farcical aquatic ceremony."   -- Dennis\, "Monty Python and the Holy Grail"

p5pRT commented 5 years ago

From @deven

On Wed\, Oct 17\, 2018 at 7​:02 AM Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Oct 16\, 2018 at 03​:18​:07PM -0400\, Deven T. Corzine wrote​:

My initial patch was only focused on correctness\, not performance. Stipulating that the performance was unacceptably slow\, could you tell me if my original patch was clean and valid from a correctness perspective\, or was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this\, the details have ebbed from my mind so I can't remember whether your approach was merely slow or had issues.

I had already spent probably 20 hours or more tracking down this bug while I was at YAPC. After leaving the conference\, I spent at least 4-6 more hours very carefully studying and tracing the code\, including running test cases with debugging output many times\, adding extra debugging output of my own to trace key variables through all the state changes during a regex match of my test case.

After all that research and testing\, I felt confident that I finally understood regcppush()\, REGCP_SET()\, REGCP_UNWIND()\, UNWIND_PARENT() and regcp_restore() pretty well\, so I started developing this patch. I spent about an hour writing the patch and 2-3 hours debugging it\, because I initially overlooked something simple that should have been obvious. I forget what it was\, but it was something dumb that I should have noticed right away when the regression tests failed -- when I finally figured out what I did wrong\, I fixed the code to be what I actually _thought_ that I had typed in the first place! (That's why it took me hours to notice it.)

Ultimately\, the changes were fairly straightforward\, considering. I'll describe the changes I made\, in chronological order. (The hunks in the patch are in a different order.)

First\, I added "CHECKPOINT lastcp;" to the "regmatch_state\,u" union in the "branchlike"\, "branch" and "trie" structs\, just after "CHECKPOINT cp;".

Next\, I used regcppush() to save the captures near the beginning of "case BRANCH"​:

@​@​ -8043\,7 +8042\,10 @​@​ NULL   ST.lastparen = rex->lastparen;   ST.lastcloseparen = rex->lastcloseparen;   ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);

  /* Now go into the branch */   if (has_cutgroup) {

From looking at the existing code\, this appeared the right way to save all the captures. I realized that saving ALL the captures was overkill and very slow\, of course. I was initially only attempting to get the correct captures back from my test case without breaking any regression tests\, and this patch did succeed at doing that. (I was still studying the code to try to figure out a clean way to set a paren floor like CURLYX does.) I didn't see any point in keeping the original REGCP_SET() call here.

I did consider whether "case BRANCHJ" might need any separate changes\, but "case BRANCHJ" falls through to "case BRANCH"\, so this change seems to cover it.

Next\, I used regcp_restore() to restore the saved captures in "case BRANCH_next_fail"​:

@​@​ -8077\,8 +8079\,10 @​@​ NULL   do_cutgroup = 0;   no_final = 0;   } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); +   scan = ST.next_branch;   /* no more branches? */   if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {

From studying the existing code\, regcp_restore() appeared to be the right way to restore the saved captures without discarding them. The original REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the captures from the failed alternation\, but that approach seems to be unworkable for these edge cases involving captures inside repeated alternations. I suspect UNWIND_PAREN() might always have such edge cases\, but I haven't investigated that hunch.

Finally\, I mirrored both of these changes in "case TRIE" and "case TRIE_next_fail"\, for trie-optimized alternations​:

@​@​ -5920\,6 +5920\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog)   HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]);   U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); +   if (scan->flags == EXACTL || scan->flags == EXACTFLU8) {   _CHECK_AND_WARN_PROBLEMATIC_LOCALE;   if (utf8_target @​@​ -6067\,15 +6071\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog)   case TRIE_next_fail​: /* we failed - try next alternative */   {   U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?​:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); +   if (!--ST.accepted) {   DEBUG_EXECUTE_r({   Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\,

I didn't analyze ST.jump as fully as regcppush() and friends\, so I was a bit unsure about removing this ST.jump code\, but the regcp_restore() seemed to make it redundant. Since all the regression tests passed\, I included this change in the patch. Was it correct to remove that code?

Come to think of it\, perhaps I should have also removed this other ST.jump block near the beginning of "case trie_first_try"?

  if ( ST.jump ) {   ST.lastparen = rex->lastparen;   ST.lastcloseparen = rex->lastcloseparen;   REGCP_SET(ST.cp);   }

Looking at it again\, this seems redundant too\, but I didn't think to try removing it when I developed the initial patch.

I'm not sure why the TRIE case was only dealing with the captures when ST.jump is set. Is it possible that captures might need to be restored when ST.jump isn't set?

The node just needs to store the capture index\, while the runtime code

in S_regmatch needs to keep track of the last seen WHILEM node\, presumably in a local var (that may need saving and restoring).

That makes sense. Mind if I take a stab at that\, if I get a chance?

With the current node types used by the various branch ops\, there is no spare space to store even an 8-bit index for the TRIE/TRIEC nodes\, since they make use of the 8-but flags field. (That field appears to be unused in BRANCH/BRANCHJ).

Yeah\, I came to the same conclusions about the flags too.

Which is why I asked Yves about the best way to extend such nodes. The naive way would be to use regnode_1 for BRANCH which adds an extra 32-bit field\, and 'upgrade' BRANCHJ from regnode_1 to regnode_2L; But the TRIE nodes are more complex than that and I don't think I could safely add an extra 32-bit field without breaking things.

This sounds like a viable option. Why do you call it naive?

The TRIE nodes are more complex\, but is there a particular reason you're wary of breaking things by adding a field to it?

If the flags can't be used\, I'm thinking that new regnode types make sense here. That way\, the regex compiler could trigger the save/restore functionality only when needed\, provide the correct paren floor and avoid increasing the size of compiled regexes that don't need to restore captures like this. It would take more work to implement\, of course\, and use up a few of the unused 8-bit regnode slots\, but it seems like it might be the most effective way to optimize this. What do you and Yves think about this idea?

Thanks again for taking the time to work on this bug. (I hope I'm not wasting too much of your time on the conversation!)

Deven

p5pRT commented 5 years ago

From @demerphq

I'm sorry I have missed some of this discussion over the past while...

On Wed\, 17 Oct 2018\, 16​:29 Deven T. Corzine\, \deven@&#8203;ties\.org wrote​:

On Wed\, Oct 17\, 2018 at 7​:02 AM Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Tue\, Oct 16\, 2018 at 03​:18​:07PM -0400\, Deven T. Corzine wrote​:

My initial patch was only focused on correctness\, not performance. Stipulating that the performance was unacceptably slow\, could you tell me if my original patch was clean and valid from a correctness perspective\, or was I doing something wrong/ugly/unvalid in the patch?

Since nearly two months have passed since I worked on this\, the details have ebbed from my mind so I can't remember whether your approach was merely slow or had issues.

I had already spent probably 20 hours or more tracking down this bug while I was at YAPC. After leaving the conference\, I spent at least 4-6 more hours very carefully studying and tracing the code\, including running test cases with debugging output many times\, adding extra debugging output of my own to trace key variables through all the state changes during a regex match of my test case.

After all that research and testing\, I felt confident that I finally understood regcppush()\, REGCP_SET()\, REGCP_UNWIND()\, UNWIND_PARENT() and regcp_restore() pretty well\, so I started developing this patch. I spent about an hour writing the patch and 2-3 hours debugging it\, because I initially overlooked something simple that should have been obvious. I forget what it was\, but it was something dumb that I should have noticed right away when the regression tests failed -- when I finally figured out what I did wrong\, I fixed the code to be what I actually _thought_ that I had typed in the first place! (That's why it took me hours to notice it.)

Ultimately\, the changes were fairly straightforward\, considering. I'll describe the changes I made\, in chronological order. (The hunks in the patch are in a different order.)

First\, I added "CHECKPOINT lastcp;" to the "regmatch_state\,u" union in the "branchlike"\, "branch" and "trie" structs\, just after "CHECKPOINT cp;".

Next\, I used regcppush() to save the captures near the beginning of "case BRANCH"​:

@​@​ -8043\,7 +8042\,10 @​@​ NULL ST.lastparen = rex->lastparen; ST.lastcloseparen = rex->lastcloseparen; ST.next_branch = next; - REGCP_SET(ST.cp); + + /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp);

  /\* Now go into the branch \*/
  if \(has\_cutgroup\) \{

From looking at the existing code\, this appeared the right way to save all the captures. I realized that saving ALL the captures was overkill and very slow\, of course. I was initially only attempting to get the correct captures back from my test case without breaking any regression tests\, and this patch did succeed at doing that. (I was still studying the code to try to figure out a clean way to set a paren floor like CURLYX does.) I didn't see any point in keeping the original REGCP_SET() call here.

I did consider whether "case BRANCHJ" might need any separate changes\, but "case BRANCHJ" falls through to "case BRANCH"\, so this change seems to cover it.

Next\, I used regcp_restore() to restore the saved captures in "case BRANCH_next_fail"​:

@​@​ -8077\,8 +8079\,10 @​@​ NULL do_cutgroup = 0; no_final = 0; } - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + scan = ST.next_branch; /* no more branches? */ if (!scan || (OP(scan) != BRANCH && OP(scan) != BRANCHJ)) {

From studying the existing code\, regcp_restore() appeared to be the right way to restore the saved captures without discarding them. The original REGCP_UNWIND() and UNWIND_PAREN() calls were clearly intended to undo the captures from the failed alternation\, but that approach seems to be unworkable for these edge cases involving captures inside repeated alternations. I suspect UNWIND_PAREN() might always have such edge cases\, but I haven't investigated that hunch.

Finally\, I mirrored both of these changes in "case TRIE" and "case TRIE_next_fail"\, for trie-optimized alternations​:

@​@​ -5920\,6 +5920\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) HV * widecharmap = MUTABLE_HV(rexi->data->data[ ARG( scan ) + 1 ]); U32 state = trie->startstate;

+ /* save state of captures at branch start */ + ST.cp = regcppush(rex\, 0\, maxopenparen); + REGCP_SET(ST.lastcp); + if (scan->flags == EXACTL || scan->flags == EXACTFLU8) { _CHECK_AND_WARN_PROBLEMATIC_LOCALE; if (utf8_target @​@​ -6067\,15 +6071\,10 @​@​ S_regmatch(pTHX_ regmatch_info *reginfo\, char *startpos\, regnode *prog) case TRIE_next_fail​: /* we failed - try next alternative */ { U8 *uc; - if ( ST.jump ) { - /* undo any captures done in the tail part of a branch\, - * e.g. - * /(?​:X(.)(.)|Y(.)).../ - * where the trie just matches X then calls out to do the - * rest of the branch */ - REGCP_UNWIND(ST.cp); - UNWIND_PAREN(ST.lastparen\, ST.lastcloseparen); - } + + /* restore state of captures from branch start */ + regcp_restore(rex\, ST.lastcp\, &maxopenparen); + if (!--ST.accepted) { DEBUG_EXECUTE_r({ Perl_re_exec_indentf( aTHX_ "%sTRIE failed...%s\n"\,

I didn't analyze ST.jump as fully as regcppush() and friends\, so I was a bit unsure about removing this ST.jump code\, but the regcp_restore() seemed to make it redundant. Since all the regression tests passed\, I included this change in the patch. Was it correct to remove that code?

Come to think of it\, perhaps I should have also removed this other ST.jump block near the beginning of "case trie_first_try"?

        if \( ST\.jump \) \{
            ST\.lastparen = rex\->lastparen;
            ST\.lastcloseparen = rex\->lastcloseparen;
            REGCP\_SET\(ST\.cp\);
        \}

Looking at it again\, this seems redundant too\, but I didn't think to try removing it when I developed the initial patch.

I'm not sure why the TRIE case was only dealing with the captures when ST.jump is set. Is it possible that captures might need to be restored when ST.jump isn't set?

A non jump trie node is a Boolean test\, meaning it shouldn't touch any capture buffers (except maybe if it's successful?) Whereas a jump trie is a Boolean test followed by arbitrary pattern\, so it might change the capture buffer state.

Consider

/(Foo(baz)|wh(oops))(zop)/

Versus

/(Foobaz|whoops)(zop)/

The first should be a jump trie\, the second a non-jump true.

The node just needs to store the capture index\, while the runtime code

in S_regmatch needs to keep track of the last seen WHILEM node\, presumably in a local var (that may need saving and restoring).

That makes sense. Mind if I take a stab at that\, if I get a chance?

With the current node types used by the various branch ops\, there is no spare space to store even an 8-bit index for the TRIE/TRIEC nodes\, since they make use of the 8-but flags field. (That field appears to be unused in BRANCH/BRANCHJ).

Yeah\, I came to the same conclusions about the flags too.

Which is why I asked Yves about the best way to extend such nodes. The naive way would be to use regnode_1 for BRANCH which adds an extra 32-bit field\, and 'upgrade' BRANCHJ from regnode_1 to regnode_2L; But the TRIE nodes are more complex than that and I don't think I could safely add an extra 32-bit field without breaking things.

This sounds like a viable option. Why do you call it naive?

The TRIE nodes are more complex\, but is there a particular reason you're wary of breaking things by adding a field to it?

I imagine because it is a product of peephole optimisation and thus is sometimes written directly over the data it replaces. Trie nodes actually have several forms and imagine changing the size could be troublesome but it should be possible.

If the flags can't be used\, I'm thinking that new regnode types make sense here. That way\, the regex compiler could trigger the save/restore functionality only when needed\, provide the correct paren floor and avoid increasing the size of compiled regexes that don't need to restore captures like this. It would take more work to implement\, of course\, and use up a few of the unused 8-bit regnode slots\, but it seems like it might be the most effective way to optimize this. What do you and Yves think about this idea?

I will try to find time to think about this but make no promises.

Cheers Yves

Thanks again for taking the time to work on this bug. (I hope I'm not wasting too much of your time on the conversation!)

Deven

Its not a waste of time at all\,\, it's just hard to find time to complement your efforts

p5pRT commented 5 years ago

From @khwilliamson

On 10/20/18 8​:04 AM\, demerphq wrote​:

If the flags can't be used\, I'm thinking that new regnode types make
sense here\.  That way\, the regex compiler could trigger the
save/restore functionality only when needed\, provide the correct
paren floor and avoid increasing the size of compiled regexes that
don't need to restore captures like this\.  It would take more work
to implement\, of course\, and use up a few of the unused 8\-bit
regnode slots\, but it seems like it might be the most effective way
to optimize this\.  What do you and Yves think about this idea?

I don't know much about this particular code area\, but I can say that we have enough regnode slots available that we don't need to worry about conserving them.

I will try to find time to think about this but make no promises.

Cheers Yves

Thanks again for taking the time to work on this bug\.  \(I hope I'm
not wasting too much of your time on the conversation\!\)

Deven

Its not a waste of time at all\,\, it's just hard to find time to complement your efforts

And\, it's good that you've learned about this code. Perhaps you'll consider tackling something else at some point. And maybe you could contribute some of the knowledge you've gain to commenting the source to make it easier for others who come along later.

deven commented 2 years ago

I haven't worked on this in a long time, but I stumbled across the following archive:

http://mirrors.develooper.com/perl/really-ancient-perls/oldperl/dist/fun.tar.gz

Inside this archive, there are two nested archives containing ancient versions of Perl:

-rw-r--r-- jhi/ftp     1000058 1994-08-26 13:21 src/5.0/5.000alpha/perl5.000a12h.tar.gz
-rwxr-xr-x jhi/ftp      924479 1994-10-07 17:58 src/5.0/5.000beta/perl5.000b3h.tar.gz

These particular versions are significant because the main perl5 repository unfortunately had to jump straight from perl-5.000alpha9 to perl-5.000 in a massive single commit, skipping the 31 intervening versions of Perl listed on the perlhist page between 5.000alpha9 and the final 5.000 release (which were unavailable at the time), and the two versions above are both in that range of missing versions.

I've added commits to my clone of the perl5 repository for both of these two versions:

I also ran the following command on my local repository, to graft a replacement commit for the original perl-5.000 commit, but it appears to be impossible to actually make a pull request for replacement graft commits like this.

git replace --graft a0d0e21ea6ea90a22318550944fe6cb09ae10cda 8be46f3624cc1174b59b786aea4676c88c443dec

Since this issue began somewhere in this range of 31 missing versions, I built perl-5.000alpha9 and reconfirmed that it does not exhibit the issue:

$ perl-5.000alpha9/perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
a

After this, I built perl-5.000a12h and learned that it does exhibit the issue:

$ perl-5.000alpha12h/perl -e 'print "$2\n" if "afoobar" =~ /^((.)foo|bar)*$/;'
b

This narrows the range where the issue began down to one of the 15 versions from 5.000alpha10 through 5.000alpha12h...