Repeated alternating named captures no longer capture reliably under perl v5.38.0

This is a bug report for perl from damian@conway.org, generated with the help of perlbug 1.43 running under perl 5.38.0.

Description

When two named captures in two distinct |-branches are sequentially matched inside a set of repeated parens the captures sometimes do not correctly populate the %+ hash, depending on the order in which they are matched.

Steps to Reproduce

    my $regex = qr{
        (?:
            (?<x> X )
        |   (?<y> Y )
        )+
    }xms;

    use Test::More;

    "XY" =~ $regex;
    is_deeply \%+, { x => 'X', y => 'Y' } => 'XY';

    "YX" =~ $regex;
    is_deeply \%+, { x => 'X', y => 'Y' } => 'YX';

    done_testing();

Expected behavior

Under perls v5.10 through v5.36, both matches correctly populate the %+ hash with both captures.

Under perl v5.38.0, only the second match correctly populates %+ with both captures. The first match throws away the first capture (i.e. of the 'X').

The same change in behaviour from v5.36 to v5.38 appears when named captures are replaced by numbered captures. In that case, the @{^CAPTURE} variable is not correctly populated under v5.38.0

This change breaks an idiom I frequently use to parse and label sequences of distinct components.

Flags

category=core

severity=high

Perl configuration


Site configuration information for perl 5.38.0:

Configured by damian at Tue Jul 4 16:52:21 AEST 2023.

Summary of my perl5 (revision 5 version 38 subversion 0) configuration:

Platform: osname=darwin osvers=21.6.0 archname=darwin-2level uname='darwin cassandra.local 21.6.0 darwin kernel version 21.6.0: thu jun 8 23:56:13 pdt 2023; root:xnu-8020.240.18.701.6~1release_arm64_t6000 arm64 ' config_args='-de -Dprefix=/Users/damian/perl5/perlbrew/perls/perl-5.38.0 -Aeval:scriptdir=/Users/damian/perl5/perlbrew/perls/perl-5.38.0/bin' hint=recommended useposix=true d_sigaction=define useithreads=undef usemultiplicity=undef use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define Compiler: cc='cc' ccflags ='-fno-common -DPERL_DARWIN -mmacosx-version-min=12.6 -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong' optimize='-O3' cppflags='-fno-common -DPERL_DARWIN -mmacosx-version-min=12.6 -DNO_POSIX_2008_LOCALE -fno-strict-aliasing -pipe -fstack-protector-strong' ccversion='' gccversion='Apple LLVM 14.0.0 (clang-1400.0.29.202)' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=8 longdblkind=0 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='cc' ldflags =' -mmacosx-version-min=12.6 -fstack-protector-strong' libpth=/Library/Developer/CommandLineTools/usr/lib/clang/14.0.0/lib /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib /Library/Developer/CommandLineTools/usr/lib /usr/lib libs= perllibs= libc= so=dylib useshrplib=false libperl=libperl.a gnulibc_version='' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=bundle d_dlsymun=undef ccdlflags=' ' cccdlflags=' ' lddlflags=' -mmacosx-version-min=12.6 -bundle -undefined dynamic_lookup -fstack-protector-strong'

@INC for perl 5.38.0: ./lib /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5/darwin-thread-multi-2level/ /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5/darwin-2level/ /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5/darwin-2level /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5 /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/5.38.0/darwin-2level /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/5.38.0 /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/site_perl/5.38.0/darwin-2level /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/site_perl/5.38.0 /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/5.38.0/darwin-2level /Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/5.38.0

Environment for perl 5.38.0: DYLD_LIBRARY_PATH (unset) HOME=/Users/damian LANG=en_AU.UTF-8 LANGUAGE (unset) LC_ALL=en_AU.UTF-8 LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/Users/damian/perl5/perlbrew/bin:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:.:/usr/local/opt/rakudo-star/bin:/Applications/Rakudo/bin:/Applications/Rakudo/share/perl6/site/bin:/Users/damian/bin:/Users/damian/bin/macintosh:/Users/damian/perl5/bin:/opt/local/bin:/opt/local/sbin:/usr/local/bin:/bin:/usr/bsd:/usr/bin:/usr/sbin:/usr/local/sbin:/usr/local/git/bin:/Applications/Graphviz.app/Contents/MacOS:/Applications/MacGhostView.app/Contents/Resources/bin:/Applications/MacGhostView_old.app/Contents/Resources/bin:/opt/homebrew/bin:/opt/homebrew/Cellar/rakudo-star/2022.06/share/perl6/site/bin PERL5LIB=./lib:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5/darwin-thread-multi-2level/:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5/darwin-2level/:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib/perl5:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/lib PERL6LIB=./lib6:/Users/damian/lib/perl6 PERLBREW_HOME=/Users/damian/.perlbrew PERLBREW_MANPATH=/Users/damian/perl5/perlbrew/perls/perl-5.38.0/man PERLBREW_PATH=/Users/damian/perl5/perlbrew/bin:/Users/damian/perl5/perlbrew/perls/perl-5.38.0/bin PERLBREW_PERL=perl-5.38.0 PERLBREW_ROOT=/Users/damian/perl5/perlbrew PERLBREW_SHELLRC_VERSION=0.98 PERLBREW_VERSION=0.98 PERL_BADLANG (unset) PERL_CPANM_OPT=-M https://cpan.metacpan.org/ PERL_MB_OPT=--install_base /Users/damian/perl5/perlbrew/perls/perl-5.38.0/ PERL_MM_OPT=INSTALL_BASE=/Users/damian/perl5/perlbrew/perls/perl-5.38.0/ SHELL=/bin/tcsh

Bisect points to commit acababb42be12ff2986b73c1bfa963b70bb5d54e:

good - zero exit from ./perl -Ilib -w -e "XY" =~ /(?:(?<x>X)|(?<y>Y))+/ or die; $+{x} eq "X" or die
acababb42be12ff2986b73c1bfa963b70bb5d54e is the first bad commit
commit acababb42be12ff2986b73c1bfa963b70bb5d54e
Author: Yves Orton <demerphq@gmail.com>
Date:   Mon Jan 9 22:34:13 2023 +0100

    regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

    In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
    the same time. When a branch fails it should reset any capture buffers
    that might be touched by its branch.

    We change BRANCH and BRANCHJ to store the number of parens before the
    branch, and the number of parens after the branch was completed. When
    a BRANCH operation fails, we clear the buffers it contains before we
    continue on.

    It is a bit more complex than it should be because we have BRANCHJ
    and BRANCH. (One of these days we should merge them together.)

    This is also made somewhat more complex because TRIE nodes are actually
    branches, and may need to track capture buffers also, at two levels.
    The overall TRIE op, and for jump tries especially where we emulate
    the behavior of branches. So we have to do the same clearing logic if
    a trie branch fails as well.

This looks like an intentional change, but not one I'd agree with intuitively. In "aba" =~ /((a)(b)|(a))+/, failure to match the first branch should not reset captures from a previous iteration.

failure to match the first branch should not reset captures from a previous iteration.

Agreed. The fundamental idea (i.e. that 'a' =~ /((a)(b)|(a))+/ should not set $2 is okay. But the resetting of capture variables should not reset successful captures from previous iterations. That breaks a highly useful and entirely reasonable behaviour.

Instead, when iterating a repeated alteration, all of the involved capture variables should have their values cached on entry, and then on exit have their values restored from cache (i.e. instead of simply reset to undef) unless those capture variables were changed in the successful branch.

Bisect points to commit acababb:

good - zero exit from ./perl -Ilib -w -e "XY" =~ /(?:(?<x>X)|(?<y>Y))+/ or die; $+{x} eq "X" or die
acababb42be12ff2986b73c1bfa963b70bb5d54e is the first bad commit
commit acababb42be12ff2986b73c1bfa963b70bb5d54e
Author: Yves Orton <demerphq@gmail.com>
Date:   Mon Jan 9 22:34:13 2023 +0100

    regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers

    In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
    the same time. When a branch fails it should reset any capture buffers
    that might be touched by its branch.

    We change BRANCH and BRANCHJ to store the number of parens before the
    branch, and the number of parens after the branch was completed. When
    a BRANCH operation fails, we clear the buffers it contains before we
    continue on.

    It is a bit more complex than it should be because we have BRANCHJ
    and BRANCH. (One of these days we should merge them together.)

    This is also made somewhat more complex because TRIE nodes are actually
    branches, and may need to track capture buffers also, at two levels.
    The overall TRIE op, and for jump tries especially where we emulate
    the behavior of branches. So we have to do the same clearing logic if
    a trie branch fails as well.

Bisection confirmed. @demerphq, can you take a look?

Any fix should be backported to perl-5.38.1.

I will take a look, but I don't consider it a bug, I consider it a bug fix. There are several plausible options for what the state of capture buffers contained in an alternation insider of a quantifier could be after a match, and unfortunately perl was inconsistent in what it did. In order to fix this consistency I had to pick one of the possible options, and I chose this way because I believe it allows more functionality and power

But the resetting of capture variables should not reset successful captures from previous iterations.

For me this being a "should" is not obvious, there are two perspectives here, 1. capture buffers inside of quantified groups should "accumulate" over iterations, or 2. capture buffers should be exclusionary, with only the buffers set in the matching alternation being set at the end of the alternation.

At least some people (me included obviously) consider it bizarre that after:

"abcaba"=~/(?:(a)(b)(c)|(a)(b)|(a))+/

that $1-$6 are all defined, many people consider it more reasonable that after an iteration only the captures that actually matched are set, and the rest are unset. Consider code like:

"abcabay"=~/^(?:(a)(b)(c)|(a)(b)|(a))+(?(1)x|y)$/

I think that the most logical conclusion is that this should match as should "abay", "aby", "ay". In fact the only time we should match 'X' after the group is when "abc" was matched in the last iteration.

The "accumulator" behavior seems to be less helpful, over many iterations of the quantifier every alternation it contains might end up populated, with bits of string from different iterations, which IMO isn't very helpful. With the current behave the contents of capture buffer in the quantifier are consistent, the show the state from the last iteration of the quantifier.

That breaks a highly useful and entirely reasonable behaviour.

Examples of what you have in mind would be helpful to understanding your position here.

Instead, when iterating a repeated alteration, all of the involved capture variables should have their values cached on entry, and then on exit have their values restored from cache (i.e. instead of simply reset to undef) unless those capture variables were changed in the successful branch.

I think at least some people find it weird that

"a"=~/((a)(b)(c)|(a)(b)|(a))/

would leave only $6 set, but, "abcaba" would leave $1, $2, $3 set.

FWIW, notice the inconsistencies in the following:

$ cat t.pl
for ("a","ab","aba","abca","abcaba") {
    printf "%-10s -> %s-%s-%s-%s-%s-%s\n",
        $_, $1||"1",$2||"2",$3||"3",$4||"4",$5||"5",$6||"6"
        if /(?:(a)(b)(c)|(a)(b)|(a))+/;
}

$ perl t.pl
a          -> 1-2-3-4-5-a
ab         -> 1-2-3-a-b-6
aba        -> a-2-3-a-b-a
abca       -> a-b-c-4-5-a
abcaba     -> a-b-c-a-b-a

I think it makes most logical sense to say "the state of the capture buffers after an iteration of a quantifier around them reflect the most recent iteration of that quantifier", and I think that doing so leads to the least contradictions and the least constraints on how we can optimize the regex engine.

I think that if capture buffers are to accumulate then after

"a"=~/(a)(b)(c)|(a)(b)|(a)/

all of $1, $4, and $6 should be set to "a". I think there is a logical contradiction if we say that this should leave $6 set, but that "aba" should leave $4,$5 and $6 set, but not leave $1 and $2 set. It is also more difficult to implement.

Another point where this becomes difficult:

" aabbaa"=~/^(aa(bb)?)+/

What should $1 and $2 be here? I think most people would expect that $1 = "AA" and $2 = "" (or undef). But at the same time by definition X? is equivalent to (?:X|), so (bb)? should be equivalent to (?:(bb)|), but if I understand your proposed rules right they would require $2 to "bb" here...

So I think the new behavior is the only behavior that is completely logically consistent with other desired semantics.

I'm happy to provide a real-world example.

In fact here is (a very simplified version of) the real-world example I had been actually using, and which broke under v5.38, leading to this issue report.

Consider the following code for parsing subroutine declarations:

    my $DECLARATION = qr{

            (?<keyword>   sub           )  \s+
            (?<subname>   [^\W\d]\w*    )  \s+

        (?: (?<common>    :common   \b  )  \s+
          | (?<override>  :override \b  )  \s+
        )+

            (?<params>    \( .* \)      )  \s*
            (?<body>      \{ .* \}      )

    }xms;

    my $code = 'sub foo :override :common ($param) { body() }';

    if ($code =~ $DECLARATION) {
        my %matched = %+;

        if ($matched{common})   {...}

        if ($matched{override}) {...}

        # etc.
    }

When the regex matches, the resulting %+ contains:

    (
        keyword  => 'sub',
        subname  => 'foo',
        common   => ':common',
        override => ':override',
        params   => '($param)',
        body     => '{ body() }',
    )

...which usefully extracts and labels all the information I need.

However, under v5.38, if the code string had been instead:

    my $code = 'sub foo :common :override ($param) { body() }';

...then %+ would contain only:

    (
        keyword  => 'sub',
        subname  => 'foo',
        override => ':override',
        params   => '($param)',
        body     => '{ body() }',
    )

...which does not correctly extract all the information from the code string (it “loses” the common => ':common' entry)...and hence leads to incorrect behaviour in my parser.

Thus, the technique of detecting and extracting multiple alternatives by name no longer works from v5.38, and with this change in regex behaviour a useful parsing technique – one that is actually used in the real-world – has been lost.

In fairness, I should mention that @demerphq was kind enough to contact me before making this change, to discuss the proposal and ask if I could see any adverse consequences it might have. It is entirely my own fault that I did not recognize this highly adverse consequence at that time.

If I have to live with it hereafter, then I only have myself to blame.

However, it is hard to see how the new behaviour is any less confusing than the old. Certainly it took me some considerable time to work out why my regex had "broken" under v5.38. It's very hard to accept that the failed match of a branch really should invalidate a previous successful match-and-capture by that branch. From a regex-user's perspective, it's hard to accept that ((a)|(b))+ should correctly parse-and-extract 'ba', but not correctly parse-and-extract 'ab'. I understand why it does; but it feels more brittle and error-prone than the old behaviour. At least, that's how it feels for alternations where each branch is mutually exclusive.

I don't believe that the new behaviour really is any more consistent than the old behaviour (because no behaviour could truly be consistent, given the vast range of use-cases and user expectations here).

On the other hand, the old behaviour was far more useful (to me, at least) than the new behaviour is.

For what it's worth, my original expectation (from way back) was that we would retain only those results from the last successful match, so that ((a)|(b))+ would never return captures of both a and b. While I accept that horse has long bolted and burnt the stable door behind it, it still feels to me like one of the few approaches for which there would be a hope of consistency.

I wonder how hard it would be to have a new scoped mode that would capture an array's worth of matches from each successful iteration, such that "badabbc" =~ /(?:(a)|(b))*bc/ would capture ["abbc", ["a",undef], [undef, "b"]] (perhaps in some new @^ALL_CAPTURE, %^ALL_CAPTURE if it is too difficult to shoehorn into existing variables). That would make a whole lot of things possible.

would never return captures of both a and b.

That is how 5.38 works.

While I accept that horse has long bolted and burnt the stable door behind it,

IMO there is enough inconsistency in how these types of pattern behave that the stable door is still available to us. IIRC this ticket discusses behavior that changes depending on whether we use CURLYX or CURLYM. See 05b13cf680588a26de64f13d2b3be385e17624bc

it still feels to me like one of the few approaches for which there would be a hope of consistency.

Agreed. As far as making the different parts of the regex engine consistent with each other I think this is the only option. As far as consistency with peoples expectations however.... :-( It seems there are people who expect both semantics.

I wonder how hard it would be to ... [capture into array structures]

I suspect it is annoyingly difficult, as we would have to unroll these updates when we backtrack. FWIW, I fixed some gnarly backtracking bugs with logic related to the one that is under discussion in this ticket. Its been long enough since I worked on this I forget the details, but the patch sequence that included the fix under discussion here in this thread also included fixes for some gnarly backtracking issues. See 59db194299c94c6707095797c3df0e2f67ff82b2 and 38508ce8fc3a1bd12a3bb65e9d4ceb9b396a18db (the latter is more relevant)

@thoughtstream: I will reply to your post soon, but i want to give it more attention than I have time at the moment.

((a)|(b))+ would never return captures of both a and b.

That is how 5.38 works.

This doesn't seem to be true.

$ perl -E' "ba" =~ /((a)|(b))+/; say "$2 $3"; '
a b

((a)|(b))+ would never return captures of both a and b.

That is how 5.38 works.

This doesn't seem to be true.

HRMPH. I consider that to be a bug. The intention of my changes in acababb42be12ff2986b73c1bfa963b70bb5d54e and related patches were to make it so that after an alternation capture buffers from exactly one branch would be defined. There was a lot of churn on that patch sequence as I tried to get it reviewed, and blah blah, and it seems some bugs slipped through. Very very annoying.

I wonder how hard it would be to ... [capture into array structures]

I suspect it is annoyingly difficult, as we would have to unroll these updates when we backtrack.

Well we already do that for $1 et al, so it should be just a question of hooking into the appropriate functions/macros. The annoyingly difficult part would presumably be managing the bookkeeping correctly to avoid leaks (or crashes) - being able to store offsets rather than full SVs might make that at least a little easier.

However I'd also see this as an edge case that doesn't necessarily need to be super-optimized so can afford to be written very simply and pedantically to focus on correctness, as long as checking whether we're in this mode is fast so that it doesn't slow down the standard path.

Perl / perl5

Repeated alternating named captures no longer capture reliably under perl v5.38.0 #21563

severity=high