Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.98k stars 559 forks source link

(*COMMIT) behaviour when inside a repeated group #17074

Open p5pRT opened 5 years ago

p5pRT commented 5 years ago

Migrated from rt.perl.org#134254 (status was 'open')

Searchable as RT134254$

p5pRT commented 5 years ago

From ph10@hermes.cam.ac.uk

Created by ph10@cam.ac.uk

Please see the following example​:

$ perl -e 'if (abcd =~ /\A(?​:.(*COMMIT))*c/) { print "yes >$&\<\n"; } else { print "no \n"; }' yes >abc\<

Why does that pattern match? The repeated group should swallow all of "abcd"\, then fail to match "c" and so backtrack onto (*COMMIT)\, which should fail the whole match\, shouldn't it?

Philip

Perl Info ``` Flags: category=core severity=low Site configuration information for perl 5.30.0: Configured by builduser at Tue Jun 18 19:57:11 CEST 2019. Summary of my perl5 (revision 5 version 30 subversion 0) configuration: Platform: osname=linux osvers=5.1.8-arch1-1-arch archname=x86_64-linux-thread-multi uname='linux flo-64 5.1.8-arch1-1-arch #1 smp preempt sun jun 9 20:28:28 utc 2019 x86_64 gnulinux ' config_args='-des -Dusethreads -Duseshrplib -Doptimize=-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt -Dprefix=/usr -Dvendorprefix=/usr -Dprivlib=/usr/share/perl5/core_perl -Darchlib=/usr/lib/perl5/5.30/core_perl -Dsitelib=/usr/share/perl5/site_perl -Dsitearch=/usr/lib/perl5/5.30/site_perl -Dvendorlib=/usr/share/perl5/vendor_perl -Dvendorarch=/usr/lib/perl5/5.30/vendor_perl -Dscriptdir=/usr/bin/core_perl -Dsitescript=/usr/bin/site_perl -Dvendorscript=/usr/bin/vendor_perl -Dinc_version_list=none -Dman1ext=1perl -Dman3ext=3perl -Dcccdlflags='-fPIC' -Dlddlflags=-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -Dldflags=-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now' hint=recommended useposix=true d_sigaction=define useithreads=define usemultiplicity=define use64bitint=define use64bitall=define uselongdouble=undef usemymalloc=n default_inc_excludes_dot=define bincompat5005=undef Compiler: cc='cc' ccflags ='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_FORTIFY_SOURCE=2' optimize='-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector-strong -fno-plt' cppflags='-D_REENTRANT -D_GNU_SOURCE -fwrapv -fno-strict-aliasing -pipe -fstack-protector-strong -I/usr/local/include' ccversion='' gccversion='9.1.0' gccosandvers='' intsize=4 longsize=8 ptrsize=8 doublesize=8 byteorder=12345678 doublekind=3 d_longlong=define longlongsize=8 d_longdbl=define longdblsize=16 longdblkind=3 ivtype='long' ivsize=8 nvtype='double' nvsize=8 Off_t='off_t' lseeksize=8 alignbytes=8 prototype=define Linker and Libraries: ld='cc' ldflags ='-Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -fstack-protector-strong -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/include-fixed /usr/lib /lib/../lib /usr/lib/../lib /lib /lib64 /usr/lib64 libs=-lpthread -lgdbm -ldb -ldl -lm -lcrypt -lutil -lc -lgdbm_compat perllibs=-lpthread -ldl -lm -lcrypt -lutil -lc libc=libc-2.29.so so=so useshrplib=true libperl=libperl.so gnulibc_version='2.29' Dynamic Linking: dlsrc=dl_dlopen.xs dlext=so d_dlsymun=undef ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.30/core_perl/CORE' cccdlflags='-fPIC' lddlflags='-shared -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/local/lib -fstack-protector-strong' @INC for perl 5.30.0: /usr/lib/perl5/5.30/site_perl /usr/share/perl5/site_perl /usr/lib/perl5/5.30/vendor_perl /usr/share/perl5/vendor_perl /usr/lib/perl5/5.30/core_perl /usr/share/perl5/core_perl Environment for perl 5.30.0: HOME=/home/ph10 LANG=en_GB.utf8 LANGUAGE=en_GB.utf8 LC_ALL=C LC_COLLATE=C LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/ph10/bin:/usr/local/sbin:/usr/local/bin:/usr/bin:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl:/usr/sbin:.:/opt/android-sdk/platform-tools:/opt/android-sdk/tools:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 5 years ago

From @demerphq

On Tue\, 2 Jul 2019 at 11​:17\, Philip Hazel (via RT) \perlbug\-followup@&#8203;perl\.org wrote​:

# New Ticket Created by Philip Hazel # Please include the string​: [perl #134254] # in the subject line of all future correspondence about this issue. # \<URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=134254 >

To​: perlbug@​perl.org Reply-To​: ph10@​cam.ac.uk Subject​: (*COMMIT) behaviour when inside a repeated group From​: ph10@​cam.ac.uk Cc​: builduser Message-Id​: \5\.30\.0\_5771\_1562058560@&#8203;quercite

This is a bug report for perl from ph10@​cam.ac.uk\, generated with the help of perlbug 1.41 running under perl 5.30.0.

----------------------------------------------------------------- [Please describe your issue here]

Please see the following example​:

$ perl -e 'if (abcd =~ /\A(?​:.(*COMMIT))*c/) { print "yes >$&\<\n"; } else { print "no \n"; }' yes >abc\<

Why does that pattern match? The repeated group should swallow all of "abcd"\, then fail to match "c" and so backtrack onto (*COMMIT)\, which should fail the whole match\, shouldn't it?

Well I have to admit its a bit murky\, but I think the match should succeed. (*COMMIT) relates to how perl moves the start point on failure. This is an anchored match\, so it effectively has a (*COMMIT) at the very start anyway\, and the match does not fail at the given start point\, it succeeds. I think you are confusing atomic matches and (*COMMIT). Change that to /\A(?​:.(*COMMIT))*+c/ and it behaves as you say.

From the docs​:   It's a zero-width pattern similar to "(*SKIP)"\, except that when   backtracked into on failure it causes the match to fail outright.   No further attempts to find a valid match by advancing the start   pointer will occur again.

In this case technically we never backtracked into the COMMIT\, and we certainly did not attempt to advance the start pointer. So I don't think the commit should have fired. But I admit the other interpretation is not unreasonable. We have never really specified how quantifiers and verbs should interact\, and I could go along with an interpretation that says we are doing it wrong. Might need Larry to rule on it tho\, this was copied from Perl6.

Yves

p5pRT commented 5 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 5 years ago

From ph10@hermes.cam.ac.uk

On Tue\, 2 Jul 2019\, yves orton via RT wrote​:

Well I have to admit its a bit murky\, but I think the match should succeed. (*COMMIT) relates to how perl moves the start point on failure. This is an anchored match\, so it effectively has a (*COMMIT) at the very start anyway\, and the match does not fail at the given start point\, it succeeds.

Hmm. I can now see that there is an alternative interpretation of (*COMMIT)\, which is "If (*COMMIT) is passed during a match attempt\, carry on as normal\, but if the overall match fails\, do not advance the start pointer." However\, under that interpretation\, this match should succeed​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&\<\n"; } else { print "no \n"; }' no

Incidentally\, the \A at the start of my previous example is not relevant. Leaving it out makes no difference. So the question is why are these two different?

/.*(*COMMIT)c/ does not match "abcd" /(?​:.(*COMMIT))*c/ does match "abcd" (matches "abc")

Under your interpretation\, shouldn't the first one match? After all\, it can find a match without advancing the starting point.

From the docs​: It's a zero-width pattern similar to "(*SKIP)"\, except that when backtracked into on failure it causes the match to fail outright. No further attempts to find a valid match by advancing the start pointer will occur again.

In this case technically we never backtracked into the COMMIT\,

As an outsider looking at Perl\, I don't understand that. I thought that passing (*COMMIT) always established a backtrack point\, so (unless it was inside an atomic group) a subsequent failure must then backtrack onto it. How has this not happened?

We have never really specified how quantifiers and verbs should interact\, and I could go along with an interpretation that says we are doing it wrong. Might need Larry to rule on it tho\, this was copied from Perl6.

I await the ruling!

Regards\, Philip

-- Philip Hazel

p5pRT commented 5 years ago

From @demerphq

On Tue\, 2 Jul 2019 at 16​:44\, \ph10@&#8203;hermes\.cam\.ac\.uk wrote​:

On Tue\, 2 Jul 2019\, yves orton via RT wrote​:

Well I have to admit its a bit murky\, but I think the match should succeed. (*COMMIT) relates to how perl moves the start point on failure. This is an anchored match\, so it effectively has a (*COMMIT) at the very start anyway\, and the match does not fail at the given start point\, it succeeds.

Hmm. I can now see that there is an alternative interpretation of (*COMMIT)\, which is "If (*COMMIT) is passed during a match attempt\, carry on as normal\, but if the overall match fails\, do not advance the start pointer." However\, under that interpretation\, this match should succeed​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&\<\n"; } else { print "no \n"; }' no

No\, because this one definitely DOES backtrack into into the COMMIT op. Right after it fails to match 'c'.

Incidentally\, the \A at the start of my previous example is not relevant. Leaving it out makes no difference. So the question is why are these two different?

/.*(*COMMIT)c/ does not match "abcd" /(?​:.(*COMMIT))*c/ does match "abcd" (matches "abc")

Because the * is on a group\, the group is rolled back as a group\, so we never backtrack into the COMMIT in a technical sense\, instead we unroll the attempt to match the group as a whole. I am talking purely implementation here. Try add -Mre=debug and you will see what I mean in the two cases.

Under your interpretation\, shouldn't the first one match? After all\, it can find a match without advancing the starting point.

Again no\, because it backtracks into the (*COMMIT). I am not saying this is *right* by the way\, just that it is what is happening.

From the docs​: It's a zero-width pattern similar to "(*SKIP)"\, except that when backtracked into on failure it causes the match to fail outright. No further attempts to find a valid match by advancing the start pointer will occur again.

In this case technically we never backtracked into the COMMIT\,

As an outsider looking at Perl\, I don't understand that. I thought that passing (*COMMIT) always established a backtrack point\, so (unless it was inside an atomic group) a subsequent failure must then backtrack onto it. How has this not happened?

Because (?​:...)* is compiled into a construct called CURLYM which does the work of unrolling the group\, and we never actually enter the COMMIT regop via backtracking. You could argue it is an optimisation gone wrong. You could also argue that a commit in a quantifier doesnt make sense.

The problem I think is that there is a clash between the purist DFA interpretation of a star quantifier\, and the backtracking that we actually use.

Lets just take a simple case​:

"aaab"=~/a*x/

The rules dont really specify if we ever actually "fail to match" with a star quantifier.

You can say that "a*" matches the first three a's and then fails on the b and backtracks by one then allowing the x to match. And if you did you likely would match what actually happens in the regex engine. Roughly speaking this would be backtracking interpretation.

BUT

There is another equally valid interpretation that a* matches ONLY the first three a's and never even tries to match 'a' to 'x'. This would not match how the engine actually works in perl\, but is perfectly consistent with a purist DFA for the regex.

IOW\, we would have a state machine like this​:

/ Input state | a | x | anything-else 1 | 1 | 2 | 3 2 | ACCEPT 3 | FAIL

This machine would never backtrack on "aaab"=~/a*x/;

So at a certain level it doesn't surprise me that we show both behaviors\, they both are reasonable in themselves. Its just they aren't consistent with each other....

Yves

p5pRT commented 5 years ago

From ph10@hermes.cam.ac.uk

On Tue\, 2 Jul 2019\, yves orton via RT wrote​:

$ perl -e 'if (abcd =~ /.*(*COMMIT)c/) { print "yes >$&\<\n"; } else { print "no \n"; }' no

No\, because this one definitely DOES backtrack into into the COMMIT op. Right after it fails to match 'c'.

Yves\,

Thanks for taking the time to explain all this to me. So if /.*(*COMMIT)c/ backtracks onto the (*COMMIT)\, presumably /(.*(*COMMIT))c/ would also backtrack onto (*COMMIT) (it does). Seems surprising that /(.(*COMMIT))*c/ doesn't\, but hey\, what do I know? (This all actually arose from a PCRE user's question\, incidentally. I'm not devious enough to think of putting (*COMMIT) inside a quantifier. :-)

Because the * is on a group\, the group is rolled back as a group\, so we never backtrack into the COMMIT in a technical sense\, instead we unroll the attempt to match the group as a whole.

But surely there can be backtracks into other things in a group? How about something like "aaaab" ~ /(a+)+a/ which\, as I understand it\, will swallow all the a's in the first match of the group\, fail to match a second attempt at the group so go on to try the final "a"\, then fail again\, and so must backtrack into the group in order to match. But of course this is all implementation dependent\, and subject to optimizations as well.

I am talking purely implementation here. Try add -Mre=debug and you will see what I mean in the two cases.

I've tried to avoid learning anything about Perl internals :-) but thanks for that. I may have to...

Again no\, because it backtracks into the (*COMMIT). I am not saying this is *right* by the way\, just that it is what is happening.

Understood.

Because (?​:...)* is compiled into a construct called CURLYM which does the work of unrolling the group\, and we never actually enter the COMMIT regop via backtracking. You could argue it is an optimisation gone wrong. You could also argue that a commit in a quantifier doesnt make sense.

Indeed! These are are very pathological examples.

So at a certain level it doesn't surprise me that we show both behaviors\, they both are reasonable in themselves. Its just they aren't consistent with each other....

Indeed again.

Regards\, Philip

-- Philip Hazel