Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/

(*SKIP) not triggering correctly #14979

Open p5pRT opened 9 years ago

p5pRT commented 9 years ago

Migrated from rt.perl.org#126327 (status was 'open')

Searchable as RT126327$

p5pRT commented 9 years ago

From 0perlbugs@rexegg.com

(*SKIP) should get triggered if the engine attempts to backtrack across it.

Perhaps due to internal optimizations\, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.

=== Case 1 ===

if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
# expected: aaardwolf
# matched: aaardvark
# note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package

=== Case 2 ===

if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# matched: aaardvark
# This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" package.
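For anyone who wants to rerun both cases on their own build, here is a minimal sketch. The helper name `first_match` is hypothetical and not part of the ticket. No expected output is shown, because the result depends on whether the installed perl still applies the start-position optimization discussed in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the ticket): returns the matched text, or undef.
sub first_match {
    my ($subject, $re) = @_;
    return $subject =~ $re ? $& : undef;
}

my $subject = 'aaaardvark aaardwolf';
my %cases = (
    'Case 1' => qr/a{1,2}(*SKIP)ard\w+/,
    'Case 2' => qr/aa(*SKIP)ard\w+/,
);

for my $name (sort keys %cases) {
    my $got = first_match($subject, $cases{$name});
    # Report the perl version too, since the behavior may differ between builds.
    printf "perl %vd, %s: %s\n", $^V, $name, defined $got ? qq{"$got"} : 'no match';
}
```

A result of "aaardvark" corresponds to the reported behavior, and "aaardwolf" to the behavior the reporter expects.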

p5pRT commented 9 years ago

From 0perlbugs@rexegg.com

The Case 2 behavior is also inconsistent with the ever so popular (*SKIP)(*FAIL) construct, where the engine fires the (*SKIP) even when there is nothing to backtrack to the left of it. For instance,

if ('tatatiti' =~ /tata(*SKIP)(*FAIL)|.{4}/ ) { print "\$&=\"$&\"\n"; }
# $&="titi"
# this shows that (*SKIP) fired
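To make the "(*SKIP) fired" observation easier to see, the sketch below runs the same pattern with and without the verb and reports where the match starts. The helper `describe_match` and the verb-free comparison pattern are illustrative additions, not part of the original report.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the ticket): describe the first match of $re.
sub describe_match {
    my ($subject, $re) = @_;
    return 'no match' unless $subject =~ $re;
    return sprintf '"%s" at offset %d', $&, $-[0];
}

# With (*SKIP): "tata" matches at offset 0, (*FAIL) forces backtracking across
# the verb, and the next attempt starts at offset 4 instead of offset 1.
my $with_skip = describe_match('tatatiti', qr/tata(*SKIP)(*FAIL)|.{4}/);

# Without (*SKIP): after (*FAIL), the second alternative is tried at offset 0.
my $without_skip = describe_match('tatatiti', qr/tata(*FAIL)|.{4}/);

print "with (*SKIP):    $with_skip\n";     # "titi" at offset 4
print "without (*SKIP): $without_skip\n";  # "tata" at offset 0
```

The difference in start offset is what shows the verb taking effect here, even though nothing to the left of it can be backtracked into.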

p5pRT commented 9 years ago

From 0perlbugs@rexegg.com

Case 2 is also inconsistent with

if ('123ABC' =~ /123(*SKIP)B|.{3}/ ) { print "\$&='$&'\n"; }

where (*SKIP) fires (correctly IMO) even though there is nothing to backtrack to the left of it, eventually matching ABC.
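As a rough illustration, collecting all matches with /g shows what the verb changes for this pattern. The comparison pattern without (*SKIP) is an assumption added for contrast and does not appear in the ticket.

```perl
use strict;
use warnings;

my $subject = '123ABC';

# With (*SKIP): "123" matches, "B" fails against "A", and backtracking across
# the verb moves the next attempt to offset 3, so "123" is never returned.
my @with_skip = $subject =~ /123(*SKIP)B|.{3}/g;

# Without (*SKIP): the second alternative is still tried at offset 0.
my @without_skip = $subject =~ /123B|.{3}/g;

print "with (*SKIP):    @with_skip\n";     # ABC
print "without (*SKIP): @without_skip\n";  # 123 ABC
```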

p5pRT commented 9 years ago

From @demerphq

On 12 October 2015 at 02:40, Rex <perlbug-followup@perl.org> wrote:

> # New Ticket Created by Rex
> # Please include the string: [perl #126327]
> # in the subject line of all future correspondence about this issue.
> # <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=126327 >
>
> (*SKIP) should get triggered if the engine attempts to backtrack across it.
>
> Perhaps due to internal optimizations, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.

Yes, other optimizations kick in, which means that in some cases the engine does not even try the pattern.

> === Case 1 ===
> if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
> # After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
> # expected: aaardwolf
> # matched: aaardvark
> # note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package

The mandatory minimal string in the pattern is "aard". If we do not see an "aard" in the string then we do not even try the regex engine.

> === Case 2 ===
> if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
> # matched: aaardvark
> # This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" package.

Again, this is an interaction with the minimal substring optimization. We jump directly to the 2nd char, which is the first place the mandatory substring "aaard" is found.
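The Case 2 explanation can be modelled roughly as follows. This is only a sketch of the effect described above, not the engine's actual implementation: it uses index() to locate the mandatory substring and then matches anchored at that offset. The variable names are illustrative.

```perl
use strict;
use warnings;

my $subject  = 'aaaardvark aaardwolf';
my $required = 'aaard';  # literal text every match of /aa(*SKIP)ard\w+/ must contain

# index() returns the 0-based offset of the first occurrence, or -1 if absent.
my $start = index($subject, $required);

my $matched;
if ($start < 0) {
    print "required substring absent: no need to run the pattern at all\n";
}
else {
    printf "first candidate start: offset %d (character %d)\n", $start, $start + 1;

    # Model "jumping directly" to that offset: match anchored at the candidate.
    # The attempt succeeds at once, so nothing ever backtracks across (*SKIP).
    if (substr($subject, $start) =~ /\Aaa(*SKIP)ard\w+/) {
        $matched = $&;
        print "matched: $matched\n";
    }
}
```

Under this model the attempt at offset 0, which is the one that would have triggered (*SKIP), is never made. To see what the optimizer actually extracts on a given build, running the pattern under `perl -Mre=debug` (as in the signature below) prints the compiled program and match trace; its exact output varies between versions.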

When I originally implemented these directives I decided that they would NOT disable general optimizations. In a few cases I had been frustrated by (??{}) and (?{}) doing so, and decided not to repeat the same for the backtracking verbs.

I think probably the best way to fix this is to have a modifier flag which disables start position optimisations, so people can opt in if they wish.

Alternatively, maybe I just didn't make the right decision about which verbs should disable optimisations.

I was going to say that you can stick (??{ "" }) in your pattern to disable the required string optimisation, but either I misremember that this used to work, or something has changed in how it works.

I will try to follow up on this stuff when I get time.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 9 years ago

From 0perlbugs@rexegg.com

Hi Yves,

Thank you very much for looking into this!

Let me preface this by saying that I find it wonderful that these verbs are even in the language. Thank you for this facility and all your hard work. A few weeks ago (*SKIP) and (*PRUNE) were picked up by Python's alternate `regex` package, so their influence is slowly spreading.

I'm a little sad about the crippling of (*SKIP) by optimizations. Could this be a case where it's less important to save time by studying the pattern than to preserve the intent expressed by the pattern writer?

You mentioned two possible directions: disabling optimizations for (*SKIP), or introducing a verb to do that. If you choose the second direction, may I suggest (*NO_START_OPT)? This would make it compatible with PCRE. This modifier is explained in this section about start-of-pattern modifiers:
http://www.rexegg.com/regex-modifiers.html#pcre

Usually PCRE regex borrows from Perl, but there have been occasions when the reverse has taken place: http://www.rexegg.com/pcre-documentation.html#perl_pcre

I'm preparing a long page to explain backtracking control verbs in the three engines that support them (Perl, PCRE and to a lesser extent Python via the alternate regex package), and that's how I happened to notice these behaviors.

With many thanks and kindest regards,

Rex