Open p5pRT opened 9 years ago
(*SKIP) should get triggered if the engine attempts to backtrack across it.
Perhaps due to internal optimizations, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.
=== Case 1 ===
if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
# expected: aaardwolf
# matched: aaardvark
# note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package
=== Case 2 ===
if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# matched: aaardvark
# This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" package.
The Case 2 behavior is also inconsistent with the ever so popular (*SKIP)(*FAIL) construct, where the engine fires the (*SKIP) even when there is nothing to backtrack to the left of it. For instance,
if ('tatatiti' =~ /tata(*SKIP)(*FAIL)|.{4}/ ) { print "\$&=\"$&\"\n"; }
# $&="titi"
# this shows that (*SKIP) fired
Case 2 is also inconsistent with
if ('123ABC' =~ /123(*SKIP)B|.{3}/ ) { print "\$&='$&'\n"; }
where (*SKIP) fires (correctly IMO) even though there is nothing to backtrack to the left of it, eventually matching ABC.
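To compare the four examples above on a given perl, they can be run side by side. This is only a minimal sketch: the helper name matched_text, the %result hash and the labels are illustrative and not part of the report, and the output for Case 1 and Case 2 depends on the perl version and its optimizations.

```perl
use strict;
use warnings;

# Each entry: [ label, subject string, compiled pattern ].
# Labels are illustrative; the patterns are copied from the report.
my @cases = (
    [ 'Case 1',                 'aaaardvark aaardwolf', qr/a{1,2}(*SKIP)ard\w+/ ],
    [ 'Case 2',                 'aaaardvark aaardwolf', qr/aa(*SKIP)ard\w+/ ],
    [ '(*SKIP)(*FAIL)',         'tatatiti',             qr/tata(*SKIP)(*FAIL)|.{4}/ ],
    [ '(*SKIP) in alternation', '123ABC',               qr/123(*SKIP)B|.{3}/ ],
);

# Return the matched text, or undef when the pattern does not match.
sub matched_text {
    my ($subject, $re) = @_;
    return $subject =~ $re ? $& : undef;
}

my %result;
for my $case (@cases) {
    my ($label, $subject, $re) = @$case;
    $result{$label} = matched_text($subject, $re);
    printf "%-24s %s\n", $label, $result{$label} // '(no match)';
}
```

If (*SKIP) fires in Case 1 and Case 2 the way it does in the last two examples, the first two lines should both report aaardwolf.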
On 12 October 2015 at 02:40, Rex <perlbug-followup@perl.org> wrote:
# New Ticket Created by Rex
# Please include the string: [perl #126327]
# in the subject line of all future correspondence about this issue.
# <URL: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=126327 >
(*SKIP) should get triggered if the engine attempts to backtrack across it.
Perhaps due to internal optimizations, (*SKIP) is not getting triggered in cases where backtracking across (*SKIP) is expected.
Yes, other optimizations kick in, which means that in some cases it does not even try the pattern.
=== Case 1 ===
if ('aaaardvark aaardwolf' =~ /a{1,2}(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# After failing to match the "r", an attempt to backtrack into the {1,2} should trigger (*SKIP)
# expected: aaardwolf
# matched: aaardvark
# note: PCRE matches the expected aaardwolf, as does Python's alternate "regex" package
The mandatory minimal string in the pattern is "aard". If we do not see an "aard" in the string then we do not even try the regex engine.
=== Case 2 ===
if ('aaaardvark aaardwolf' =~ /aa(*SKIP)ard\w+/ ) { print "\$&=\"$&\"\n"; }
# matched: aaardvark
# This is more open to interpretation. Even though there is nothing to backtrack to the left of (*SKIP), a naive path exploration would cause the engine to backtrack to the beginning of the string, triggering (*SKIP). This is the interpretation chosen by PCRE (even though internally it does not backtrack) as well as Python's alternate "regex" package.
Again this is an interaction with the minimal substring optimization. We jump directly to the 2nd char, which is the first place the mandatory substring "aaard" is found.
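One way to see which fixed strings the optimiser has extracted from a pattern is regmust() from the core re module, which returns the longest anchored and longest floating fixed strings for a compiled regex. The sketch below is illustrative only: the helper name required_strings and the labels are made up here, and it is an assumption that the strings reported are the same ones driving the start-position behaviour described above.

```perl
use strict;
use warnings;
use re qw(regmust);

# Report the fixed strings the optimiser extracted from a compiled pattern.
# Either value may be undef when no such string was found.
sub required_strings {
    my ($qr) = @_;
    my ($anchored, $floating) = regmust($qr);
    return { anchored => $anchored, floating => $floating };
}

# Patterns copied from the two cases in this ticket.
my %patterns = (
    'Case 1' => qr/a{1,2}(*SKIP)ard\w+/,
    'Case 2' => qr/aa(*SKIP)ard\w+/,
);

for my $label (sort keys %patterns) {
    my $must = required_strings($patterns{$label});
    printf "%s: anchored=%s floating=%s\n", $label,
        map { defined $_ ? "'$_'" : 'undef' } @{$must}{qw(anchored floating)};
}
```

For a more detailed view of the compiled program and the optimiser's decisions, running the same match under perl -Mre=debug prints the full trace to STDERR.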
When I originally implemented these directives I decided that they would NOT disable general optimizations. In a few cases I had been frustrated by (??{}) and (?{}) doing so, and decided not to repeat the same for the backtracking verbs.
I think probably the best way to fix this is to have a modifier flag which disables start position optimisations, so people can opt in if they wish.
Alternatively, maybe I just didn't make the right decision about which verbs should disable optimisations.
I was going to say that you can stick (??{ "" }) in your pattern to disable the required string optimisation, but either I misremember that that used to work, or something has changed with how that works.
I will try to follow up on this stuff when I get time.
Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
The RT System itself - Status changed from 'new' to 'open'
Hi Yves,
Thank you very much for looking into this!
Let me preface this by saying that I find it wonderful that these verbs are even in the language. Thank you for this facility and all your hard work. A few weeks ago (*SKIP) and (*PRUNE) were picked up by Python's alternate `regex` package, so their influence is slowly spreading.
I'm a little sad about the crippling of (*SKIP) by optimizations. Could this be a case where it's less important to save time by studying the pattern than to preserve the intent expressed by the pattern writer?
You mentioned two possible directions: disabling optimizations for (*SKIP), or introducing a verb to do that. If you choose the second direction, may I suggest (*NO_START_OPT)?
This would make it compatible with PCRE. This modifier is explained in this section about start-of-pattern modifiers.
http://www.rexegg.com/regex-modifiers.html#pcre
Usually PCRE regex borrows from Perl, but there have been occasions when the reverse has taken place: http://www.rexegg.com/pcre-documentation.html#perl_pcre
I'm preparing a long page to explain backtracking control verbs in the three engines that support them (Perl, PCRE and to a lesser extent Python via the alternate regex package), and that's how I happened to notice these behaviors.
With many thanks and kindest regards,
Rex
Migrated from rt.perl.org#126327 (status was 'open')
Searchable as RT126327$