Open p5pRT opened 13 years ago
It seems to me that\, if what precedes (*THEN) in a branch matches only a fixed string (no backtracking points)\, then the behaviour should be exactly the same as if (*THEN) is not present. Here is an example where that is not so:
Pattern: /^.*?(?(?=a)a|b(*THEN)c)/ Subject: ba Result: no match
Pattern: /^.*?(?(?=a)a|bc)/ Subject: ba Result: matches "ba"
I noticed this because I have just fixed the same bug in PCRE.
Philip Hazel
On Wed Jun 15 11:26:50 2011\, ph10@hermes.cam.ac.uk wrote:
It seems to me that\, if what precedes (*THEN) in a branch matches only a fixed string (no backtracking points)\, then the behaviour should be exactly the same as if (*THEN) is not present. Here is an example where that is not so:
Pattern: /^.*?(?(?=a)a|b(*THEN)c)/ Subject: ba Result: no match
Pattern: /^.*?(?(?=a)a|bc)/ Subject: ba Result: matches "ba"
I noticed this because I have just fixed the same bug in PCRE.
Do the pipes in the (?(...)...) condition expression count as regular alternations? It seems they donât. Should they?
The RT System itself - Status changed from 'new' to 'open'
On Sun\, 11 Sep 2011\, Father Chrysostomos via RT wrote:
On Wed Jun 15 11:26:50 2011\, ph10@hermes.cam.ac.uk wrote:
It seems to me that\, if what precedes (*THEN) in a branch matches only a fixed string (no backtracking points)\, then the behaviour should be exactly the same as if (*THEN) is not present. Here is an example where that is not so:
Pattern: /^.*?(?(?=a)a|b(*THEN)c)/ Subject: ba Result: no match
Pattern: /^.*?(?(?=a)a|bc)/ Subject: ba Result: matches "ba"
I noticed this because I have just fixed the same bug in PCRE.
Do the pipes in the (?(...)...) condition expression count as regular alternations? It seems they donât. Should they?
The documentation for THEN says that it tries "the next alternation in the innermost enclosing group". And it also says "Note that if this operator is used and NOT inside of an alternation then it acts exactly like the "(*PRUNE)" operator."
So yes\, I guess it all depends on whether or not the branches of a conditional subpattern count as regular alternations. My feeling is that they should\, so that all branches behave in the same way in regard to (*THEN). That is\, if (*THEN) is backtracked onto\, it skips any previous backtracking points in the current branch\, and then fails the branch. If there are other branches (in non-conditional groups)\, or other backtracking points prior to the current group\, they will be activated.
While thinking about this and experimenting\, I've just discovered another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
However\, PCRE matches only "ac". The same thing happens with (*PRUNE). PCRE is failing the entire match when backtracking onto (*THEN)\, and moving on to the next position in the subject. Perl seems to be backtracking to the a+? item. This seems not to be in accordance with the documentation for (*PRUNE).
Regards\, Philip
-- Philip Hazel
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
On Sun\, 11 Sep 2011\, Father Chrysostomos via RT wrote:
On Wed Jun 15 11:26:50 2011\, ph10@hermes.cam.ac.uk wrote:
It seems to me that\, if what precedes (*THEN) in a branch matches only a fixed string (no backtracking points)\, then the behaviour should be exactly the same as if (*THEN) is not present. Here is an example where that is not so:
Pattern: /^.*?(?(?=a)a|b(*THEN)c)/ Subject: ba Result: no match
Pattern: /^.*?(?(?=a)a|bc)/ Subject: ba Result: matches "ba"
I noticed this because I have just fixed the same bug in PCRE.
Do the pipes in the (?(...)...) condition expression count as regular alternations? It seems they donât. Should they?
The documentation for THEN says that it tries "the next alternation in the innermost enclosing group". And it also says "Note that if this operator is used and NOT inside of an alternation then it acts exactly like the "(*PRUNE)" operator."
So yes\, I guess it all depends on whether or not the branches of a conditional subpattern count as regular alternations. My feeling is that they should\, so that all branches behave in the same way in regard to (*THEN). That is\, if (*THEN) is backtracked onto\, it skips any previous backtracking points in the current branch\, and then fails the branch. If there are other branches (in non-conditional groups)\, or other backtracking points prior to the current group\, they will be activated.
So you are saying that (?(condition)foo(*THEN)bar|baz) should jump out of the conditional group (since the |baz part is not a backtracking point\, but is only reached when the condition is false)\, but that (?(condition)foo(*THEN)bar) should fail the whole pattern (there being no |bar)?
I think it ends up being too confusing. The | in a conditional has nothing to do with regular alternation. That view is reinforced by the fact that only one pipe is permitted:
$ perl -le' /(?(1)foo|bar|baz)/' Switch (?(condition)... contains too many branches in regex; marked by \<-- HERE in m/(?(1)foo|bar| \<-- HERE baz)/ at -e line 1.
While thinking about this and experimenting\, I've just discovered another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
However\, PCRE matches only "ac". The same thing happens with (*PRUNE). PCRE is failing the entire match when backtracking onto (*THEN)\, and moving on to the next position in the subject. Perl seems to be backtracking to the a+? item. This seems not to be in accordance with the documentation for (*PRUNE).
Regards\, Philip
On Sun\, 18 Sep 2011\, Father Chrysostomos via RT wrote:
So you are saying that (?(condition)foo(*THEN)bar|baz) should jump out of the conditional group (since the |baz part is not a backtracking point\, but is only reached when the condition is false)\, but that (?(condition)foo(*THEN)bar) should fail the whole pattern (there being no |bar)?
No\, I'm not.
I think it ends up being too confusing. The | in a conditional has nothing to do with regular alternation. That view is reinforced by the fact that only one pipe is permitted:
That is certainly true\, but to me\, as a simple-minded person\, it *looks* like a regular alternation. The only difference is that the matching engine just tries one of the alternatives rather than both. Consider
/^.*?(?(?=a)a(*THEN)b|c)/ Pattern
ac Subject
It starts off trying with zero matches of "a". The condition is true\, so
it matches a\, fails on b\, and then backtracks to (*THEN). In a "normal"
group it would try the next alternative\, but a conditional group behaves
as if there is only one alternative\, so it should just backtrack as if
the group had failed\, thereby trying again with one "a" matching .* and
so eventually succeeding.
I think the same should happen for this example:
/^.*?(?(?=a)a(*THEN)b)c/ ac
Further investigation shows up another issue. If (*THEN) appears in a regular (non-conditional) group that has no alternatives\, its effect again extends beyond the group.
/^.*?(a(*THEN)b)c/ aabc
Perl gives "no match"; PCRE currently matches. However\, if we give it a dummy alternative:
/^.*?(a(*THEN)b|z)c/
then Perl (5.012003) does match. That seems very counter-intuitive to me. Perhaps\, however this does tie in with the way Perl handles conditional groups\, since they seem to have the same behaviour.
The text in perlre for *THEN says "when backtracked into on failure\, it causes the regex engine to try the next alternation in the innermost enclosing group". It doesn't say what happens if there are no alternations or indeed if *THEN occurs in the final alternation. A check with
/^.*?(z|a(*THEN)b)c/
shows that Perl does match in this case too.
While thinking about this and experimenting\, I've just discovered another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
I sometimes wonder whether these new backtracking verbs are going to prove more trouble than they are worth.
Regards\, Philip
-- Philip Hazel
On Mon\, Sep 19\, 2011 at 05:49:19PM +0100\, Philip Hazel wrote:
I sometimes wonder whether these new backtracking verbs are going to prove more trouble than they are worth.
About the only possibly useful input I think I can have after reading the thread of this bug\, and the entire documentation:
=item C\<(*THEN)> C\<(*THEN:NAME)>
This is similar to the "cut group" operator C\<::> from Perl 6. Like
C\<(*PRUNE)>\, this verb always matches\, and when backtracked into on
failure\, it causes the regex engine to try the next alternation in the
innermost enclosing group (capturing or otherwise).
Its name comes from the observation that this operation combined with the
alternation operator (C\<|>) can be used to create what is essentially a
pattern-based if/then/else block:
( COND (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
Note that if this operator is used and NOT inside of an alternation then
it acts exactly like the C\<(*PRUNE)> operator.
/ A (*PRUNE) B /
is the same as
/ A (*THEN) B /
but
/ ( A (*THEN) B | C (*THEN) D ) /
is not the same as
/ ( A (*PRUNE) B | C (*PRUNE) D ) /
as after matching the A but failing on the B the C\<(*THEN)> verb will
backtrack and try C; but the C\<(*PRUNE)> verb will simply fail.
=back
is that I'm sadly thinking that you're right about the trouble/worth trade.
I don't actually understand any of this. Which isn't a good sign\, as based on previous experience I'm going to make the possibly arrogant assumption that *I* am not the one at fault for the lack of understanding.
[Nothing uses (*THEN) in the core\, other than the 14 lines of tests for it]
Nicholas Clark
Nicholas Clark \nick@​ccl4\.org writes:
[Nothing uses (*THEN) in the core\, other than the 14 lines of tests for it]
Nor on CPAN\, other than various re::engine:: tests.
\<http://grep.cpan.me/?q=%5C%28%5C*THEN%5C%29>
-- ilmari "A disappointingly low fraction of the human race is\, at any given time\, on fire." - Stig Sandbeck Mathisen
On Tue\, 20 Sep 2011\, Nicholas Clark wrote:
About the only possibly useful input I think I can have after reading the thread of this bug\, and the entire documentation:
\
is that I'm sadly thinking that you're right about the trouble/worth trade.
I don't actually understand any of this. Which isn't a good sign\, as based on previous experience I'm going to make the possibly arrogant assumption that *I* am not the one at fault for the lack of understanding.
[Nothing uses (*THEN) in the core\, other than the 14 lines of tests for it]
I've been thinking about this some more. My naive understanding of *THEN is basically this: it is effectively just another way of doing what (?>...) does\, but with possibly simpler syntax and the added feature of *THEN:NAME. The emphasis on alternation is really a red herring. Thus\, if you have
A (*THEN) B
(where A and B are complex patterns) the matching engine\, having passed
(*THEN) and subsequently failed in B\, no longer backtracks into A. This
would be the same:
(?>A)B
If you go along with this\, it follows that\, if (*THEN) is within a group\, for example\,
C (A (*THEN) B)
then a failure in B must backtrack into C\, just as would happen with
C ((?>A) B)
In other words\, the effect of (*THEN) within a group does not propagate
back beyond the start of the group. IMHO this should also apply to
conditional groups which\, after all\, behave sort of like a group with
only one branch (just that there's a choice of which one each time).
Now\, it seems that Perl thinks differently to me. There seems to be the concept of "group with no alternation" and "group with alternation" as two different things that are handled differently. (And a conditional group is of the former type.) This was shown up by my example\, where adding what was in effect a dummy alternative to a group with only one branch caused a change in behaviour. It is a valid concept\, but I humbly submit that it is very confusing and unexpected. The fact that (*THEN) in these two examples behaves differently is\, to me at least\, counter-intuitive:
A (B (*THEN) C) A (B (*THEN) C | D)
In the first\, Perl fails the match without any backtracking if C fails; in the second\, it backtracks into A. Or that is what my experiments imply.
Regards\, Philip
PS On the matter of value/worth\, some of the other backtracking verbs behave oddly too\, though admittedly with silly examples:
Pattern: (*ACCEPT)a Subject: bax Result: Perl matches ""\, with "ax" as the remainder of the subject\, in other words\, the match point is after "b".
Pattern: (*ACCEPT) Subject: bax Result: Perl matches ""\, with "bax" as the remainder of the subject\, in other words\, the match point is at the start.
This is Perl 5.012003.
-- Philip Hazel
* Philip Hazel \ph10@​hermes\.cam\.ac\.uk [2011-09-21 11:20]:
I've been thinking about this some more. My naive understanding of *THEN is basically this: it is effectively just another way of doing what (?>...) does\, but with possibly simpler syntax and the added feature of *THEN:NAME. The emphasis on alternation is really a red herring. Thus\, if you have
A (*THEN) B
(where A and B are complex patterns) the matching engine\, having passed (*THEN) and subsequently failed in B\, no longer backtracks into A. This would be the same:
(?>A)B
That is how I understand it as well.
In which case the emphasis on alternation is not entirely a red herring\, as the question it tries to address is that both (*THEN) and (*PRUNE) could be considered the closing paren of an atomic group which differ in where the opening paren is inferred to be (start of pattern vs start of group) â which makes the most practical difference inside alternations\, where (*PRUNE) causes a very different effect from (*THEN). In most other cases the difference in matching is very much subtler.
It occurs to me that âTHENâ is really the wrong thing to call this predicate. It should be maybe âASLONGASâ. That is
A(*THEN)B
translates to something like âmatch A as long as B also matches from that point onâ.
Regards\, -- Aristotle Pagaltzis // \<http://plasmasturm.org/>
On Wed Sep 21 02:17:32 2011\, ph10@hermes.cam.ac.uk wrote:
The fact that (*THEN) in these two examples behaves differently is\, to me at least\, counter-intuitive:
A (B (*THEN) C) A (B (*THEN) C | D)
In the first\, Perl fails the match without any backtracking if C fails; in the second\, it backtracks into A. Or that is what my experiments imply.
Oddly\, I donât find that counterintuitive at all. Do we need three versions of prune/then?
On Wed\, 21 Sep 2011\, Father Chrysostomos via RT wrote:
A (B (*THEN) C) A (B (*THEN) C | D)
In the first\, Perl fails the match without any backtracking if C fails; in the second\, it backtracks into A. Or that is what my experiments imply.
Oddly\, I donât find that counterintuitive at all. Do we need three versions of prune/then?
Aha! One person's intuition is always another's totally craziness. :-) I wonder what the percentages each way would be if we surveyed the general Perl-using population? At least one other person thinks as I do\, because it was a bug report for PCRE - which was behaving more like Perl - that got me into this issue in the first place.
Is there a forum where we could ask the following question?
Folks\, consider the pattern ^A(B(*THEN)C)\, where A\, B\, and C are complex patterns. If matching fails in C\, do you expect that (a) the entire match should fail\, or (b) the matching should backtrack into A?
I will ask this question on the pcre-dev mailing list and see what answers (if any) I get. I might try Jeffrey Friedl as well. I will not be in the least offended if I am "outvoted".
It seems that you intuitively think that a group without a | is a different kind of animal to a group that contains a |\, whereas I don't. I just think that a group without a | has "one alternative"\, maybe better expressed as "one branch" (since "alternative" implies at least one of two).
I can\, however\, understand your logic; I have to say that to me it seems rather mathematically pedantic (with respect :-).
I'd rather not created yet another version of prune/then! I *thought* I understood these verbs. It seemed to me that they provide different "strengths" of pruning when backtracked onto\, as follows (from weakest to strongest):
*THEN fails the current alternation branch\, and restarts at the next alternation in the current group\, or fails the whole group if there are no more alternatives.
*PRUNE fails the current match\, but allows an advance to the next starting position (unless anchored).
*SKIP is like *PRUNE\, but can skip forward more than one character.
*COMMIT fails the entire matching process\, not allowing any further advance in the subject.
That's fairly straightforward; the issue between us is what constitutes "the current group". Having two verbs (one for me\, one for you) is just a recipe for even more confusion.
Regards\, Philip
-- Philip Hazel
On Wed Sep 21 12:04:45 2011\, ph10@hermes.cam.ac.uk wrote:
On Wed\, 21 Sep 2011\, Father Chrysostomos via RT wrote:
Oddly\, I donât find that counterintuitive at all. Do we need three versions of prune/then?
Aha! One person's intuition is always another's totally craziness. :-) I wonder what the percentages each way would be if we surveyed the general Perl-using population? At least one other person thinks as I do\, because it was a bug report for PCRE - which was behaving more like Perl - that got me into this issue in the first place.
Is there a forum where we could ask the following question?
Folks\, consider the pattern ^A(B(*THEN)C)\, where A\, B\, and C are complex patterns. If matching fails in C\, do you expect that (a) the entire match should fail\, or (b) the matching should backtrack into A?
I will ask this question on the pcre-dev mailing list and see what answers (if any) I get. I might try Jeffrey Friedl as well. I will not be in the least offended if I am "outvoted".
It seems that you intuitively think that a group without a | is a different kind of animal to a group that contains a |\, whereas I don't. I just think that a group without a | has "one alternative"\, maybe better expressed as "one branch" (since "alternative" implies at least one of two).
I can\, however\, understand your logic; I have to say that to me it seems rather mathematically pedantic (with respect :-).
I'd rather not created yet another version of prune/then! I *thought* I understood these verbs. It seemed to me that they provide different "strengths" of pruning when backtracked onto\, as follows (from weakest to strongest):
Come to think of it\, we already have your version of (*THEN)\, as (?>...). I know Iâm just repeating what you said by that. But it has just occurred to me that it belongs at the top of this list.
So your ^A(B(*THEN)C) translates into ^A((?>B)C).
My understanding of (*THEN) (and perlâs implementation)\, would be
[?>^A(B]C)
where Iâm using [?>...]\, as it doesnât nest. So the latter meaning seems more useful\, as there is no other way to get it.
*THEN fails the current alternation branch\, and restarts at the next alternation in the current group\, or fails the whole group if there are no more alternatives.
*PRUNE fails the current match\, but allows an advance to the next starting position (unless anchored).
*SKIP is like *PRUNE\, but can skip forward more than one character.
*COMMIT fails the entire matching process\, not allowing any further advance in the subject.
That's fairly straightforward; the issue between us is what constitutes "the current group". Having two verbs (one for me\, one for you) is just a recipe for even more confusion.
Regards\, Philip
On Wed\, 21 Sep 2011\, Father Chrysostomos via RT wrote:
Come to think of it\, we already have your version of (*THEN)\, as (?>...). I know Iâm just repeating what you said by that. But it has just occurred to me that it belongs at the top of this list.
So your ^A(B(*THEN)C) translates into ^A((?>B)C).
Sure ... but Perl has never shied away from having "more than one way to do it" has it? :-)
You will be please to hear that Jeff Friedl\, who has only just discovered (*THEN) and friends - I guess he's been doing other stuff - agrees with your and Perl's interpretation. He even pointed out that adding a non-matching alternation would change the behaviour. I haven't had any responses from the mailing list I asked. So on a sample of one third party\, it looks like my intuition is invalid.
At a practical level\, I am not sure it is feasible to change PCRE to behave like Perl\, so I may have to settle for documenting the difference. There are precedents (i.e. other differences).
Thanks for picking this up and having this discussion.
Regards\, Philip
-- Philip Hazel
On Fri Sep 23 02:15:05 2011\, ph10@hermes.cam.ac.uk wrote:
On Wed\, 21 Sep 2011\, Father Chrysostomos via RT wrote:
Come to think of it\, we already have your version of (*THEN)\, as (?>...). I know Iâm just repeating what you said by that. But it has just occurred to me that it belongs at the top of this list.
So your ^A(B(*THEN)C) translates into ^A((?>B)C).
Sure ... but Perl has never shied away from having "more than one way to do it" has it? :-)
You will be please to hear that Jeff Friedl\, who has only just discovered (*THEN) and friends - I guess he's been doing other stuff - agrees with your and Perl's interpretation. He even pointed out that adding a non-matching alternation would change the behaviour. I haven't had any responses from the mailing list I asked. So on a sample of one third party\, it looks like my intuition is invalid.
I wouldnât say itâs invalid\, as Perlâs documentation is a little unclear. I hope I have clarified that with commit ac9d848.
At a practical level\, I am not sure it is feasible to change PCRE to behave like Perl\, so I may have to settle for documenting the difference. There are precedents (i.e. other differences).
Thanks for picking this up and having this discussion.
Regards\, Philip
On Sun Sep 18 13:33:28 2011\, sprout wrote:
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
The change occurred with this commit:
commit d1c771f5a95fddf225347623798f65884aa6eee7 Author: Bram \p5p@​perl\.wizbit\.be Date: Thu Aug 26 13:27:24 2010 +0200
VERB nodes in the regex engine should NOT be marked as JUMPABLE.
JUMPABLE nodes can be ignored during certain phases of regex execution\,
including ones where backtracking is affected. This change disables this
behviour so that the VERBS can perform their desired results.
Committer has taken the liberty of modifying the patch so that all
VERBS are jumped\, thus making the JUMPABLE expression a little simpler.
I have left Bram's change to JUMPABLE intact\, but inside of a comment
for now.
See discussion in thread for [perl #71942] *COMMIT bypasses optimisation
for futher details.
http://rt.perl.org/rt3/Ticket/Display.html?id=71942
There appears to be room for futher optimisation here
by moving the JUMPABLE logic to regex-compile time. Currently
it is arguable that the "optimisation" this patch seeks to avoid
is actually not an optimisation at all\, as it happens OVER AND OVER
during execution of a match\, thus the extra effort might actually
outweight the benefit\, especially on large strings.
COMMIT && OP(rn) != MARKPOINT && OP(rn) != SKIP && OP(rn) != CUTGROUP) || */\ (PL_regkind[OP(rn)] == CURLY && ARG1(rn) > 0) \ ) #define IS_EXACT(rn) (PL_regkind[OP(rn)] == EXACT) diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t ...snipped...
On Fri\, 23 Sep 2011\, Father Chrysostomos via RT wrote:
You will be please to hear that Jeff Friedl\, who has only just discovered (*THEN) and friends - I guess he's been doing other stuff - agrees with your and Perl's interpretation. He even pointed out that adding a non-matching alternation would change the behaviour. I haven't had any responses from the mailing list I asked. So on a sample of one third party\, it looks like my intuition is invalid.
I wouldnât say itâs invalid\, as Perlâs documentation is a little unclear. I hope I have clarified that with commit ac9d848.
Thanks.
No doubt that will eventually make it into a release and I'll see it. For your interest\, two people on the pcre-dev list have now responded to the query\, and they went for my interpretation. So at least I'm not alone\, but I would of course prefer PCRE to be Perl-compatible. I'm still brooding about this.
Regards\, Philip
-- Philip Hazel
On Sat\, Sep 24\, 2011 at 9:41 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Sun Sep 18 13:33:28 2011\, sprout wrote:
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*PRUNE)] operator is used and NOT inside of an alternation then it acts exactly like the (*PRUNE) operator.
Consider the pattern A (*PRUNE) B\, where A and B are complex patterns. Until the (*PRUNE) verb is reached\, A may backtrack as necessary to match. Once it is reached\, matching continues in B\, which may also backtrack as necessary; however\, should B not match\, then no further backtracking will take place\, and the pattern will fail outright at the current starting position. The change occurred with this commit:
The following should match according to the docs:
/a+(*THEN)c/
/a+?(?=c)(*THEN)c/
On Mon\, Sep 26\, 2011 at 11:03 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*PRUNE)] operator is used and NOT inside of an alternation then it acts exactly like the (*PRUNE) operator.
Oops\, that should be (*NEXT) at the beginning.
On Mon\, Sep 26\, 2011 at 11:04 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Mon\, Sep 26\, 2011 at 11:03 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*PRUNE)] operator is used and NOT inside of an alternation then it acts exactly like the (*PRUNE) operator.
Oops\, that should be (*NEXT) at the beginning.
ARGH!
(*THEN)
On Mon Sep 26 20:04:14 2011\, ikegami@adaelis.com wrote:
On Sat\, Sep 24\, 2011 at 9:41 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Sun Sep 18 13:33:28 2011\, sprout wrote:
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*PRUNE)] operator is used and NOT inside of an alternation then it acts exactly like the (*PRUNE) operator.
Consider the pattern A (*PRUNE) B\, where A and B are complex patterns. Until the (*PRUNE) verb is reached\, A may backtrack as necessary to match. Once it is reached\, matching continues in B\, which may also backtrack as necessary; however\, should B not match\, then no further backtracking will take place\, and the pattern will fail outright at the current starting position.
*at the current starting position*
(*THEN) acts like (*COMMIT) in the example given above.
On Tue\, Sep 27\, 2011 at 1:22 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Mon Sep 26 20:04:14 2011\, ikegami@adaelis.com wrote:
On Sat\, Sep 24\, 2011 at 9:41 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Sun Sep 18 13:33:28 2011\, sprout wrote:
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*THEN)] operator is used and NOT inside of an alternation then it acts exactly like the (*PRUNE) operator.
Consider the pattern A (*PRUNE) B\, where A and B are complex patterns. Until the (*PRUNE) verb is reached\, A may backtrack as necessary to match. Once it is reached\, matching continues in B\, which may also backtrack as necessary; however\, should B not match\, then no further backtracking will take place\, and the pattern will fail outright at the current starting position.
*at the current starting position*
How can you check at what position it failed? I don't even know what that means (and this is in our documentation).
On Tue\, Sep 27\, 2011 at 2:44 PM\, Eric Brine \ikegami@​adaelis\.com wrote:
On Tue\, Sep 27\, 2011 at 1:22 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Mon Sep 26 20:04:14 2011\, ikegami@adaelis.com wrote:
On Sat\, Sep 24\, 2011 at 9:41 AM\, Father Chrysostomos via RT \< perlbug-followup@perl.org> wrote:
On Sun Sep 18 13:33:28 2011\, sprout wrote:
On Mon Sep 12 06:24:25 2011\, ph10@hermes.cam.ac.uk wrote:
another oddity of (*THEN).
Pattern: /a+?(*THEN)c/ Subject: aaac Result: Perl 5.012003 matches "aaac"
Thatâs strange. In 5.14 it doesnât match. I donât know which is worse.
According to the 5.14.1 docs\, it shouldn't match.
Note that if [the (*THEN)] operator is used and NOT inside of an
alternation then it acts exactly like the (*PRUNE) operator.
Consider the pattern A (*PRUNE) B\, where A and B are complex patterns. Until the (*PRUNE) verb is reached\, A may backtrack as necessary to match. Once it is reached\, matching continues in B\, which may also backtrack as necessary; however\, should B not match\, then no further backtracking will take place\, and the pattern will fail outright at the current starting position.
*at the current starting position*
How can you check at what position it failed? I don't even know what that means (and this is in our documentation).
Ok\, I understand what the docs are *trying* to convery.
The docs aren't just unclear\, they are self-contradictory. In two places\, even!
- Failing outright is exactly the opposite the failing for only one start position. - Allowing no further backtracking to take place is the opposite of allowing backtracking to try at a different start position.
On Mon\, 26 Sep 2011\, Father Chrysostomos via RT wrote:
*at the current starting position*
(*THEN) acts like (*COMMIT) in the example given above.
Aha.
Since we are quoting "man perlre"\, here's another bit of confusion:
"(?PARNO)" "(?-PARNO)" "(?+PARNO)" "(?R)" "(?0)" Similar to "(??{ code })" except it does not involve compiling any code\, instead it treats the contents of a capture buffer as an independent pattern that must match at the current position.
Consider this pattern (ignore white space):
^.*? (?1) c (?(DEFINE) (a(*THEN)b) )
When I read "an independent pattern"\, it suggests to me that the effect
of (*THEN) is confined to that independent pattern. However\, experiment
with Perl shows that I am wrong here as well. The pattern as it stands
does not match "aabc"\, but if |(*F) is added after b\, it does.
As it happens\, PCRE processes recursions/subroutines differently to Perl\, or perhaps I should say that the other way round\, since PCRE had them first. :-) So another difference probably doesn't matter much.
However\, I'm still thinking about this whole issue with regard to PCRE implementation. It tries to resolve where (*THEN) should back up to at compile time\, but looking at the example above\, it seems to me that it can only be done dynamically\, at run time.
Regards\, Philip
-- Philip Hazel
Sorry about the incredibly laggy reply. :-(
FC recently brought to my attention these threads\, which I had managed to overlook or forget or whatever.
I am doing my best to work my way through them and will reply as I go.
On 20 September 2011 18:32\, Philip Hazel \ph10@​hermes\.cam\.ac\.uk wrote:
On Tue\, 20 Sep 2011\, Nicholas Clark wrote: I've been thinking about this some more. My naive understanding of *THEN is basically this: it is effectively just another way of doing what (?>...) does\, but with possibly simpler syntax and the added feature of *THEN:NAME. The emphasis on alternation is really a red herring. Thus\, if you have
A (*THEN) B
(where A and B are complex patterns) the matching engine\, having passed (*THEN) and subsequently failed in B\, no longer backtracks into A. This would be the same:
(?>A)B
If you go along with this\, it follows that\, if (*THEN) is within a group\, for example\,
C (A (*THEN) B)
then a failure in B must backtrack into C\, just as would happen with
C ((?>A) B)
In other words\, the effect of (*THEN) within a group does not propagate back beyond the start of the group. IMHO this should also apply to conditional groups which\, after all\, behave sort of like a group with only one branch (just that there's a choice of which one each time).
Now\, it seems that Perl thinks differently to me. There seems to be the concept of "group with no alternation" and "group with alternation" as two different things that are handled differently. (And a conditional group is of the former type.)
I think you are right that perl thinks differently to you. Probably because of implementation details.
Perl's regex engine doesn't have a concept of a "group". It has a concept of an "alternation" (BRANCH)\, and the special behavior you associate with "group" parens perl actually associates with the BRANCH operator "|".
There is no group parens in the pattern /foo|bar|baz/ however there is an alternation.
The only part the group parens play is to tell the parser where the alternation starts and ends.
You could say that (?: | ) are some kind of really complicated ternary operator. IE\, not different constructs\, but optional part of a single operator.
Ditto for the association of groups to quantifiers. In reality the group parens are part of the quantifier and have no behavior of their own.
So\, when we have this:
(?:foo|bar)*
the (?: ) are part of the | operator in that they denote the beginning and end of the alternation\, and they are part of the * quantifier\, because they tell it what pattern that quantifier applies to.
Another aspect of this is that to perl /a(?:foo)b/ is the same as /afoob/. Since there is neither a quantifier on the (?:) nor a modifier in the ?: nor an alternation inside\, it is a complete no-op. In other words it is not the case that (?:foo) is a group with only one alternation. It is simply "foo".
So thinking of a (?: ... ) as a "thing" is wrong from the Perl perspective. It only becomes a thing when it has a quantifier attached\, or an alternation inside. Even when there is a modifier in the ?: it still isnt a thing\, it modifies the rules for interpreting the ....
IOW\, at an implementation level it is not that case that writing /foo|bar|baz/ is a short hand way of writing /(?:foo|bar|baz)/.
Consider this output:
$ perl -Mre=debug -e'/[ac]|[bc]/' Compiling REx "[ac]|[bc]" Final program: 1: BRANCH (13) 2: ANYOF[ac][] (25) 13: BRANCH (FAIL) 14: ANYOF[bc][] (25) 25: END (0) minlen 1 Freeing REx: "[ac]|[bc]"
$ perl -Mre=debug -e'/(?:[ac]|[bc])/' Compiling REx "(?:[ac]|[bc])" Final program: 1: BRANCH (13) 2: ANYOF[ac][] (26) 13: BRANCH (FAIL) 14: ANYOF[bc][] (26) 25: TAIL (26) 26: END (0) minlen 1 Freeing REx: "(?:[ac]|[bc])"
The program generated is the same. There is no operator for GROUP\, instead there is a chain of BRANCH operators ending in a TAIL.
(*THEN) affects how we transition from BRANCH to BRANCH\, and if we arent in a BRANCH acts like a (*PRUNE).
This gets hard to debug in Perl if you dont use charclasses with at least two chars in them\, as Perl will optimize to a TRIE.
Anyway\, I can see how it would be reasonable to consider (?:foo) to be an alternation with only one branch\, but to Perl it isnt\, and that is how you should interpret the verb operators.
The intent of the (*THEN) operator is for something like this:
/ 1111 ( (?:101)+ (*THEN) 000 | (?: 110 | 011 )+ (*THEN) 000 ) 0101 /x
If we match say 1000 '101's\, and we cannot match '000' no amount of backtracking will allow the '000' to match. Perl is not good at detecting this\, and therefore we allow the user to use the (*THEN) verb to tell Perl that the thing on the left side of the (*THEN) cant match this branch through backtracking\, and we should go to the next BRANCH to see what it would do. If there is no branch then it is the same thing as saying (*PRUNE)\, in that it effectively says that there is no match at this starting position.
Here is an example of how this can help. wc -l is useful but crude count of how much work the regex engine does for a given match. As you can see\, using the hints makes for a faster match.
$ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug\,EXECUTE -le'("xx".("abcd" x $ARGV[0]). "gyy")=~/^x+ (?: [abcd]+ [pq] | (?:ab|cd)+ [xy] | (?:abcd)+ [gh] ) y+ /x;' $i 2>&1 | wc -l; done; 61 97 169 313 601 1177 2329 4633 9241 $ for i in 1 2 4 8 16 32 64 128 256; do perl -Mre=Debug\,EXECUTE -le'("xx".("abcd" x $ARGV[0]). "gyy")=~/^x+ (?: [abcd]+ (*THEN) [pq] | (?:ab|cd)+ (*THEN) [xy] | (?:abcd)+ (*THEN) [gh] ) y+ /x;' $i 2>&1 | wc -l; done; 55 77 121 209 385 737 1441 2849 5665
The behavior of /A+? (*THEN) BC / appeared to be broken in some cases. For instance for the literals A B and C\, we /should/ match "ABC"\, however we trigger the PREGf_SKIP optimization\, which basically turns /A+ B/ into /A+ (*SKIP) B/\, which is not correct for /A+? (*THEN) BC/. You can see this by adding (?{}) at the end of the pattern:
$ ./perl -le'"aaabc"=~/a+?(*THEN)bc/ and print "match: $&"' $ ./perl -le'"aaabc"=~/a+?(*THEN)bc(?{})/ and print "match: $&"' match: abc
The (?{}) disables this optimization\, and make it do the expected thing.
Your original bug report was about this:
./perl -Ilib -Mre=Debug\,All\,FLAGS -le'"ba"=~/^.*?(?(?=a)a|b(*THEN)c)/ and print "match: $&"'
which does not match\, probably because of an optimisation in .*? handling.
An unanchored match /P/ should match the same thing as /^.*?P/\, and we see that Perl does match for that form:
./perl -le'"ba"=~/((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"' match: a cond: a
And if we removed the non-greedy quantifier modifier:
./perl -le'"ba"=~/^.*((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"' match: ba cond: a
Both of which match as expected\, however this still doesnt match:
./perl -le'"ba"=~/^.*?((?(?=a)a|b(*THEN)c))/ and print "match: $&\ncond: $1"'
which I continue to investigate.
I have pushed b8f6efdd9cee0ae5cd4adf5a15aebb2984f57fda to fix the case of /a+?(*THEN)b/ but I have more to do.
I also pushed 337ff3078c4082e843af19536e11f70d3d14bfe9 which shows the intflags.
You need to use -Mre=Debug\,All\,FLAGS to see the flags. I might change that a bit later.
Yves
On Sat\, 22 Jun 2013\, demerphq wrote:
Sorry about the incredibly laggy reply. :-(
Hi Yves\,
No problem; glad somebody has picked this up. For myself\, I can no longer remember exactly what I was complaining about!
I think you are right that perl thinks differently to you. Probably because of implementation details.
Thanks for taking the time to give a long explanation of the way Perl works\, which indeed is different to the way PCRE works - and of course it's far too late to change that now.
In a recent release of PCRE we have formulated some reasonably consistent rules for way that backtracking verbs are handled\, and documented them. In many cases the behaviour is the same as Perl's\, but there are some differences.
Anyway\, I can see how it would be reasonable to consider (?:foo) to be an alternation with only one branch\, but to Perl it isnt\, and that is how you should interpret the verb operators.
PCRE does now get this (perhaps not 100%\, but to some extent) "right" in
its current treatment of (*THEN). The documentation even says this: "A
subpattern that does not contain a | character is just a part of the
enclosing alternative; it is not a nested alternation with only one
alternative."
Regards\, Philip
-- Philip Hazel
On 23 June 2013 13:18\, \ph10@​hermes\.cam\.ac\.uk wrote:
On Sat\, 22 Jun 2013\, demerphq wrote:
Sorry about the incredibly laggy reply. :-(
Hi Yves\,
No problem; glad somebody has picked this up. For myself\, I can no longer remember exactly what I was complaining about!
FC pointed out a list of VERB issues and I am trying to work through them and decide if they are buggy\, or ambiguous or whatnot.
Definitely some of things that have been reported have been bugs. A few have been interpretation issues.
I think you are right that perl thinks differently to you. Probably because of implementation details.
Thanks for taking the time to give a long explanation of the way Perl works\, which indeed is different to the way PCRE works - and of course it's far too late to change that now.
Indeed. Both projects have a very long history and some things are difficult to change.
In a recent release of PCRE we have formulated some reasonably consistent rules for way that backtracking verbs are handled\, and documented them. In many cases the behaviour is the same as Perl's\, but there are some differences.
Id like to hear about the differences. This thread makes me wonder how many are due to bugs in how the various optimizations interact and how many are due to other issues.
Anyway\, I can see how it would be reasonable to consider (?:foo) to be an alternation with only one branch\, but to Perl it isnt\, and that is how you should interpret the verb operators.
PCRE does now get this (perhaps not 100%\, but to some extent) "right" in its current treatment of (*THEN). The documentation even says this: "A subpattern that does not contain a | character is just a part of the enclosing alternative; it is not a nested alternation with only one alternative."
Cool. I will try to find time to read the PCRE docs and look for discrepancies and investigate.
Cheers\, Yves
-- perl -Mre=debug -e "/just|another|perl|hacker/"
On Sun\, 23 Jun 2013\, demerphq wrote:
Cool. I will try to find time to read the PCRE docs and look for discrepancies and investigate.
This is perhaps the relevant section of the "pcrepattern" man page:
More than one backtracking verb
If more than one backtracking verb is present in a pattern\, the one that is backtracked onto first acts. For example\, consider this pat- tern\, where A\, B\, etc. are complex pattern fragments:
(A(*COMMIT)B(*THEN)C|ABD)
If A matches but B fails\, the backtrack to (*COMMIT) causes the entire match to fail. However\, if A and B match\, but C fails\, the backtrack to (*THEN) causes the next alternative (ABD) to be tried. This behaviour is consistent\, but is not always the same as Perl's. It means that if two or more backtracking verbs appear in succession\, all the the last of them has no effect. Consider this example:
...(*COMMIT)(*PRUNE)...
If there is a matching failure to the right\, backtracking onto (*PRUNE) causes it to be triggered\, and its action is taken. There can never be a backtrack onto (*COMMIT).
Backtracking verbs in repeated groups
PCRE differs from Perl in its handling of backtracking verbs in repeated groups. For example\, consider:
/(a(*COMMIT)b)+ac/
If the subject is "abac"\, Perl matches\, but PCRE fails because the (*COMMIT) in the second repeat of the group acts.
Backtracking verbs in assertions
(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack.
(*ACCEPT) in a positive assertion causes the assertion to succeed with- out any further processing. In a negative assertion\, (*ACCEPT) causes the assertion to fail without any further processing.
The other backtracking verbs are not treated specially if they appear in a positive assertion. In particular\, (*THEN) skips to the next alternative in the innermost enclosing group that has alternations\, whether or not this is within the assertion.
Negative assertions are\, however\, different\, in order to ensure that changing a positive assertion into a negative assertion changes its result. Backtracking into (*COMMIT)\, (*SKIP)\, or (*PRUNE) causes a neg- ative assertion to be true\, without considering any further alternative branches in the assertion. Backtracking into (*THEN) causes it to skip to the next enclosing alternative within the assertion (the normal be- haviour)\, but if the assertion does not have such an alternative\, (*THEN) behaves like (*PRUNE).
Backtracking verbs in subroutines
These behaviours occur whether or not the subpattern is called recur- sively. Perl's treatment of subroutines is different in some cases.
(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces an immediate backtrack.
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to succeed without any further processing. Matching then contin- ues after the subroutine call.
(*COMMIT)\, (*SKIP)\, and (*PRUNE) in a subpattern called as a subroutine cause the subroutine match to fail.
(*THEN) skips to the next alternative in the innermost enclosing group within the subpattern that has alternatives. If there is no such group within the subpattern\, (*THEN) causes the subroutine match to fail.
Regards\, Philip
-- Philip Hazel
@demerphq ping
Migrated from rt.perl.org#92898 (status was 'open')
Searchable as RT92898$