Closed oleedd closed 1 month ago
pcre2test:
re> /\w(?R)*\w/
data> grtgt
0: grtg
Looks correct.
Hm, I used https://regex101.com/
Got info that it uses 10.42. How is it in 10.42?
The steps are from that site as well. Is step 16 really backtrack of 3 symbols (for \w(?R)?+\w
)?
The latest release is 10.44. Anyway, I don't know how PCRE is used on that site, I suspect they emulating engines with a JavaScript implementation.
No, it is not possible. It uses a server with PCRE2.
Can you please check with 10.42?
What is step 16 for \w(?R)?+\w
with 10.44? Is this correct?
10.42:
re> /\w(?R)*\w/
data> grtgt
0: gr
It looks like there was a bug which was fixed since. You should ask that site to use the latest pcre2.
Probably this fix in 10.43:
Ok, What about my question about step 16?
What is step 16? If this is something connected with regex101 then you need to ask its managers.
PCRE2 should have a debugger or tracing which records all steps. Because it is only available for PCRE on regex101.
I wouldn't be surprised if regex101 "simply" used PCRE2_AUTO_CALLOUT
Probably yes. I am waiting for updating it to check steps and will return to close or ask. Because that backtrack of 3 symbols looks related to that fixed bug.
Running pcre2test with the -ac option (auto-callout) shows the progress of the match.
Bad that no binary version of pcre2test to try.
It was updated. But step 16 is the same and I don't understand it.
\w(?R)?+\w
for "gtykr". The first \w
is matched 5 times. 1 symbol backtrack because no second \w
for "r", and "r" becomes second \w
for "k". After that no second \w
for "y" and 3 symbols backtrack.
Why not 2 symbols backtrack, why 3?
Your pattern matches two "word" characters, optionally containing, between them, a recursion, that is, two "word" characters, optionally.... in other words, it matches an even number of "word" characters. This is exactly what it does, and it is the same result given by Perl.
PCRE2 version 10.44 2024-06-07 (8-bit) /(\w)((?R)?+)(\w)/ gtykr 0: gtyk 1: g 2: ty 3: k
I know. But I ask about steps - about the backtrack in step 16.
I do now know anything about "step 16", which I presume is some artefact of regex 101. But analysing this match
/(\w)((?R)?+)(\w)/ gtykr 0: gtyk 1: g 2: ty 3: k
goes as follows:
LEVEL 0: \w matches "g"; sets $1 to "g"; ?R recurses (remaining subject is "tykr") LEVEL 1: \w matches "t"; ?R recurses ("ykr") LEVEL 2: \w matches "y"; ?R recurses ("kr") LEVEL 3: \w matches "k"; ?R recurses ("r") LEVEL 4: \w matches "r"; ?R recurses ("") LEVEL 5: \w fails to match; backtrack causes recursion to fail LEVEL 4: (?R)? is optional, so zero match is OK; ("") LEVEL 4: the second \w fails to match; whole recursion fails LEVEL 3: the second \w matches "r" so the recursion succeeds ("") LEVEL 2: the second \w fails; recursion fails LEVEL 1: the second \w succeeds, matching "y"; sets $2 to "ty"; recursion succeeds LEVEL 0: the second \w succeeds, matching "k"; sets $3 to "k"; the whole pattern succeeds, matching "gtyk"
At least, that's how I see it. I note that Perl behaves exactly the same way. It is the mixture of succeeding and failing recursions that causes differences in backtracking in the subject.
LEVEL 2: the second \w fails; recursion fails
That is step 16. Why recursion fails? Why doesn't the second \w
match "k"? "kr" may be unmatched.
It is obvious why it fails if you check what Philip wrote.
It says: LEVEL 3: the second \w matches "r" so the recursion succeeds ("") -> we are already at the END of the string, since r is the LAST letter in the input, and whatever non-zero width item comes after that, we cannot match. LEVEL 2: the second \w fails; recursion fails -> a \w cannot succeed at the END of the string
Yes, it cannot succeed at the END of the string. But why no removing "kr" and matching "k" by the second \w
after that? That is what I asked.
There is no removal operation in regex engines, only backtracking. Backtracking is kind of a status reset, rather than an actual operation. I suspect you have a fundamentally different concept of how engines work that they actually work, so I cannot answer your question.
I meant bracktracking by removing.
I will try to explain one more time what is happening. Lets call the second \w as \w' to make it different from the first \w. They are identical otherwise. First "g" is matched to \w, then we setup a reset point and restart the pattern. Then "t" is matched to \w, setup a reset point and restart the pattern. Then "y" is matched to \w, and from that point we cannot achieve a match whatever we do, since a match requires at least four characters (\w and three \w'-s). Then we backtrack to "t", match the "y" as \w', and "k" as another \w'.
Yes, there is no forth character. But it can backtrack "kr" and the second \w
will match "k". Why doesn't it do this 2 symbol backtrack? There is still no answer to this. My question is and was very concrete (why not 2 but 3). It is only about this one step.
We are here:
LEVEL 2: the second \w fails; recursion fails
The second \w
fails because no more characters. But after that level 3 ("kr") may be backtracked, and the second \w
(level 2) will match "k".
This question is answered many times. The engine uses backtracking, not some random movement around the source. The engine NEVER retries anything starting from "k", because there are no alternatives there. The "k" is matched to the first \w, and there is no alternative to a property match. If you would change the first \w to (\w|somethingelse)
then the "somethingelse" would be retried.
It wasn't answered. A real answer should be like this: it does 3 symbols backtrack, not 2 because...
No more characters is not a reason! Because "kr" may be backtracked, and the second \w
(level 2) will match "k".
If without +
, it does 2 symbols backtrack there as expected, not 3 (and "k" is matched by the second \w
), but I don't see a reason why +
influences so. If level 2 was completed (4 symbols), then yes, it couldn't backtrack it because of +
, but it wasn't completed.
Maybe I don't know something about +
. As I know, it prevents backtracking of something that was completely matched (level 2 wasn't completely matched). Not only this?
If I use an atomic group instead of +
, it goes correctly: 2 symbols backtrack ("kr") there and the second \w
matches "k".
\w(?>(?R))?\w
It finds "gt" and "yk". It is expected for me.
But with +
instead, it backtracks 3 symbols instead of 2 and finds "gtyk". So it is weird for me. I don't understand why. The difference is in that backtrack: 3/2 symbols. It seems that it should work the same.
/\w(?R)?+\w/
is the same as /\w(?>(?R)?)\w/
. It also behaves differently. There is still no "backtracks 3 symbols", only it reverts to a previous state, and tries a different path.
Btw this pattern is funny:
data> aa
0: aa
data> aaa
0: aa
data> aaaa
0: aa
data> aaaaa
0: aaaa
data> aaaaaa
0: aaaaaa
data> aaaaaaa
0: aaaaaa
data> aaaaaaaa
0: aa
data> aaaaaaaaa
0: aaaa
data> aaaaaaaaaa
0: aaaaaa
data> aaaaaaaaaaa
0: aaaaaaaa
data> aaaaaaaaaaaa
0: aaaaaaaaaa
data> aaaaaaaaaaaaa
0: aaaaaaaaaaaa
data> aaaaaaaaaaaaaa
0: aaaaaaaaaaaaaa
data> aaaaaaaaaaaaaaa
0: aaaaaaaaaaaaaa
data> aaaaaaaaaaaaaaaa
0: aa
data> aaaaaaaaaaaaaaaaa
0: aaaa
...
I still don't understand why \w(?R)?+\w
reverts to position 2 ("y"), not to position 3 ("k") like \w(?>(?R))?\w
does.
X?+ is the same as (?>X?) which is NOT the same as (?>X)? because in that case the ? is not part of the atomic group. I am sorry we don't seem to be able to explain to you how this all works. X?+ can match X one or zero times. As it is greedy, the engine tries once. If that matches, but something later does not, there cannot be a backtrack to try zero times because of the + but without the + it could backtrack and try zero times (that is, skip the X altogether). When there is a backtrack, everything goes back to what it was at the point the backtrack was remembered. When level 3 has matched "kr" and level 2 fails, it goes back to level 1, which has the position at "ykr" because all it has matched is "t".
I did some experiments and understood why.
I thought that with +
, it can't be backtracked but can be fully unmatched if a previous group (level 2 in our case) can't be matched (previous items have more high priority). But it looks like it can be unmatched only with something that includes it.
It is not obvious because it could work differently.
\w(?R)*\w
and\w(?R)?\w
for "grtgt" returns "gr" and "tg". They should find "grtg" as with "regex" for Python and JavaScript (with repeating regex many times instead of(?R)
).Additionally: why is step 16 for
\w(?R)?+\w
on the same string ("grtgt") backtrack of 3 symbols, not 2?