PCRE2Project / pcre2

PCRE2 development is now based here.
Other
921 stars 194 forks source link

Some bug #456

Closed oleedd closed 1 month ago

oleedd commented 2 months ago

\w(?R)*\w and \w(?R)?\w for "grtgt" returns "gr" and "tg". They should find "grtg" as with "regex" for Python and JavaScript (with repeating regex many times instead of (?R)).

Additionally: why is step 16 for \w(?R)?+\w on the same string ("grtgt") backtrack of 3 symbols, not 2?

zherczeg commented 2 months ago

pcre2test:

  re> /\w(?R)*\w/
data> grtgt
 0: grtg

Looks correct.

oleedd commented 2 months ago

Hm, I used https://regex101.com/ Got info that it uses 10.42. How is it in 10.42? The steps are from that site as well. Is step 16 really backtrack of 3 symbols (for \w(?R)?+\w)?

zherczeg commented 2 months ago

The latest release is 10.44. Anyway, I don't know how PCRE is used on that site, I suspect they emulating engines with a JavaScript implementation.

oleedd commented 2 months ago

No, it is not possible. It uses a server with PCRE2. Can you please check with 10.42? What is step 16 for \w(?R)?+\w with 10.44? Is this correct? image

zherczeg commented 2 months ago

10.42:

  re> /\w(?R)*\w/
data> grtgt
 0: gr

It looks like there was a bug which was fixed since. You should ask that site to use the latest pcre2.

PhilipHazel commented 2 months ago

Probably this fix in 10.43:

  1. Refactor the handling of whole-pattern recursion (?0) in pcre2_match() so that its end is handled similarly to other recursions. This has altered the behaviour of /|(?0)./endanchored which was previously not right.
oleedd commented 2 months ago

Ok, What about my question about step 16?

PhilipHazel commented 2 months ago

What is step 16? If this is something connected with regex101 then you need to ask its managers.

oleedd commented 2 months ago

PCRE2 should have a debugger or tracing which records all steps. Because it is only available for PCRE on regex101.

ltrzesniewski commented 2 months ago

I wouldn't be surprised if regex101 "simply" used PCRE2_AUTO_CALLOUT

oleedd commented 2 months ago

Probably yes. I am waiting for updating it to check steps and will return to close or ask. Because that backtrack of 3 symbols looks related to that fixed bug.

PhilipHazel commented 2 months ago

Running pcre2test with the -ac option (auto-callout) shows the progress of the match.

oleedd commented 2 months ago

Bad that no binary version of pcre2test to try.

oleedd commented 2 months ago

It was updated. But step 16 is the same and I don't understand it. \w(?R)?+\w for "gtykr". The first \w is matched 5 times. 1 symbol backtrack because no second \w for "r", and "r" becomes second \w for "k". After that no second \w for "y" and 3 symbols backtrack. image Why not 2 symbols backtrack, why 3?

PhilipHazel commented 2 months ago

Your pattern matches two "word" characters, optionally containing, between them, a recursion, that is, two "word" characters, optionally.... in other words, it matches an even number of "word" characters. This is exactly what it does, and it is the same result given by Perl.

PCRE2 version 10.44 2024-06-07 (8-bit) /(\w)((?R)?+)(\w)/ gtykr 0: gtyk 1: g 2: ty 3: k

oleedd commented 2 months ago

I know. But I ask about steps - about the backtrack in step 16.

PhilipHazel commented 2 months ago

I do now know anything about "step 16", which I presume is some artefact of regex 101. But analysing this match

/(\w)((?R)?+)(\w)/ gtykr 0: gtyk 1: g 2: ty 3: k

goes as follows:

LEVEL 0: \w matches "g"; sets $1 to "g"; ?R recurses (remaining subject is "tykr") LEVEL 1: \w matches "t"; ?R recurses ("ykr") LEVEL 2: \w matches "y"; ?R recurses ("kr") LEVEL 3: \w matches "k"; ?R recurses ("r") LEVEL 4: \w matches "r"; ?R recurses ("") LEVEL 5: \w fails to match; backtrack causes recursion to fail LEVEL 4: (?R)? is optional, so zero match is OK; ("") LEVEL 4: the second \w fails to match; whole recursion fails LEVEL 3: the second \w matches "r" so the recursion succeeds ("") LEVEL 2: the second \w fails; recursion fails LEVEL 1: the second \w succeeds, matching "y"; sets $2 to "ty"; recursion succeeds LEVEL 0: the second \w succeeds, matching "k"; sets $3 to "k"; the whole pattern succeeds, matching "gtyk"

At least, that's how I see it. I note that Perl behaves exactly the same way. It is the mixture of succeeding and failing recursions that causes differences in backtracking in the subject.

oleedd commented 2 months ago

LEVEL 2: the second \w fails; recursion fails

That is step 16. Why recursion fails? Why doesn't the second \w match "k"? "kr" may be unmatched.

zherczeg commented 2 months ago

It is obvious why it fails if you check what Philip wrote.

It says: LEVEL 3: the second \w matches "r" so the recursion succeeds ("") -> we are already at the END of the string, since r is the LAST letter in the input, and whatever non-zero width item comes after that, we cannot match. LEVEL 2: the second \w fails; recursion fails -> a \w cannot succeed at the END of the string

oleedd commented 2 months ago

Yes, it cannot succeed at the END of the string. But why no removing "kr" and matching "k" by the second \w after that? That is what I asked.

zherczeg commented 2 months ago

There is no removal operation in regex engines, only backtracking. Backtracking is kind of a status reset, rather than an actual operation. I suspect you have a fundamentally different concept of how engines work that they actually work, so I cannot answer your question.

oleedd commented 2 months ago

I meant bracktracking by removing.

zherczeg commented 2 months ago

I will try to explain one more time what is happening. Lets call the second \w as \w' to make it different from the first \w. They are identical otherwise. First "g" is matched to \w, then we setup a reset point and restart the pattern. Then "t" is matched to \w, setup a reset point and restart the pattern. Then "y" is matched to \w, and from that point we cannot achieve a match whatever we do, since a match requires at least four characters (\w and three \w'-s). Then we backtrack to "t", match the "y" as \w', and "k" as another \w'.

oleedd commented 2 months ago

Yes, there is no forth character. But it can backtrack "kr" and the second \w will match "k". Why doesn't it do this 2 symbol backtrack? There is still no answer to this. My question is and was very concrete (why not 2 but 3). It is only about this one step. We are here:

LEVEL 2: the second \w fails; recursion fails

The second \w fails because no more characters. But after that level 3 ("kr") may be backtracked, and the second \w (level 2) will match "k".

zherczeg commented 2 months ago

This question is answered many times. The engine uses backtracking, not some random movement around the source. The engine NEVER retries anything starting from "k", because there are no alternatives there. The "k" is matched to the first \w, and there is no alternative to a property match. If you would change the first \w to (\w|somethingelse) then the "somethingelse" would be retried.

oleedd commented 2 months ago

It wasn't answered. A real answer should be like this: it does 3 symbols backtrack, not 2 because... No more characters is not a reason! Because "kr" may be backtracked, and the second \w (level 2) will match "k". If without +, it does 2 symbols backtrack there as expected, not 3 (and "k" is matched by the second \w), but I don't see a reason why + influences so. If level 2 was completed (4 symbols), then yes, it couldn't backtrack it because of +, but it wasn't completed. Maybe I don't know something about +. As I know, it prevents backtracking of something that was completely matched (level 2 wasn't completely matched). Not only this?

oleedd commented 2 months ago

If I use an atomic group instead of +, it goes correctly: 2 symbols backtrack ("kr") there and the second \w matches "k". \w(?>(?R))?\w It finds "gt" and "yk". It is expected for me. But with + instead, it backtracks 3 symbols instead of 2 and finds "gtyk". So it is weird for me. I don't understand why. The difference is in that backtrack: 3/2 symbols. It seems that it should work the same.

zherczeg commented 2 months ago

/\w(?R)?+\w/ is the same as /\w(?>(?R)?)\w/. It also behaves differently. There is still no "backtracks 3 symbols", only it reverts to a previous state, and tries a different path.

Btw this pattern is funny:

data> aa
 0: aa
data> aaa
 0: aa
data> aaaa
 0: aa
data> aaaaa
 0: aaaa
data> aaaaaa
 0: aaaaaa
data> aaaaaaa
 0: aaaaaa
data> aaaaaaaa
 0: aa
data> aaaaaaaaa
 0: aaaa
data> aaaaaaaaaa
 0: aaaaaa
data> aaaaaaaaaaa
 0: aaaaaaaa
data> aaaaaaaaaaaa
 0: aaaaaaaaaa
data> aaaaaaaaaaaaa
 0: aaaaaaaaaaaa
data> aaaaaaaaaaaaaa
 0: aaaaaaaaaaaaaa
data> aaaaaaaaaaaaaaa
 0: aaaaaaaaaaaaaa
data> aaaaaaaaaaaaaaaa
 0: aa
data> aaaaaaaaaaaaaaaaa
 0: aaaa
...
oleedd commented 2 months ago

I still don't understand why \w(?R)?+\w reverts to position 2 ("y"), not to position 3 ("k") like \w(?>(?R))?\w does.

PhilipHazel commented 2 months ago

X?+ is the same as (?>X?) which is NOT the same as (?>X)? because in that case the ? is not part of the atomic group. I am sorry we don't seem to be able to explain to you how this all works. X?+ can match X one or zero times. As it is greedy, the engine tries once. If that matches, but something later does not, there cannot be a backtrack to try zero times because of the + but without the + it could backtrack and try zero times (that is, skip the X altogether). When there is a backtrack, everything goes back to what it was at the point the backtrack was remembered. When level 3 has matched "kr" and level 2 fails, it goes back to level 1, which has the position at "ykr" because all it has matched is "t".

oleedd commented 1 month ago

I did some experiments and understood why. I thought that with +, it can't be backtracked but can be fully unmatched if a previous group (level 2 in our case) can't be matched (previous items have more high priority). But it looks like it can be unmatched only with something that includes it. It is not obvious because it could work differently.