Closed User4martin closed 1 year ago
The interesting part is what TRegExpr does:
[aA](?R)?(?:X|([bcBC]))(?R)?\1
on aABBcAXBc matches 'aABBcAXBc' (group 1 = c)
Let's see what happens. The uppercase letters are what is matched inside each recursion.
To show this, the same happens with
[aA]((?R))?(?:X|([bcBC]))((?R))?\2
on the same text, with the same result. Only now the 3 groups are:
ABB: capture of the first recursion
There is only one level of recursion. Inside the recursion, the ? allows that no deeper recursion is done.
c is captured, and \1 matches a c.
B is captured, and \1 matches a B.
X: which means that (?:X|([bcBC])) matches the X and the capture is not executed. => Then \1 should not be able to match anything. But \1 in the 2nd call to the recursion does match the result of the capture in the first call to the recursion.
https://regex101.com/r/rpVYeA/1 => the 2nd recursion does not see the result of the first recursion in this case.
@Alexey-T
I think TRegExpr works correctly. \1 does not match the capture from the outer recursion step.
That part is fine with me. (If it turns out that I need something else, it could become an option, but for now I don't.)
The bigger question is, what happens when the recursion is entered a 2nd time (on the same level)?
I think the current behaviour of TRegExpr for this is wrong. => when the first call to the recursion ends, all groups should be restored. When the 2nd call enters, it should start "at empty" again.
TRegExprBounds = record
  GrpStart: array [0 .. RegexMaxGroups - 1] of PRegExprChar; // pointer to group start in InputString
  GrpEnd: array [0 .. RegexMaxGroups - 1] of PRegExprChar; // pointer to group end in InputString
end;
TRegExprBoundsArray = array[0 .. RegexMaxRecursion] of TRegExprBounds;
...
GrpBounds: TRegExprBoundsArray;
So all group bounds are tied to the recursion level. When a new level is entered, all groups are cleared and filled again. When returning to the previous level, all old groups are still sitting in the array, so they are 'restored'. Is this code OK?
And the index into the array is the regRecursion field.
Yes, I have seen the index, and yes: Clear on enter.
That is, regRecursion will still be increased, thereby keeping the current level as it is (so it won't be modified within the recursion).
But when the recursion is entered:
Inc(regRecursion);
FillChar(GrpBounds[regRecursion], SizeOf(GrpBounds[regRecursion]), 0);
And that way the 2nd recursion gets a clean slate.
On the other side we save a bit of work in
procedure TRegExpr.ClearInternalIndexes;
begin
FillChar(GrpBounds[0], SizeOf(GrpBounds[0]), 0);
and ClearMatches (I have to check why that is duplicated, and remove one).
It no longer needs to clear for recursions.
Btw, my question was aimed at the above code.
My lexer runs about 7 times faster (in other words in less than 15% of the current time) if I make those changes. Having to clear 3.6KB of memory before each match is extremely time consuming.
With the change, the memory will only be cleared if a recursion actually needs it. Though if there are several side-by-side calls to the recursion, then more clearing is done (but that is required to fix the behaviour).
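The clear-on-enter idea discussed above can be sketched in isolation. This is only a minimal model, not the TRegExpr source: the limits MaxGroups and MaxRecursion, the Level variable, and the helpers EnterRecursion and LeaveRecursion are hypothetical names, and integer offsets stand in for the PRegExprChar pointers.

```pascal
program RecursionBoundsSketch;
{$mode objfpc}

const
  MaxGroups = 4;     // hypothetical small limits, for illustration only
  MaxRecursion = 3;

type
  TBounds = record
    GrpStart: array[0 .. MaxGroups - 1] of Integer; // 0 = group not set
    GrpEnd: array[0 .. MaxGroups - 1] of Integer;
  end;

var
  GrpBounds: array[0 .. MaxRecursion] of TBounds;
  Level: Integer = 0; // stands in for regRecursion

// Hypothetical helper: go one level deeper and wipe only that level's slot.
// No overflow check on Level; a real engine would need one.
procedure EnterRecursion;
begin
  Inc(Level);
  FillChar(GrpBounds[Level], SizeOf(GrpBounds[Level]), 0);
end;

// Hypothetical helper: the caller's slot was never touched, so its
// groups are "restored" simply by stepping back.
procedure LeaveRecursion;
begin
  Dec(Level);
end;

begin
  // Per match, only level 0 has to be cleared up front.
  FillChar(GrpBounds[0], SizeOf(GrpBounds[0]), 0);

  // The outer level captures group 1.
  GrpBounds[0].GrpStart[1] := 5;
  GrpBounds[0].GrpEnd[1] := 6;

  EnterRecursion; // first (?R)
  GrpBounds[Level].GrpStart[1] := 2;
  GrpBounds[Level].GrpEnd[1] := 3;
  LeaveRecursion;

  EnterRecursion; // second (?R), same level, starts from a clean slot
  WriteLn('group 1 set in 2nd call: ', GrpBounds[Level].GrpStart[1] <> 0);
  LeaveRecursion;

  WriteLn('outer group 1 start: ', GrpBounds[0].GrpStart[1]);
end.
```

In this model the second call cannot see what the first call captured, and the outer level keeps its own capture. The cost of clearing moves from every match to every recursion entry, which matches the trade-off described above.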
So you are suggesting clearing GrpBounds in a smarter, faster way. I agree, please do it.
What about the LOGIC of backrefs with groups? Is our code's logic OK?
The only change in logic should be to the example in my first comment:
The interesting part is that TRegExpr does [aA](?R)?(?:X|([bcBC]))(?R)?\1 on aABBcAXBc matches 'aABBcAXBc' (group 1 = c)
IMHO that logic is currently wrong anyway.
In future, the 2nd (?R) in this regex will not see the captures of the first (?R).
Both recursions (not nested) run at level regRecursion=1, and both run "clean" => with an empty set of groups in GrpBounds[1].
Do you mean that the changed clearing of GrpBounds will fix the logic? If not, what do you suggest to fix the logic?
As for speed.
It should. I need to add the tests still.
Please do it and test the logic then
This may not be a bug, but it is at least a question of how it should be expected to work.
https://regex101.com/r/FrdqmJ/1
(?:x|([abc]))(?R)?-\1*
matches aabxa-a-b-b-a-a
Same match with
(?:x|([abc]))(?R)?-\1
(no * for the backref)
From the innermost recursion outwards:
a-a: captures a, and \1 matches the a
x___-b: does not capture, but \1 matches the b of the calling (next outer) recursion.
b_____-b: captures b, and \1 matches the b
However, in TRegExpr:
(?:x|([abc]))(?R)?-\1* matches xa-a-
(?:x|([abc]))(?R)?-\1 matches a-a
Changing the text to
aabxa-a--b-a-a
(?:x|([abc]))(?R)?-\1* matches the full text.
In conclusion: with TRegExpr the backreference \1, if the current recursion has no capture of its own, does not match the outer caller's capture. That may be seen as valid or not. It is first a decision, and then possibly a documentation issue.
It may also be worth documenting that if the capture for a backreference either does not exist or has not been triggered yet, then in TRegExpr the backreference returns false (it fails to match).
This is the same in PCRE. But the ECMA engine matches empty in that case. https://regex101.com/r/7C9IM7/1
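To re-check the cases above against whatever TRegExpr version is at hand, a small probe program along these lines may help. It is a sketch: the Probe helper and the program name are made up here, and it assumes a TRegExpr version that supports (?R). Older versions may raise an exception when the expression is compiled. No expected output is given, since the result depends on the version and on how the behaviour discussed here is finally resolved.

```pascal
program BackrefRecursionProbe;
{$mode objfpc}{$H+}

uses
  RegExpr; // TRegExpr unit; (?R) support depends on the version in use

// Hypothetical helper: run one pattern on one text and print the overall match.
procedure Probe(const APattern, AText: string);
var
  R: TRegExpr;
begin
  R := TRegExpr.Create;
  try
    R.Expression := APattern;
    if R.Exec(AText) then
      WriteLn(APattern, ' on ', AText, ' => ', R.Match[0])
    else
      WriteLn(APattern, ' on ', AText, ' => no match');
  finally
    R.Free;
  end;
end;

begin
  // The three cases from the comment above.
  Probe('(?:x|([abc]))(?R)?-\1*', 'aabxa-a-b-b-a-a');
  Probe('(?:x|([abc]))(?R)?-\1', 'aabxa-a-b-b-a-a');
  Probe('(?:x|([abc]))(?R)?-\1*', 'aabxa-a--b-a-a');
end.
```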