andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Recursion and backreferences #324

Closed User4martin closed 1 year ago

User4martin commented 1 year ago

This may not be a bug, but at least a question how it should be expected to work.

https://regex101.com/r/FrdqmJ/1

(?:x|([abc]))(?R)?-\1* matches aabxa-a-b-b-a-a

Same match with (?:x|([abc]))(?R)?-\1 (no * for the backref)

From most inner recursion to outwards

However TRegExpr

(?:x|([abc]))(?R)?-\1* matches xa-a- (?:x|([abc]))(?R)?-\1 matches a-a

changing the text to aabxa-a--b-a-a (?:x|([abc]))(?R)?-\1* matches the full text

In conclusion: With TRegExpr the backreference \1 - if the current recursion has no capture of its own - does not match the outer callers capture.

That may be seen as valid or not. It is at first a decision, and then possibly a documentation issue.


It may also want to be documented, that if the capture for a back reference does either not exist, or has not been triggered yet, then in TRegExpr the back-ref returns false.

This is the same in PCRE. But the ECMA engine matches empty in that case. https://regex101.com/r/7C9IM7/1

User4martin commented 1 year ago

The interesting part is that TRegExpr does [aA](?R)?(?:X|([bcBC]))(?R)?\1 on aABBcAXBc matches 'aABBcAXBc' (group 1 = c)

Lets see what happens. All uppercase letters are what is matched in each recursion.

To show this, the same happens [aA]((?R))?(?:X|([bcBC]))((?R))?\2 on the same text, and the same result. Only now the 3 groups are

There is only one level of recursion. Inside the recursion the ? allows that no deeper recursion is done.

But \1 in the 2nd call to the recursion does match the result of the capture in the first call to the recursion.

https://regex101.com/r/rpVYeA/1 => the 2nd recursion does not see the result of the first recursion in this case.

User4martin commented 1 year ago

@Alexey-T

Alexey-T commented 1 year ago

I think TRegExpr works correct. \1 do not match capture from outer recursion step.

User4martin commented 1 year ago

I think TRegExpr works correct. \1 do not match capture from outer recursion step.

That part is fine with me. (if it turns out, that I need something else, it could be an option), but for now I don't.


The bigger question is, what happens when the recursion is entered a 2nd time (on the same level)?

I think the current behaviour of TRegExpr for this is wrong. => when the first call to the recursion ends, all groups should be restored. When the 2nd call enters, it should start "at empty" again.

Alexey-T commented 1 year ago

del

Alexey-T commented 1 year ago
  TRegExprBounds = record
    GrpStart: array [0 .. RegexMaxGroups - 1] of PRegExprChar; // pointer to group start in InputString
    GrpEnd: array [0 .. RegexMaxGroups - 1] of PRegExprChar; // pointer to group end in InputString
  end;
  TRegExprBoundsArray = array[0 .. RegexMaxRecursion] of TRegExprBounds;

...
    GrpBounds: TRegExprBoundsArray;

so all group-bounds is bound to recursion level. when new level is entered, all groups are cleared, and filled again. when returning to prev level, all old groups are sitting in the array so they are 'restored'. it's ok code?

Alexey-T commented 1 year ago

And index in array is regRecursion field.

User4martin commented 1 year ago

Yes, I have seen the index, and yes: Clear on enter.

That is, regRecursion will still be increased. Thereby keeping the current level as it is (so it wont be modified within the recursion). But when the recursion enters

            Inc(regRecursion);
            FillChar(GrpBounds[regRecursion], SizeOf(GrpBounds[regRecursion]), 0);

And that way the 2nd recursion gets a clean slate.

On the other side we save a bit of work in

procedure TRegExpr.ClearInternalIndexes;
begin
  FillChar(GrpBounds[0], SizeOf(GrpBounds[0]), 0); 

and ClearMatches (I have to check why that is duplicated, and remove one).

It no longer needs to clear for recursions.


Btw, the reason for my question were aiming at the above code.

My lexer runs about 7 times faster (in other words in less than 15% of the current time) if I make those changes. Having to clear 3.6KB of memory before each match is extremely time consuming.

With the change the memory will only be cleared, if a recursion is actually needing it. Though if there are several side by side calls to recursion, then more cleaning is done (but that is required to fix the behaviour).

Alexey-T commented 1 year ago

so you are suggesting to smarter clear grpBounds. Faster. I agree, pls do it.

what aboud LOGIC of backrefs with groups? Our code has ok logic?

User4martin commented 1 year ago

The only change in logic should be to the example in my first comment

The interesting part is that TRegExpr does aA?(?:X|([bcBC]))(?R)?\1 on aABBcAXBc matches 'aABBcAXBc' (group 1 = c)

IMHO that logic is currently wrong anyway.

In future the 2nd (?R) in this regex will not see the captures of the first (?R).

Both recursions (not nested), run at level regRecursion=1 . And both run "clean" => with an empty groups GrpBounds[1]

Alexey-T commented 1 year ago

Do you mean that changed clearing of grpBounds will lead to fixed logic? If not, what do you suggest to fix the logic?

User4martin commented 1 year ago

As for speed.

User4martin commented 1 year ago

Do you mean that changed clearing of grpBounds will lead to fixed logic? If not, what do you suggest to fix the logic?

It should. I need to add the tests still.

Alexey-T commented 1 year ago

Please do it and test the logic then