andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Range check error in OP_LOOP, OP_LOOPNG: #307

Closed User4martin closed 1 year ago

User4martin commented 1 year ago

regex A(?>(?:b|c(?:d|e)*?)*?_)a text Acece_x_Acece_a_

produce a range check error LoopStackIdx = 0 for TRegExprLoopStack = array [1 .. LoopStackMax] of integer;

revision 39aaae8d37d3399a04f84c14cd8b6c0d38abfdf6

Alexey-T commented 1 year ago

w/out thinking I changed lower index of this type to 0

type
  TRegExprLoopStack = array [0 .. LoopStackMax] of integer;     

now demo test_dlg handles it...

User4martin commented 1 year ago

Looking at

    LoopStackIdx: integer; // 0 - out of all loops

I don't think that is correct.

It gets increased in

      OP_LOOPENTRY:
        begin // ###0.925
          no := LoopStackIdx;
          Inc(LoopStackIdx);

OP_LOOP, OP_LOOPNG: increases and decreases it.

But that is as far as I looked.

User4martin commented 1 year ago

But just to prove the point: I changed it to start at 0, and then run the below.

RegEx A(?>(?:b|c(?:d|e(?:f|g)*?)*?)*?_)a Text: Acece_x_Acecegg_a_

The index goes to -1

User4martin commented 1 year ago

Interesting is, that it does not crash, if the text has a single "g", only with the 2nd "g".

Maybe that holds some clue. But currently got other stuff to do.

Alexey-T commented 1 year ago

Simple fix: after Dec(LoopStackIdx); add in 2 places

            if LoopStackIdx < 1 then
                LoopStackIdx := 1; 
User4martin commented 1 year ago

Then it does not match https://regex101.com/r/7FAwus/1

A(?>(?:b|c(?:d|e(?:f|g){0,2}?){1,1}?){2,2}?_)a Acece_x_Acecegg_a_

Not sure if that is the same bug, but I thing so because A(?>(?:b|c(?:d|e(?:f|g){0,2}?){0,1}?){2,2}?_)a does match.

I only changed {1,1} to {0,1}

Despite the sequence starting with e exists exactly 2 times.

But it was logical that it would happen, if the index gets decreased wrongly, then it does not only go out of range, at some time it also points to the wrong entry.


Problem is, it is easy to make it fail (based on the observations I have). But it is far from easy to track it. ("not easy" => "time consuming")

Alexey-T commented 1 year ago

maybe: you added additional Exit's for the OP_STAR, but did not add them for OP_LOOPNG - ie for {0,1}? .

no - if I change 'atomic' grp to 'normal', result don't change.

User4martin commented 1 year ago

I might have an idea what the issue is. Need to do some more checking though...

User4martin commented 1 year ago

So this were actually two issues.

1)

LoopStackIdx is incremented once in OP_LOOPENTRY When iterating a loop, each OP_LOOP makes a recursive call. Each recursive call decrements the LoopStackIdx. That is far more often than it got incremented. To correct for that, nested calls are wrapped. They store the LoopStackIdx and restore it (undoing the nested decrement, allowing themself to decrement again).

Some calls were not guarded. Fixed that.

A note on the side.


2)

When a loop tests the if at its current iteration, the pattern after the loop will be a match (and the loop therefore would have found a success), then the call to test "the pattern after" will run through the OP_LOOP of outer loops.

This is actually correct. It does actually test if the current loop, at the current iteration is fine for any iteration of the outer (that is: "not just the outers current").

But, for that to work, the LoopStackIdx must be adjusted to point to the outer loops data.

User4martin commented 1 year ago

Ok, this is way more complex than it initially looked like.... But solvable.

I explain one of the issues now, because I have tracked it now, just because I then don't have to do later.... (to explain the upcoming changeset).

A simplified example:

RegEx 'Aa(x|(x|y){3,4}?){1,3}?B' Text 'AayyyyyyyyyyB' // 10 y

 1: BRANCH (131)
 6: EXACTLY (15) Aa
15: LOOPENTRY (110)  OUTER LOOP
20:  OPEN[1] (25)
25:   BRANCH (38)
30:   EXACTLY (105) x
38:   BRANCH (105)
43:   EXACTLY (51) y
51:   LOOPENTRY (92)  INNER LOOP
56:    OPEN[2] (61)
61:     BRANCH (74)
66:     EXACTLY (87) x
74:     BRANCH (87)
79:     EXACTLY (87) y
87:    CLOSE[2] (92)
92:   LOOPNG (105)  -> (56) {3,4}  INNER LOOP
105:  CLOSE[1] (110)
110: LOOPNG (123)  -> (20) {1,3}  OUTER LOOP
123: EXACTLY (131) B
131: END (0)

In the below each line is a nested call of the previous. (no backtracing in this simplified example). So this is all going deeper and deeper on the stack (because it may need the option to backtrace if the text was different)

Each time on instruction 92, when the inner loop needs to check, if the remaining pattern (including outstanding loops of the outer loop) could match, the inner loop calls MatchPrim(105) => which goes to the outer loop on 110.

There is only one instance of the outer loop, it always needs LoopStackIdx = 1. There are 3 instances of inner loops, with different LoopStackIdx. So the outer LoopStackIdx can not be calculated from that.

Trying to achieve this with a "LoopStackIdx" may be possible, but will be hard...

I have a different idea...


Also the "not that complex" pattern 'Aa(x|y(x|y(x|y){3,4}?){3,4}?){3,4}?' needs more than the LoopStackMax=10 Depending on what is matched, there can be 4 instances of the middle loop, which can each have 4 of the inner => that would be 17. Add a copy of the pattern at the end, and it will be 34.


I have a good idea how to solve that...

Alexey-T commented 1 year ago

@User4martin Let's close?

User4martin commented 1 year ago

Wait for the PR?

But if you prefer, happy if you close it before