Closed User4martin closed 1 year ago
w/out thinking I changed lower index of this type to 0
type
TRegExprLoopStack = array [0 .. LoopStackMax] of integer;
now demo test_dlg handles it...
Looking at
LoopStackIdx: integer; // 0 - out of all loops
I don't think that is correct.
It gets increased in
OP_LOOPENTRY:
begin // ###0.925
no := LoopStackIdx;
Inc(LoopStackIdx);
OP_LOOP, OP_LOOPNG: increases and decreases it.
But that is as far as I looked.
But just to prove the point: I changed it to start at 0, and then run the below.
RegEx A(?>(?:b|c(?:d|e(?:f|g)*?)*?)*?_)a
Text: Acece_x_Acecegg_a_
The index goes to -1
Interesting is, that it does not crash, if the text has a single "g", only with the 2nd "g".
Maybe that holds some clue. But currently got other stuff to do.
Simple fix: after
Dec(LoopStackIdx);
add in 2 places
if LoopStackIdx < 1 then
LoopStackIdx := 1;
Then it does not match https://regex101.com/r/7FAwus/1
A(?>(?:b|c(?:d|e(?:f|g){0,2}?){1,1}?){2,2}?_)a
Acece_x_Acecegg_a_
Not sure if that is the same bug, but I thing so because
A(?>(?:b|c(?:d|e(?:f|g){0,2}?){0,1}?){2,2}?_)a
does match.
I only changed {1,1}
to {0,1}
Despite the sequence starting with e
exists exactly 2 times.
But it was logical that it would happen, if the index gets decreased wrongly, then it does not only go out of range, at some time it also points to the wrong entry.
Problem is, it is easy to make it fail (based on the observations I have). But it is far from easy to track it. ("not easy" => "time consuming")
maybe: you added additional Exit's for the OP_STAR, but did not add them for OP_LOOPNG - ie for {0,1}?
.
no - if I change 'atomic' grp to 'normal', result don't change.
I might have an idea what the issue is. Need to do some more checking though...
So this were actually two issues.
1)
LoopStackIdx
is incremented once in OP_LOOPENTRY
When iterating a loop, each OP_LOOP makes a recursive call. Each recursive call decrements the LoopStackIdx
. That is far more often than it got incremented.
To correct for that, nested calls are wrapped. They store the LoopStackIdx
and restore it (undoing the nested decrement, allowing themself to decrement again).
Some calls were not guarded. Fixed that.
A note on the side.
dec(LoopStackIdx
removed. It returns a wrong value anyway, and the caller does fix it anyway.2)
When a loop tests the if at its current iteration, the pattern after the loop will be a match (and the loop therefore would have found a success), then the call to test "the pattern after" will run through the OP_LOOP of outer loops.
This is actually correct. It does actually test if the current loop, at the current iteration is fine for any iteration of the outer (that is: "not just the outers current").
But, for that to work, the LoopStackIdx
must be adjusted to point to the outer loops data.
Ok, this is way more complex than it initially looked like.... But solvable.
I explain one of the issues now, because I have tracked it now, just because I then don't have to do later.... (to explain the upcoming changeset).
A simplified example:
RegEx 'Aa(x|(x|y){3,4}?){1,3}?B' Text 'AayyyyyyyyyyB' // 10 y
1: BRANCH (131)
6: EXACTLY (15) Aa
15: LOOPENTRY (110) OUTER LOOP
20: OPEN[1] (25)
25: BRANCH (38)
30: EXACTLY (105) x
38: BRANCH (105)
43: EXACTLY (51) y
51: LOOPENTRY (92) INNER LOOP
56: OPEN[2] (61)
61: BRANCH (74)
66: EXACTLY (87) x
74: BRANCH (87)
79: EXACTLY (87) y
87: CLOSE[2] (92)
92: LOOPNG (105) -> (56) {3,4} INNER LOOP
105: CLOSE[1] (110)
110: LOOPNG (123) -> (20) {1,3} OUTER LOOP
123: EXACTLY (131) B
131: END (0)
In the below each line is a nested call of the previous. (no backtracing in this simplified example). So this is all going deeper and deeper on the stack (because it may need the option to backtrace if the text was different)
Each time on instruction 92, when the inner loop needs to check, if the remaining pattern (including outstanding loops of the outer loop) could match, the inner loop calls MatchPrim(105) => which goes to the outer loop on 110.
There is only one instance of the outer loop, it always needs LoopStackIdx = 1. There are 3 instances of inner loops, with different LoopStackIdx. So the outer LoopStackIdx can not be calculated from that.
Trying to achieve this with a "LoopStackIdx" may be possible, but will be hard...
I have a different idea...
Also the "not that complex" pattern
'Aa(x|y(x|y(x|y){3,4}?){3,4}?){3,4}?'
needs more than the LoopStackMax=10
Depending on what is matched, there can be 4 instances of the middle loop, which can each have 4 of the inner => that would be 17.
Add a copy of the pattern at the end, and it will be 34.
I have a good idea how to solve that...
@User4martin Let's close?
Wait for the PR?
But if you prefer, happy if you close it before
regex
A(?>(?:b|c(?:d|e)*?)*?_)a
textAcece_x_Acece_a_
produce a range check error
LoopStackIdx
= 0 forTRegExprLoopStack = array [1 .. LoopStackMax] of integer;
revision 39aaae8d37d3399a04f84c14cd8b6c0d38abfdf6