andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License

Review of `regMustString` // follow up #299 #301

Closed User4martin closed 1 year ago

User4martin commented 1 year ago

In #299 I made the following observation:

Is the Pos(regMustString, fInputString) call wrong (or at least inefficient)? Unless regLookbehind = true (and even then I don't know whether regMustString can come from a look-behind), regMustString must appear after AOffset.

Reply by @Alexey-T

regMustString must be found in the char (i.e. TRegExprChar) buffer between fInputStart and fInputEnd. PosEx() is good because it takes a StartPos param, but we also need an EndPos param for our search.

With strLPos from the pull request #300, this could be easily done, once the PR is applied.
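To illustrate what a search with both a start and an end bound could look like, here is a minimal sketch. `PosInRange` is a hypothetical helper written for this example; it is not the `strLPos` from PR #300 (whose signature is not shown in this thread), and it works on a plain string with 1-based indices rather than on the fInputStart/fInputEnd pointers.

```pascal
program MustStringSearch;
{$mode objfpc}{$H+}

// Hypothetical helper (not TRegExpr's strLPos): returns the 1-based position
// of the first occurrence of Needle that lies entirely inside
// Hay[StartPos..EndPos], or 0 if there is none.
function PosInRange(const Needle, Hay: string; StartPos, EndPos: Integer): Integer;
var
  i, j, Last: Integer;
begin
  Result := 0;
  if Needle = '' then
    Exit;
  // Clamp the bounds to the actual string.
  if StartPos < 1 then
    StartPos := 1;
  if EndPos > Length(Hay) then
    EndPos := Length(Hay);
  // Last start index at which Needle still fits before EndPos.
  Last := EndPos - Length(Needle) + 1;
  for i := StartPos to Last do
  begin
    j := 1;
    while (j <= Length(Needle)) and (Hay[i + j - 1] = Needle[j]) do
      Inc(j);
    if j > Length(Needle) then
      Exit(i);
  end;
end;

begin
  WriteLn(PosInRange('foo', 'abcd12foo9', 1, 10)); // 7: found in the full range
  WriteLn(PosInRange('foo', 'abcd12foo9', 8, 10)); // 0: search starts after 'foo'
  WriteLn(PosInRange('foo', 'abcd12foo9', 1, 8));  // 0: 'foo' does not end before EndPos
end.
```

The third call shows why PosEx() alone is not enough: without an end bound, a literal that extends past the end of the searched range would still count as found.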

I tested (?<=abcd)[0-9]*foo.*[0-9]: it uses foo as regMustString. It does not consider abcd (even though that also must be present, is longer, and must be found at an earlier location).

If this is ever changed to consider abcd, then AOffset can only be added to the start position of the search if not regLookbehind.

Alexey-T commented 1 year ago

(?<=abcd)[0-9]*foo.*[0-9]

regMustString is 'foo' and that is OK. 'abcd' must occur, but it occurs with some gap from 'foo'. Yes, we need the presence of both 'abcd' and 'foo', but do you want to add a regMustString2? Without a 2nd variable we cannot require the presence of 'abcd'.

User4martin commented 1 year ago

Without a 2nd variable we cannot require the presence of 'abcd'.

Actually, regMustString is described as "find the longest literal string that must appear and make it the regMust".

So there is no need for a 2nd variable. abcd is the longest (in the example it is longer than foo).

The issue is that the code currently does not search smartly enough, so it does not find abcd. => If it ever did find it, it would simply follow the rule of the longest string.

In that case, it would be unknown whether the regMustString came from a look-behind or not (that could be stored, of course).

If it is unknown, it still works: the "if not regLookbehind" check would assume the worst case, i.e. that the string is (or could be) from the look-behind.
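The worst-case rule can be sketched as follows. This is only an illustration under stated assumptions: `MustSearchStart` is a hypothetical function, the offset is assumed to be 1-based, and `PosEx` from Free Pascal's `StrUtils` unit stands in for whatever search the engine actually uses.

```pascal
program MustStartDemo;
{$mode objfpc}{$H+}
uses
  StrUtils;

// Hypothetical sketch: choose where the regMustString pre-check starts.
// If the literal may come from a look-behind, it can lie before AOffset,
// so the search has to start at the beginning of the input.
function MustSearchStart(AOffset: Integer; MayBeInLookbehind: Boolean): Integer;
begin
  if MayBeInLookbehind then
    Result := 1
  else
    Result := AOffset;
end;

const
  Input = 'abcd12foo9';
begin
  // Suppose matching starts at offset 5, right after the look-behind text.
  WriteLn(PosEx('abcd', Input, MustSearchStart(5, True)));  // 1: found before AOffset
  WriteLn(PosEx('abcd', Input, MustSearchStart(5, False))); // 0: would wrongly reject
  WriteLn(PosEx('foo', Input, MustSearchStart(5, False)));  // 7: safe without look-behind
end.
```

The second call is the case to avoid: skipping ahead to AOffset while the literal comes from the look-behind would make the pre-check report a miss for an input that can match.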

Alexey-T commented 1 year ago

Yes, right. It is interesting that the code doesn't find regMust in the lookbehind. The lookbehind part is just a 'group' for the parser; when everything is found, the parser adjusts the result:

    // with lookbehind, increase found position by the len of group=1
    if regLookbehind then
      Inc(GrpBounds[0].GrpStart[0], GrpBounds[0].GrpEnd[1] - GrpBounds[0].GrpStart[1]);

Alexey-T commented 1 year ago

@User4martin Let's close?