andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Implement \G #297 #303

Closed User4martin closed 1 year ago

User4martin commented 1 year ago

See #297

This is based on #300 as some changes affect nearby lines and lead to conflicts if individually applied.

Once #300 is merged, only one commit should remain to be merged.

Alexey-T commented 1 year ago

if #303 has #300 as its part, let's close #300 ?

Alexey-T commented 1 year ago

fInputContinue: PRegExprChar; // pointer to first char of input string - misleading comment.

Alexey-T commented 1 year ago
  1. Please fill the history.txt, and adjust the version to 1.160 (in the unit + history.txt)
  2. Pls test how the demo "test_dlg" shows opcode with \G
Alexey-T commented 1 year ago

Thanks!

  Ptr := fInputStart + AOffset - 1;
  fInputContinue := Ptr;

if I got it, you set fInputContinue even on failing execs, but we must set it only on successfull exec.

User4martin commented 1 year ago

Please check the history, if entries are ok.

Sorry about the comment, copy and paste is still a thing ;)

test_dlg now shows "Not fixed op: CONTINUE_POS"

Alexey-T commented 1 year ago

Pls also comment 2 existing UniqueString calls.

User4martin commented 1 year ago

if https://github.com/andgineer/TRegExpr/pull/303 has https://github.com/andgineer/TRegExpr/pull/300 as its part, let's close https://github.com/andgineer/TRegExpr/pull/300 ?

If its fine to merge both in one, yes. (I.e. if you have no rule to merge those features separately)

Alexey-T commented 1 year ago

Its fine to merge both in one. @andgineer is busy man - he won't be happy to dig all this

Alexey-T commented 1 year ago

Also: it's good to add \G to official docs, also in this repo. pls add here: https://regex.sorokin.engineer/en/latest/regular_expressions.html#boundaries

User4martin commented 1 year ago

Thanks!

  Ptr := fInputStart + AOffset - 1;
  fInputContinue := Ptr;

if I got it, you set fInputContinue even on failing execs, but we must set it only on successfull exec.

This is set before the regex is run.

It is either the value given by the user re.Exec(MyOffset)

Or it is given by ExecNext, which already called ExecPrim with the offset from the last match (I have not changed that / the offset was stored on GrpBounds[0].GrpEnd[0] and will be stored into fInputContinue before we start the next match)

User4martin commented 1 year ago

Pls also comment 2 existing UniqueString calls.

Done in ad665552d93c00d4bdc8de0ba3843bc47760564d

Alexey-T commented 1 year ago

OP_CONTINUE_POS = Succ(OP_SUBCALL_LAST); won't work for undefined UnicodeRE because

  OP_SUBCALL_LAST =
    {$IFDEF UnicodeRE}
    TReOp(Ord(OP_SUBCALL) + RegexMaxGroups - 1);
    {$ELSE}
    High(REChar); // must fit to 0..255 range
    {$ENDIF}

so, insert new opcode in the middle, after the OP_RECUR. value 49 is good. seems OP_OPEN (=50) is not needing the change.

User4martin commented 1 year ago

Also: it's good to add \G to official docs, also in this repo. pls add here: https://regex.sorokin.engineer/en/latest/regular_expressions.html#boundaries

Done / not familiar with the markup used. Please check.

Also the section is called "Line Boundaries" and starts with". any char" ? (anyway that is an issue of its own, if considered an issue )

Alexey-T commented 1 year ago

oh, value 26 is good too!

 OP_EOL2 = TReOp(25); // like OP_EOL but also matches before final line-break
  OP_BSUBEXP = TREOp(28);
Alexey-T commented 1 year ago
`\G``        zero-length match at the end pos of the previous match. For ExecNext and Exec(Offset)

here is docs for usual users of programs, so the ending . For ExecNext and Exec(Offset) is not OK, remove it.

Alexey-T commented 1 year ago

Also the section is called "Line Boundaries" and starts with". any char" ? (anyway that is an issue of its own, if considered an issue )

right. please remove that line from the topic.

User4martin commented 1 year ago
`\G``        zero-length match at the end pos of the previous match. For ExecNext and Exec(Offset)

here is docs for usual users of programs, so the ending . For ExecNext and Exec(Offset) is not OK, remove it.

Ah, ok. I this case would this be good?

`\G`` zero-length match at the end pos of the previous match when using the g (global) modifier.

Alexey-T commented 1 year ago

when using the g (global) modifier. is not OK ending- \G works always, we don't check 'g' modifier.

Alexey-T commented 1 year ago

when using the g (global) modifier. is not OK ending- \G works always, we don't check 'g' modifier.

Right, but when using the feature without writing the pascal code

does code check existing of 'g' modifier? no. so why to write this ending?

User4martin commented 1 year ago

when using the g (global) modifier. is not OK ending- \G works always, we don't check 'g' modifier.

Right, but when using the feature without writing the pascal code

does code check existing of 'g' modifier? no. so why to write this ending?

I hit the send button by accident.

Alexey-T commented 1 year ago

parasitic dot is not needed

re

User4martin commented 1 year ago

Also the section is called "Line Boundaries" and starts with". any char" ? (anyway that is an issue of its own, if considered an issue )

right. please remove that line from the topic.

It then needs to go elsewhere.

For now I created a new issue, so it can be addressed on its own.

Alexey-T commented 1 year ago

you did good work! I don't have powers today to make all these testings in the test-project.

User4martin commented 1 year ago

Btw, I did test the new StrLPos, but since it is hardly going to change, I did not write a test-case for it.

But for peace of mind, this is what I tested

a := 'abcdefaxdef';
writeln(StrLPos( PRegExprChar(a), PRegExprChar('a'), 0,1  ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('a'), 1,1  ) - PRegExprChar(a) ); //ok 
writeln(StrLPos( PRegExprChar(a), PRegExprChar('a'), 2,1  ) - PRegExprChar(a) ); // ok
writeln;
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ab'), 0 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ab'), 1 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ab'), 2 ,2 ) - PRegExprChar(a) ); // ok
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ab'), 7 ,2 ) - PRegExprChar(a) ); // ok
writeln;
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 0 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 1 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 2 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 7 ,2 ) - PRegExprChar(a) ); // to short
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 8 ,2 ) - PRegExprChar(a) ); // ok
writeln(StrLPos( PRegExprChar(a), PRegExprChar('ax1'), 9 ,2 ) - PRegExprChar(a) ); //ok
writeln;
// all below do not find anything
writeln(StrLPos( PRegExprChar(a), PRegExprChar('fd'), 0 ,2 ) - PRegExprChar(a) );
writeln(StrLPos( PRegExprChar(a), PRegExprChar('fd'), 1 ,2 ) - PRegExprChar(a) );
writeln(StrLPos( PRegExprChar(a), PRegExprChar('fd'), 6 ,2 ) - PRegExprChar(a) );
writeln(StrLPos( PRegExprChar(a), PRegExprChar('fd'), 7 ,2 ) - PRegExprChar(a) );
writeln;
Alexey-T commented 1 year ago

Can we have 'strLpos' name in good casing? and formatting of StrLScan with indent=2spaces?

Alexey-T commented 1 year ago

Thanks

while (count < len) and (P[count] <> #0) do

StrLScan fails on chr0. But regex engine handles chr0 OK, so if you can improve it..

Alexey-T commented 1 year ago
Alexey-T commented 1 year ago

great!

Alexey-T commented 1 year ago

lines are not needed in StrLPos

    if p = nil then
       exit;
Alexey-T commented 1 year ago

you replaced

    if Pos(regMustString, fInputString) = 0 then Exit;

(check for NOT FOUND) to

if StrLPos(fInputStart, PRegExprChar(regMustString), fInputEnd - fInputStart, length(regMustString)) <> nil then
        exit;

(check for FOUND?)

it's OK? I mean <> nil

User4martin commented 1 year ago

(check for FOUND?) it's OK? I mean <> nil

I tried to make a test case for it, but I could not make a pattern that would trigger FLAG_SPECSTART.

Alexey-T commented 1 year ago
Alexey-T commented 1 year ago

ExecPrim's SlowChecks is False in some wrappers

function TRegExpr.Exec(const AInputString: RegExprString): boolean;
begin
  InputString := AInputString;
  Result := ExecPrim(1, False, False, False);
end; { of function TRegExpr.Exec
  -------------------------------------------------------------- }

{$IFDEF OverMeth}
function TRegExpr.Exec: boolean;
var
  SlowChecks: boolean;
begin
  SlowChecks := fInputEnd - fInputStart < fSlowChecksSizeMax;
  Result := ExecPrim(1, False, SlowChecks, False);
end; { of function TRegExpr.Exec
  -------------------------------------------------------------- }

function TRegExpr.Exec(AOffset: integer): boolean;
begin
  Result := ExecPrim(AOffset, False, False, False);
end; { of function TRegExpr.Exec
  -------------------------------------------------------------- }
{$ENDIF}   
Alexey-T commented 1 year ago

Based on this info, can you please add testcase with regex \w*abcd and SlowChecks=True , testcase to run with regMustString.

Alexey-T commented 1 year ago

Thank you

Alexey-T commented 1 year ago

fInputContinue. when it is custom (not automatic)? ExecPrim has

  Ptr := fInputStart + AOffset - 1;
  fInputContinue := Ptr;  

and all Exec* funcs call ExecPrim. so, in all cases - fInputContinue is set to automatic value? so, even if

procedure TRegExpr.SetInputRange(AStart, AEnd, AContinueAnchor: PRegExprChar);
begin
  fInputString := '';
  fInputStart := AStart;
  fInputEnd := AEnd;
  fInputContinue := AContinueAnchor;
end;

is called, ExecPrim overrides custom fInputContinue value?

User4martin commented 1 year ago

fInputContinue. when it is custom (not automatic)? ExecPrim has

  Ptr := fInputStart + AOffset - 1;
  fInputContinue := Ptr;  

and all Exec* funcs call ExecPrim. so, in all cases - fInputContinue is set to automatic value? so, even if

"Custom" means AOffset. => AOffset can be given by the user (the person writing the Pascal code controlling the regex).

The user can give it via:

    function Exec(AOffset: integer): boolean; overload;
    function ExecPos(AOffset: integer {$IFDEF DefParam} = 1{$ENDIF}): boolean;

And for modifier "g", if you call ExecNext then GrpBounds[0].GrpEnd[0] is passed as AOffset.

\G is the position from which matching was continued (or started).

procedure TRegExpr.SetInputRange(AStart, AEnd, AContinueAnchor: PRegExprChar);
begin
  fInputString := '';
  fInputStart := AStart;
  fInputEnd := AEnd;
  fInputContinue := AContinueAnchor;
end;

is called, ExecPrim overrides custom fInputContinue value?

SetInputRange is private, and is only called on fHelper used for look-behind. It follows the settings on the main instance.

Alexey-T commented 1 year ago
procedure TRegExpr.SetInputRange(AStart, AEnd, AContinueAnchor: PRegExprChar);
begin
  fInputString := '';
  fInputStart := AStart;
  fInputEnd := AEnd;
  fInputContinue := AContinueAnchor;
end;

if we avoid setting fInputContinue here ^^^, what do we miss?

User4martin commented 1 year ago

Normally if I continue to match (ExecNext / Exec(AOffset) the pattern can match at AOffset or anywhere after AOffset.

If you have a pattern, that you use to parse some text \s*(var|const|type then this can skip none matching content. If you don't wont it to skip anything, you do \G\s*(var|const|type ( ^ wont do, as it is not the start of line, but the start of the continuation match)

User4martin commented 1 year ago

if we avoid setting fInputContinue here ^^^, what do we miss?

(?<!\G(?:var|const|type).*)\s+(.*) match only if the continuation start pos did not begin with var/const or type.

Or

(?<!(?:var|const|type).*\Gc*)\s+(.*) match only if not BEFORE the continuation start pos there had been var/const or type.

Alexey-T commented 1 year ago

yes, I got idea of these 2 regexes. but again, you didnot answer:

this line fInputContinue := AContinueAnchor; is ignored, because the ExecPrim (called in Exec*) will overwrite the fInputContinue, yes?

User4martin commented 1 year ago

yes, I got idea of these 2 regexes. but again, you didnot answer:

this line fInputContinue := AContinueAnchor; is ignored, because the ExecPrim (called in Exec*) will overwrite the fInputContinue, yes?

Ah, I see. I was wondering if I was missing something. The answer I gave was way to obvious.

So here is the answer.

No it does not overwrite it.

The only code calling it is

function TRegExpr.MatchAtOnePos(APos: PRegExprChar): boolean;
begin
...
      fHelper.SetInputRange(APos - fHelperLen, APos, fInputContinue);
      if fHelper.MatchAtOnePos(APos - fHelperLen) then

fHelper does not have a further helper, and goes to

function TRegExpr.MatchAtOnePos(APos: PRegExprChar): boolean;
begin
...
  Result := MatchPrim(regCodeWork);

MatchPrim skips ExecPrim

Alexey-T commented 1 year ago

now I see; thanks. I forgot about that fHelper code.

Alexey-T commented 1 year ago

@User4martin You are talent! thanks!

https://learn.microsoft.com/en-us/dotnet/standard/base-types/backtracking-in-regular-expressions Let's rename IDs:

FIsBactracingAtomic -> GrpBacktrackingIsAtomic savedBackracingAtomic -> savedBacktrackingIsAtomic

Alexey-T commented 1 year ago

Increase the version to 1.161, and add History.txt info about this fix, please

User4martin commented 1 year ago

Let's rename IDs: FIsBactracingAtomic -> GrpBacktrackingIsAtomic savedBackracingAtomic -> savedBacktrackingIsAtomic

Before I do, I think the new names are misleading at least... But mine aren't better.

This is not about the group being atomic (well by implication it is). But in an atomic group regular backtracking can happen before a first match is found.

What it should say is "The entire group is (to be) backtracked as one entity" => "The group is backtracked as a single atom"

So then it would be IsBacktrackingGroupAsAtom (with/without F not sure when that should be prefixed). (and savedIsBacktrackingGroupAsAtom)

Alexey-T commented 1 year ago

So then it would be IsBacktrackingGroupAsAtom (with/without F not sure when that should be prefixed). (and savedIsBacktrackingGroupAsAtom)

Okay.

Alexey-T commented 1 year ago

Comment this var savedIsBacktrackingGroupAsAtom: boolean;

Alexey-T commented 1 year ago

Pls change in test_dlg/unit1.pas this

    if reg.IsFixedLength(op, i) then
      ListDump.Items.Add('Fixed length: '+IntToStr(i))
    else
      ListDump.Items.Add('Not fixed op: '+reg.DumpOp(op));

to this

    if reg.IsFixedLength(op, i) then
      ListDump.Items.Add('Match has fixed len: '+IntToStr(i))
    else
      ListDump.Items.Add('Match has not fixed len because of opcode '+reg.DumpOp(op));
Alexey-T commented 1 year ago

New test for 'bads' now produces exception, it is OK, but better comment this test to avoid exceptions in test-app.