andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License

Allow to set fInputStart and fInputEnd #299

Closed User4martin closed 1 year ago

User4martin commented 1 year ago

An application may have a long string and need to apply a pattern to a substring, such that ^ and $ match the beginning/end of the substring.

Currently the application must make a copy of the substring: RegEx.InputString := copy(text, a, b); It would be nice to save the time spent copying strings around.

Alternatively, TRegExpr could allow setting InputString from a PChar/PWideChar, with an explicit length (instead of looking for a terminating #0).
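
For illustration, the current workaround could look roughly like the sketch below. It assumes the RegExpr unit that ships with Free Pascal; the helper name SubstringIsNumber and the sample buffer are made up for this example. The point is that Copy() allocates a new string just so that ^ and $ anchor at the substring's edges.

```pascal
program SubstringMatch;

{$mode objfpc}{$H+}

uses
  RegExpr; // TRegExpr unit as shipped with Free Pascal

// Hypothetical helper (not part of TRegExpr): True if the substring
// ABuffer[AStart .. AStart+ALen-1] consists of digits only.
function SubstringIsNumber(const ABuffer: string; AStart, ALen: Integer): Boolean;
var
  RE: TRegExpr;
begin
  RE := TRegExpr.Create;
  try
    RE.Expression := '^\d+$';
    // Copy() creates a separate string, so ^ and $ refer to the
    // boundaries of the substring rather than of the whole buffer.
    Result := RE.Exec(Copy(ABuffer, AStart, ALen));
  finally
    RE.Free;
  end;
end;

var
  Buffer: string;
begin
  Buffer := 'id=12345;name=foo';
  WriteLn(SubstringIsNumber(Buffer, 4, 5)); // substring '12345'
  WriteLn(SubstringIsNumber(Buffer, 1, 8)); // substring 'id=12345'
end.
```

The extra allocation per call is what this issue asks to avoid.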

Alexey-T commented 1 year ago

We have these private methods:

procedure TRegExpr.SetInputString(const AInputString: RegExprString);
begin
  ClearMatches;

  fInputString := AInputString;
  UniqueString(fInputString);

  fInputStart := PRegExprChar(fInputString);
  fInputEnd := fInputStart + Length(fInputString);
end;

procedure TRegExpr.SetInputRange(AStart, AEnd: PRegExprChar);
begin
  fInputString := '';
  fInputStart := AStart;
  fInputEnd := AEnd;
end;

So I suggest adding a public SetInput method: SetInput(const AInputString: RegExprString; AStartPos, ALen: integer); Can you make the patch and test it?

Please use our current coding style.
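
As a rough sketch of the index-to-pointer translation such a method would need: the names TInputRange and MakeRange below are hypothetical and not part of TRegExpr, plain string/PChar stand in for RegExprString/PRegExprChar, and the clamping policy for out-of-range arguments is an assumption.

```pascal
program InputRangeSketch;

{$mode objfpc}{$H+}

type
  // Hypothetical record: half-open character range [Start, Stop).
  TInputRange = record
    Start, Stop: PChar;
  end;

// Translates a 1-based start position and a length into a pointer range
// inside S, clamping both so the range never leaves the string.
function MakeRange(const S: string; AStartPos, ALen: Integer): TInputRange;
var
  MaxLen: Integer;
begin
  if AStartPos < 1 then
    AStartPos := 1;
  if AStartPos > Length(S) + 1 then
    AStartPos := Length(S) + 1;
  MaxLen := Length(S) - AStartPos + 1;
  if ALen < 0 then
    ALen := 0;
  if ALen > MaxLen then
    ALen := MaxLen;
  Result.Start := PChar(S) + (AStartPos - 1);
  Result.Stop := Result.Start + ALen;
end;

var
  S: string;
  R: TInputRange;
begin
  S := 'abcdefgh';
  R := MakeRange(S, 3, 4);
  WriteLn('length: ', R.Stop - R.Start, ', first char: ', R.Start^);
  R := MakeRange(S, 7, 100); // requested length exceeds the string
  WriteLn('clamped length: ', R.Stop - R.Start);
end.
```

The pointers are only valid while the caller keeps S alive and unmodified, which is the same contract the private SetInputRange above relies on.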

User4martin commented 1 year ago

I'll have a look...

I just started checking how much fInputString is used...

Would I be right to say that

function TRegExpr.ExecPrim(AOffset: integer;
  ATryOnce, ASlowChecks, ABackward: boolean): boolean;
begin
...
  Ptr := fInputStart + AOffset - 1;

  // If there is a "must appear" string, look for it.
  if ASlowChecks then
    if regMustString <> '' then
      if Pos(regMustString, fInputString) = 0 then Exit;

the Pos(regMustString, fInputString) is wrong (inefficient)?

Unless regLookbehind = true (and even then I don't know whether regMustString can be inside a look-behind), regMustString must appear after AOffset?

Alexey-T commented 1 year ago

the regMustString must appear after AOffset?

Yes, it looks like a bug. Thanks. Can you please prepare the fix? I could fix it, but I have no time to test everything.

User4martin commented 1 year ago

I don't know, though, whether regMustString can have been found in a look-behind.

          if PREOp(scan)^ = OP_EXACTLY then

I.e. can OP_EXACTLY be found if the text is in a look-behind...

So to be careful, if there are look-behinds, the current behaviour needs to be kept.

Though if the user can set fInputEnd, and then possibly fInputStart too, then it can never be before fInputStart.

So depending on how much I do on the original issue, these two changes may go together.

Alexey-T commented 1 year ago

regMustString must be found in the char (i.e. TRegExprChar) buffer between fInputStart and fInputEnd. PosEx() is good, since it can take a StartPos param, but we also need an EndPos param for our search.
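
A search bounded on both sides could be sketched as follows. FindInRange is a hypothetical helper, not an existing TRegExpr or RTL function, and it uses plain string/PChar in place of RegExprString/PRegExprChar. It uses a naive scan to keep the example short.

```pascal
program BoundedPos;

{$mode objfpc}{$H+}

// Hypothetical helper: looks for ASub only inside the half-open
// character range [AStart, AEnd). Returns nil if it is not found.
function FindInRange(const ASub: string; AStart, AEnd: PChar): PChar;
var
  SubLen: SizeInt;
  Last: PChar;
begin
  Result := nil;
  SubLen := Length(ASub);
  if (SubLen = 0) or (AEnd - AStart < SubLen) then
    Exit;
  Last := AEnd - SubLen; // last position where ASub still fits
  while AStart <= Last do
  begin
    if CompareByte(AStart^, ASub[1], SubLen) = 0 then
      Exit(AStart);
    Inc(AStart);
  end;
end;

var
  Buffer: string;
  P, Found: PChar;
begin
  Buffer := 'abc12345xyz12';
  P := PChar(Buffer);
  // Range covers 'c1234': the must-string '12' lies inside it.
  Found := FindInRange('12', P + 2, P + 7);
  if Found <> nil then
    WriteLn('found at 1-based index ', Found - P + 1)
  else
    WriteLn('not found');
  // Range covers 'xyz1': the trailing '12' crosses the end and is ignored.
  Found := FindInRange('12', P + 8, P + 12);
  WriteLn('second search is nil: ', Found = nil);
end.
```

CompareByte compares bytes, so for a WideChar-based RegExprString the length passed to it would have to be scaled by SizeOf(WideChar).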

User4martin commented 1 year ago

Also, and this is off-topic to the original issue: in CompileRegExpr there is the loop to find regMustString

        longest := nil;
        Len := 0;
        while scan <> nil do
        begin
          if PREOp(scan)^ = OP_EXACTLY then
          begin
            longestTemp := scan + REOpSz + RENextOffSz + RENumberSz;

So other (shorter) "must exist" strings can be found before it. If they are not in a look-behind/look-ahead (if it can be known whether they are), then I would think they have to appear before the regMustString. Though I may be overlooking some detail... If that is the case, then their total length can be accumulated, and the regMustString can be looked for starting at that extra offset. This is probably not going to amount to much, but... (equally, any candidates after it that were not chosen can form an offset from the end)

If those thoughts make sense, they need to move to a new issue.

Alexey-T commented 1 year ago

Sorry, this post doesn't make sense to me; I don't get the idea. That other place, the "while scan <> nil do" loop, is better left untouched.

Alexey-T commented 1 year ago

regMustString is a plain-text string which must exist inside the match. It is formed during compilation, e.g. if we compile the regex "\b12[345]\b" then regMustString is '12'.

Alexey-T commented 1 year ago

What is the use case here? Martin, you will use it with the SynEdit app. SynEdit has a collection of lines. You will need to convert it to a single string, OK? A UnicodeString (to match TRegExprString). So why do you need InputStart/End here? You can always pass the entire buffer string, like I do in my CudaText.

User4martin commented 1 year ago

I am looking at writing an engine for TextMate grammar highlighting.

The use case is speed improvement.

The specifics are:

This saves calling copy(), and making an extra copy of that part of the text.

The gain is dampened by TRegExpr calling UniqueString, which still makes a copy. But currently that is a 2nd copy, on top of the one made by copy().

procedure TRegExpr.SetInputString(const AInputString: RegExprString);
begin
  ClearMatches;

  fInputString := AInputString;
  UniqueString(fInputString);

Even with a const param, AInputString has a refcount of at least 1 (the caller is holding it; without "const" it would be at least 2). So after the assignment to fInputString the string is shared, and UniqueString has to copy it.

Alexey-T commented 1 year ago

I will remove (comment out) UniqueString: 2 calls in the unit. I don't recall how they helped me.

User4martin commented 1 year ago

I will remove (comment out) UniqueString: 2 calls in the unit. I don't recall how they helped me.

OK, thanks. Just to confirm, I did a few tests (with all 3 of them removed; the 3rd is in my pull request).

1) All tests passed. 2) None of the strings were changed

AreEqual('unchanged', T.InputText, RE.InputString);
AreEqual('unchanged', T.Expression, RE.Expression);

The only reason I can think of why UniqueString might be needed would be if the regex engine were modifying the input string (or expression string).
=> In that case, the string would move and the PChar would point to invalid data.

Or if the change was made via a PChar, then the string would not trigger copy-on-write, and other holders (of a reference to this text) would have their content changed.

But I could not find either of those happening.


If the string is changed outside the regex engine while fInputString refers to it, that is no problem and does not need UniqueString.
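
The copy-on-write behaviour discussed here can be seen in a small standalone sketch. It uses only Free Pascal RTL routines (StringOfChar, UniqueString); the helper SharesBuffer is made up for this example, and nothing in it is TRegExpr code.

```pascal
program UniqueStringDemo;

{$mode objfpc}{$H+}

// Hypothetical helper: True if both strings point at the same buffer.
function SharesBuffer(const A, B: string): Boolean;
begin
  Result := Pointer(A) = Pointer(B);
end;

var
  A, B: string;
begin
  A := StringOfChar('x', 5); // heap-allocated, reference-counted string
  B := A;                    // B shares A's buffer; nothing is copied yet
  WriteLn('shared before UniqueString: ', SharesBuffer(A, B));

  UniqueString(B);           // refcount > 1, so B gets a private copy
  WriteLn('shared after UniqueString:  ', SharesBuffer(A, B));

  // A write through a raw pointer does not trigger copy-on-write.
  // Because B is already unique, only A changes here; had B still
  // shared the buffer, it would have changed as well.
  PChar(Pointer(A))^ := 'y';
  WriteLn(A, ' ', B);
end.
```

Since the engine only reads through its pointers, the shared case is harmless for it, and the copy made by UniqueString is pure overhead.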

Alexey-T commented 1 year ago

The engine does not change the InputString, nor the RegEx string. Please comment out all UniqueStrings in your PR.

Please make the PR to my repo: https://github.com/Alexey-T/TRegExpr

User4martin commented 1 year ago

Please comment out all UniqueStrings in your PR.

I commented out the one I added. Or shall I comment out the pre-existing ones too?

Alexey-T commented 1 year ago

I will remove them later. It's OK for now.

Alexey-T commented 1 year ago

@User4martin Let's close the topic?