Closed User4martin closed 1 year ago
we have private methods
procedure TRegExpr.SetInputString(const AInputString: RegExprString);
begin
  ClearMatches;
  fInputString := AInputString;
  UniqueString(fInputString);
  fInputStart := PRegExprChar(fInputString);
  fInputEnd := fInputStart + Length(fInputString);
end;
procedure TRegExpr.SetInputRange(AStart, AEnd: PRegExprChar);
begin
  fInputString := '';
  fInputStart := AStart;
  fInputEnd := AEnd;
end;
so i suggest to add SetInput public method:
SetInput(const AInputString: RegExprString; AStartPos, ALen: integer);
can you make the patch and test it?
pls use our current coding style.
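A rough sketch of what such a SetInput might look like, for discussion only. TInputHolder is a hypothetical stand-in that models just the three input fields shown above; the clamping rules, the Window helper, and the aliases RegExprString = UnicodeString / PRegExprChar = PWideChar are assumptions, not the unit's actual code.

```pascal
program SetInputSketch;
{$mode objfpc}{$H+}

type
  // Assumption: stand-ins for the aliases declared in the regexpr unit.
  RegExprString = UnicodeString;
  PRegExprChar = PWideChar;

  // Hypothetical reduced model of the input fields of TRegExpr.
  TInputHolder = class
  private
    fInputString: RegExprString;
    fInputStart, fInputEnd: PRegExprChar;
  public
    procedure SetInput(const AInputString: RegExprString;
      AStartPos, ALen: integer);
    function Window: RegExprString; // illustration only
  end;

procedure TInputHolder.SetInput(const AInputString: RegExprString;
  AStartPos, ALen: integer);
var
  MaxLen: integer;
begin
  // Keep a reference so the buffer stays alive while the pointers are used.
  fInputString := AInputString;
  // Clamp the 1-based start position and the length to the string bounds.
  if AStartPos < 1 then
    AStartPos := 1;
  if AStartPos > Length(fInputString) + 1 then
    AStartPos := Length(fInputString) + 1;
  MaxLen := Length(fInputString) - AStartPos + 1;
  if ALen < 0 then
    ALen := 0;
  if ALen > MaxLen then
    ALen := MaxLen;
  fInputStart := PRegExprChar(fInputString) + (AStartPos - 1);
  fInputEnd := fInputStart + ALen;
end;

function TInputHolder.Window: RegExprString;
begin
  // Copy out the range between the two pointers, just to display it.
  SetString(Result, fInputStart, fInputEnd - fInputStart);
end;

var
  H: TInputHolder;
begin
  H := TInputHolder.Create;
  try
    // Start at the 7th char, 5 chars long: the window covers 'world'.
    H.SetInput('hello world', 7, 5);
    WriteLn(H.Window);
  finally
    H.Free;
  end;
end.
```

In the real class, ClearMatches would presumably be called first, as SetInputString does.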
I'll have a look...
I've just started to see how much FInputString is used...
Would I be right to say that
function TRegExpr.ExecPrim(AOffset: integer;
  ATryOnce, ASlowChecks, ABackward: boolean): boolean;
begin
  ...
  Ptr := fInputStart + AOffset - 1;
  // If there is a "must appear" string, look for it.
  if ASlowChecks then
    if regMustString <> '' then
      if Pos(regMustString, fInputString) = 0 then Exit;
the Pos(regMustString, fInputString) is wrong (inefficient)?
Unless regLookbehind = true (and even then I don't know if 'regMustString' can be in a look-behind), the regMustString must appear after AOffset?
the regMustString must appear after AOffset?
yes, it looks like a bug. thanks. can you pls, prepare the fix? i can fix but no time to test all stuff.
I don't know, though, whether the regMustString can have been found in a look-behind.
if PREOp(scan)^ = OP_EXACTLY then
I.e. can OP_EXACTLY be found if the text is in a look-behind...
So to be careful, if there are look-behinds, the current behaviour needs to be kept.
Though if the user can set fInputEnd, and then possibly fInputStart too, then it can never be before fInputStart.
So depending on how much I do on the original issue, this may go together with it.
regMustString must be found in char (ie TRegExprChar) buffer between fInputStart and fInputEnd. PosEx() is good, it can use StartPos param, but we also need EndPos param for our search.
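A minimal sketch of such a bounded search, assuming PRegExprChar is PWideChar. FindInRange is a hypothetical helper, not part of TRegExpr; it uses FPC's CompareByte and a plain linear scan over the half-open range [AStart, AEnd).

```pascal
program BoundedPosSketch;
{$mode objfpc}{$H+}

// Hypothetical helper: returns a pointer to the first occurrence of ASub
// that lies completely inside [AStart, AEnd), or nil if there is none.
function FindInRange(const ASub: UnicodeString;
  AStart, AEnd: PWideChar): PWideChar;
var
  SubLen: SizeInt;
  Last: PWideChar;
begin
  Result := nil;
  SubLen := Length(ASub);
  if (SubLen = 0) or (AStart = nil) or (AEnd - AStart < SubLen) then
    Exit;
  Last := AEnd - SubLen; // last position where ASub still fits
  while AStart <= Last do
  begin
    // Compare SubLen characters; CompareByte takes a length in bytes.
    if CompareByte(AStart^, ASub[1], SubLen * SizeOf(WideChar)) = 0 then
      Exit(AStart);
    Inc(AStart);
  end;
end;

var
  S: UnicodeString;
  P, Hit: PWideChar;
begin
  S := 'abc12def12';
  P := PWideChar(S);

  // Search only from the 6th char on: the first '12' is skipped.
  Hit := FindInRange('12', P + 5, P + Length(S));
  if Hit <> nil then
    WriteLn('found at index ', Hit - P + 1)
  else
    WriteLn('not found');

  // Same start, but the end excludes the last char: '12' no longer fits.
  Hit := FindInRange('12', P + 5, P + Length(S) - 1);
  if Hit <> nil then
    WriteLn('found at index ', Hit - P + 1)
  else
    WriteLn('not found');
end.
```

In ExecPrim the start would be fInputStart (or the pointer derived from AOffset, when no look-behind is involved) and the end would be fInputEnd.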
Also, this is off-topic to the original issue: in CompileRegExpr there is the loop to find regMustString
longest := nil;
Len := 0;
while scan <> nil do
begin
  if PREOp(scan)^ = OP_EXACTLY then
  begin
    longestTemp := scan + REOpSz + RENextOffSz + RENumberSz;
So other (shorter) "must exist" strings can be found before it. If they are not in a look-behind/ahead (if it can be known whether they are), then I would think they have to appear before the regMustString. Though I may be overlooking some detail...
If that is the case, then their total length can be accumulated, and the regMustString can be looked for at that extra offset.
This is probably not going to amount to much, but... (equally, any candidates after it that were not chosen can form an offset to the end)
If those thoughts make sense, they need to move to a new issue.
sorry, this post don't make sense to me. i don't get the idea.
this other place, while scan <> nil do, is better left untouched.
regMustString is a plain text which must exist inside the match. it is formed during compilation. e.g. if we have regex "\b12[345]\b" then regMustString is '12'.
what is use-case here? Martin, you will use it with SynEdit app. SynEdit has collection of lines. you will need to convert it to a single string, ok? UnicodeString (to match TRegExprString). so why do u need InputStart/End here? you can always pass the entire buffer string. like I do in my CudaText.
I am looking at writing an engine for TextMate grammar highlighting.
The use case is speed improvements.
The specifics are:
- re.Exec(CurrentOffsetInLine) (this exists)
- re.InputString := copy(line, start, len), and ^ / $ shall match accordingly. Lookaround should also respect the bounds. (So start isn't the same as offset.) Currently start is always 1, but end can be earlier; however, if implemented, make it generic, since start may be used in future.
This saves calling copy(), and making an extra copy of that part of the text.
It is dampened by TRegEx calling UniqueString. Which still makes a copy. But which currently makes a 2nd copy.
procedure TRegExpr.SetInputString(const AInputString: RegExprString);
begin
  ClearMatches;
  fInputString := AInputString;
  UniqueString(fInputString);
Even with a const param, AInputString has a refcount of at least 1 (the caller is holding it; without "const" it would be at least 2).
fInputString := AInputString; => increases the refcount to at least 2
UniqueString(fInputString); => always makes a copy
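The effect can be checked with a small stand-alone program. This is only an illustration, not TRegExpr code: Caller and Held are hypothetical names standing in for the caller's string and fInputString, and the data pointers are compared to see whether the two variables still share one buffer.

```pascal
program UniqueStringCopyDemo;
{$mode objfpc}{$H+}

var
  Caller, Held: UnicodeString;
  I: Integer;
begin
  // Build the string on the heap so it is reference counted
  // (string literals are constants and behave differently).
  SetLength(Caller, 16);
  for I := 1 to Length(Caller) do
    Caller[I] := 'x';

  // Plain assignment: both variables refer to the same buffer.
  Held := Caller;
  WriteLn('shared before UniqueString: ', Pointer(Held) = Pointer(Caller));

  // The buffer has two holders, so UniqueString must allocate a copy.
  UniqueString(Held);
  WriteLn('shared after UniqueString:  ', Pointer(Held) = Pointer(Caller));
end.
```

The first line should report TRUE and the second FALSE, which is the extra copy described above.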
I will remove (comment) UniqueString - 2 calls in unit. I don't recall how they helped me.
Ok thanks. Just to confirm, I did a few tests (with all 3 of them removed / the 3rd in my pull request).
1) All tests passed. 2) None of the strings were changed
AreEqual('unchanged', T.InputText, RE.InputString);
AreEqual('unchanged', T.Expression, RE.Expression);
The only reason I can think of why a UniqueString may be needed would be if the regex engine were modifying the input string (or expression string).
=> In that case, the string would move and the PChar would point to invalid data.
Or if the change was made via a PChar, then the string would not trigger copy-on-write, and other holders (of a reference to this text) would have their content changed.
But I could not find either of those happening.
If the string is changed outside the regex-engine, while fInputString refers it, that is no problem and does not need UniqueString.
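To illustrate the PChar case in isolation (a hypothetical stand-alone demo, not code from the regex unit): a write through a raw pointer bypasses copy-on-write and is seen by every holder of the buffer, while an indexed write makes a private copy first.

```pascal
program PCharWriteDemo;
{$mode objfpc}{$H+}

var
  A, B: UnicodeString;
begin
  // Heap-allocated, reference-counted string with the content 'abc'.
  SetLength(A, 3);
  A[1] := 'a';
  A[2] := 'b';
  A[3] := 'c';

  B := A; // A and B now share one buffer

  // Raw pointer write: no copy-on-write, so A changes as well.
  PWideChar(Pointer(B))^ := 'X';
  WriteLn(A); // Xbc
  WriteLn(B); // Xbc

  // Indexed write: B is made unique first, A keeps its content.
  B[2] := 'Y';
  WriteLn(A); // Xbc
  WriteLn(B); // XYc
end.
```

Since the engine only reads through fInputStart / fInputEnd, neither situation arises, which matches the test results above.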
engine does not change the InputString , nor the RegEx string. please comment all UniqueStrings in your PR.
pls, do PR to my repo. https://github.com/Alexey-T/TRegExpr
please comment all UniqueStrings in your PR.
Commented the one I added. Or shall I comment the pre-existing too?
I will remove them later. ok yet.
@User4martin Let's close the topic?
An application may have a long string, and needs to apply a pattern to a substring, such that ^ and $ will match begin/end of the substring.
Currently the application must make a copy of the substring
RegEx.InputString := copy(text, a, b);
It would be nice to save the time of copying strings around.
Alternatively the TRegEx could allow setting InputString from a PChar / PWideChar, and specifying the length (instead of looking for a terminating #0).