Open User4martin opened 10 months ago
On what case (RE, text) does engine fail currently?
I only deduced this from code review. But take U+10000 (https://www.compart.com/de/unicode/U+10000):
IsNotMatching('surrogat', '.+.', #$D800#$DC00);
fails (it will match).
This is one char, so the '.+' should consume it entirely and leave nothing for the extra '.'.
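To make the "one char" point concrete, here is a minimal Free Pascal sketch (not code from the engine). CodePointCount is a hypothetical helper written only for this illustration. It should report 2 code units but 1 code point for the test string.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not part of the engine): counts code points in a
// UTF-16 string, treating a well-formed high+low surrogate pair as one.
function CodePointCount(const S: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    if (i < Length(S)) and
       (Ord(S[i]) >= $D800) and (Ord(S[i]) <= $DBFF) and
       (Ord(S[i + 1]) >= $DC00) and (Ord(S[i + 1]) <= $DFFF) then
      Inc(i, 2)   // surrogate pair: two code units, one code point
    else
      Inc(i);     // BMP char (or lone surrogate): one code unit
    Inc(Result);
  end;
end;

var
  S: UnicodeString;
begin
  // Build U+10000 explicitly from its two UTF-16 code units
  SetLength(S, 2);
  S[1] := WideChar($D800);
  S[2] := WideChar($DC00);
  WriteLn('code units:  ', Length(S));
  WriteLn('code points: ', CodePointCount(S));
end.
```

With only one code point in the input, '.+.' needs two and so should not match.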
Btw, same issue with combining codepoints.
On https://regex101.com/ not all regex flavours handle this either (Python, Golang and Java seem to).
I have not further analysed this...
FindRepeated (for unicode) calls IncUnicode2, which may (for surrogates) increment by 2. For the OPs that can match a surrogate this will be a problem. OP_STAR/.... in MatchPrim will iterate the returned range in steps of one ReChar (code unit):
regInput := save + no;
Also the result of FindRepeated may be the
One way I can think of: with '(.+).' OP_STAR goes back half the surrogate, and then OP_ANY does not check that it matches the 2nd part of a surrogate.
This may be fixable (but I have not tested). Handle the case where
regInput := save + no;
points to the 2nd part of a surrogate.
FindRepeated must always return the amount of code units (ReChars), always counting a surrogate as 2.
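A rough, untested-against-the-engine sketch of that idea in Free Pascal. All names here (SplitsSurrogatePair, the loop, the sample string) are made up for illustration and are not the engine's own code. It assumes the greedy count is in code units and shows the backtracking loop skipping any position that would land on the 2nd part of a surrogate.

```pascal
program BacktrackDemo;
{$mode objfpc}{$H+}

function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Hypothetical helper: True if the 1-based index Pos is the low half of a
// well-formed surrogate pair, i.e. resuming there would split one char.
function SplitsSurrogatePair(const S: UnicodeString; Pos: Integer): Boolean;
begin
  Result := (Pos > 1) and (Pos <= Length(S)) and
            IsLowSurrogate(S[Pos]) and IsHighSurrogate(S[Pos - 1]);
end;

var
  S: UnicodeString;
  No: Integer;
begin
  // 'a' followed by U+10000 (two code units)
  SetLength(S, 3);
  S[1] := WideChar(Ord('a'));
  S[2] := WideChar($D800);
  S[3] := WideChar($DC00);

  // No = code units consumed by a greedy repeat that started at index 1.
  // Backtrack one code unit at a time, as "save + no" does, but never
  // resume on the 2nd part of a surrogate.
  No := Length(S);
  while No >= 1 do
  begin
    if SplitsSurrogatePair(S, 1 + No) then
      WriteLn('skip No=', No, ' (would resume inside a surrogate pair)')
    else
      WriteLn('try  No=', No);
    Dec(No);
  end;
end.
```

For this input only No=2 gets skipped, because the rest of the pattern would otherwise resume on $DC00.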