andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License

FindRepeated and Unicode / may break OP_STAR/PLUS/... #371

Open User4martin opened 10 months ago

User4martin commented 10 months ago

I have not analysed this any further...

In the Unicode build, FindRepeated calls IncUnicode2, which may advance by 2 code units when it hits a surrogate pair. That is a problem for every OP that can match a surrogate.

OP_STAR/.... in MatchPrim then iterates over the returned range in steps of one ReChar (one code unit): regInput := save + no;

Also the result of FindRepeated may be the


One case I can think of is (.+).

When backtracking, OP_STAR goes back by half a surrogate pair, and OP_ANY then does not check whether it is matching the 2nd half of a surrogate pair.
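To make the suspected mismatch concrete, here is a small Python model of it. This is not TRegExpr code: the text is a plain list of UTF-16 code units, and all function names (`scan_any`, `ends_by_code_unit`, `ends_by_code_point`, `splits_pair`) are made up for illustration. It only assumes what is described above: the greedy scan advances over whole surrogate pairs, while backtracking gives back one code unit at a time.

```python
# Toy model, NOT TRegExpr code: text is a list of UTF-16 code units.

def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def scan_any(units, start):
    """Greedy scan for '.': advance by whole code points, return the end offset."""
    pos = start
    while pos < len(units):
        if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
            pos += 2  # surrogate pair: two code units, one code point
        else:
            pos += 1
    return pos

def splits_pair(units, pos):
    """True if offset pos lies between the two halves of a surrogate pair."""
    return 0 < pos < len(units) and is_high(units[pos - 1]) and is_low(units[pos])

def ends_by_code_unit(start, end):
    """End offsets tried for '.+' when backtracking one code unit at a time."""
    return list(range(end, start, -1))

def ends_by_code_point(units, start, end):
    """End offsets tried for '.+' when backtracking only to code point boundaries."""
    return [p for p in range(end, start, -1) if not splits_pair(units, p)]

units = [0xD800, 0xDC00]  # U+10000 as a surrogate pair
end = scan_any(units, 0)
print(ends_by_code_unit(0, end))          # includes offset 1, inside the pair
print(ends_by_code_point(units, 0, end))  # only the boundary after the pair
```

In this model the per-code-unit backtracking offers offset 1 as a place to continue matching, which is exactly where a following OP_ANY would see only the low surrogate. Restricting the candidates to code point boundaries removes that offset.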


This may be fixable (but I have not tested)

Alexey-T commented 10 months ago

For which case (RE, text) does the engine currently fail?

User4martin commented 10 months ago

I only deduced this from code review. But take U+10000 (https://www.compart.com/de/unicode/U+10000):

IsNotMatching('surrogat', '.+.', #$D800#$DC00); fails (it will match).

This is one char, so the .+ should consume it entirely and leave nothing for the extra '.'.
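For comparison, the same pattern can be checked in Python, whose `re` module works on code points. This sketch is only a reference for the expected behaviour, not a test of TRegExpr. The second string keeps the two surrogates as separate units (Python allows that in a `str`) to imitate an engine that steps through the text one code unit at a time. The helper name `fully_matches` is made up for the example.

```python
import re

PATTERN = re.compile(r".+.")

# U+10000 as a single code point.
one_code_point = "\U00010000"

# The same surrogate pair kept as two separate units (lone surrogates),
# imitating what a code-unit-based engine would see.
two_code_units = "\ud800\udc00"

def fully_matches(text):
    """True if '.+.' matches the whole text."""
    return PATTERN.fullmatch(text) is not None

result_code_point = fully_matches(one_code_point)  # one char: '.+.' needs two
result_code_units = fully_matches(two_code_units)  # two units: '.+.' fits
```

With one code point the pattern has nothing left for the trailing `.` and does not match. With two independent units it matches, which is the behaviour the failing IsNotMatching test reports.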

Btw, same issue with combining codepoints.
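A short Python sketch of the combining case, again only as an illustration and not TRegExpr code: "e" followed by U+0301 (combining acute accent) is displayed as one character but consists of two code points, so a code-point-based engine such as Python's `re` lets `.+.` match it. Whether a regex engine should treat such a sequence as a single char is a separate question that this example does not answer.

```python
import re
import unicodedata

decomposed = "e\u0301"                               # 'e' + combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Two code points: '.+' takes the 'e', the trailing '.' takes the accent.
decomposed_matches = re.fullmatch(r".+.", decomposed) is not None

# One code point: nothing is left for the trailing '.'.
composed_matches = re.fullmatch(r".+.", composed) is not None
```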


On https://regex101.com/ not all regex flavours handle this either (Python, GoLang and Java seem to).