andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License

$DEFINE UnicodeRE off -> tests fail #343

Open Alexey-T opened 11 months ago

Alexey-T commented 11 months ago

Martin, can you adjust the test project so it does not fail with UnicodeRE off? You are really good at composing tests, I admit. @user4martin

Alexey-T commented 11 months ago

The test project has its own copy of the UnicodeRE define, but even if I disable it, the tests fail! @user4martin

User4martin commented 11 months ago

Well, I haven't checked the background yet, but

    // 69
    ( // empty str
    expression: '^ *$';
    inputText: '';
    substitutionText: '';
    expectedResult: '';
    matchStart: 1

fails for me, and that looks like a bug in the regex engine.
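As a cross-check of the expected result, here is a minimal sketch that uses Python's `re` module rather than TRegExpr (an assumption made purely for illustration; the two engines are not guaranteed to agree in general). The pattern `^ *$` should match the empty string with a zero-length match at the first position, which corresponds to `matchStart: 1` in the 1-based test data.

```python
import re

# Same pattern and input as test case 69; Python's re stands in for TRegExpr here.
pattern = re.compile(r"^ *$")
match = pattern.search("")

# Convert Python's 0-based offset to the 1-based position used in the test data.
match_start = match.start() + 1 if match else None
matched_text = match.group(0) if match else None
```

If an engine reports no match for this case, the anchors or the zero-width `*` repetition are the likely places to look.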


The others are down to the define.

Ideally the defines need to move into their own include files.

Then (and I can check that) the failing test may need to be disabled.

The [-] range of Russian characters does not seem to be implemented for UTF-8 yet. It is possible, but it is an issue of its own (and not necessarily one I will have time for soon).
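To illustrate why a character range needs extra work in UTF-8, here is a rough Python sketch. It is not TRegExpr code: the helper names are hypothetical, and the range endpoints U+0430..U+044F are an assumption chosen only for the example. Each Cyrillic letter occupies two bytes in UTF-8, so a range test has to decode the code point first instead of comparing single bytes.

```python
from typing import Tuple


def decode_utf8_char(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (code_point, byte_length) of the UTF-8 character starting at pos."""
    first = data[pos]
    if first < 0x80:
        length = 1
    elif first >> 5 == 0b110:
        length = 2
    elif first >> 4 == 0b1110:
        length = 3
    elif first >> 3 == 0b11110:
        length = 4
    else:
        raise ValueError("invalid UTF-8 lead byte")
    # Let the built-in decoder validate the continuation bytes.
    return ord(data[pos:pos + length].decode("utf-8")), length


def in_char_range(data: bytes, pos: int, low: int, high: int) -> bool:
    """Range test on the decoded code point, not on individual bytes."""
    code_point, _ = decode_utf8_char(data, pos)
    return low <= code_point <= high


# Hypothetical range endpoints, assumed for illustration only.
RANGE_LOW, RANGE_HIGH = 0x0430, 0x044F

sample = "ж".encode("utf-8")  # two bytes: D0 B6
is_member = in_char_range(sample, 0, RANGE_LOW, RANGE_HIGH)
```

A byte-wise comparison would see only `D0` at the first position, which says nothing about where the character falls within the range.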

The #$85 line break: same thing. But maybe that one can be fixed easily for UTF-8.

User4martin commented 11 months ago

IsAnyLineBreak could be changed to take a pointer to ReChar.

Then it could return zero, or the length of any matched line break. That way it could handle utf-8 encoded line breaks of more than one byte.

The test case would then need to be changed to have #$C2#$85 in the string.
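The idea above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual `IsAnyLineBreak` code: the function name `line_break_length`, the use of a byte offset in place of a pointer, and the exact set of line breaks are all hypothetical choices for the example.

```python
# Line breaks as UTF-8 byte sequences. Which sequences to include is an
# assumption of this sketch, not taken from TRegExpr.
LINE_BREAKS = (
    b"\r\n",          # CR LF, tried before a lone CR
    b"\n",            # U+000A LINE FEED
    b"\r",            # U+000D CARRIAGE RETURN
    b"\xc2\x85",      # U+0085 NEXT LINE (NEL)
    b"\xe2\x80\xa8",  # U+2028 LINE SEPARATOR
    b"\xe2\x80\xa9",  # U+2029 PARAGRAPH SEPARATOR
)


def line_break_length(data: bytes, pos: int) -> int:
    """Return the byte length of the line break at pos, or 0 if there is none."""
    for seq in LINE_BREAKS:
        if data.startswith(seq, pos):
            return len(seq)
    return 0
```

With this shape, the caller can advance by the returned length, and `line_break_length(b"a\xc2\x85b", 1)` yields 2 for the two-byte NEL. A lone `0x85` byte is not reported, because it is not a valid UTF-8 encoding of U+0085.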

Alexey-T commented 11 months ago

> Then it could return zero, or the length of any matched line break.

Does the code really need this, if it already works well? It only adds more complex logic.

User4martin commented 11 months ago

> Then it could return zero, or the length of any matched line break.
>
> Does the code really need this, if it already works well? It only adds more complex logic.

Well, does "not implemented" count as "works well"?

At the moment, in the UTF-8 version, line breaks like 'NEXT LINE (NEL)' (U+0085) are simply not detected. UTF-8 is Unicode, so those code points do exist.

That is, unless it is meant to be ASCII? Then a UTF-8 version is really needed. (And AFAIK there is more to fix for proper UTF-8 support, but this would be a start.)

Alexey-T commented 11 months ago

So it is needed, okay.

Alexey-T commented 11 months ago

But do we really need to find a pure Unicode line break in non-Unicode mode? Logically, we can ignore chr(85) in non-Unicode mode.

User4martin commented 11 months ago

> But do we really need to find a pure Unicode line break in non-Unicode mode? Logically, we can ignore chr(85) in non-Unicode mode.

IMHO: wrong question. UTF-8 is also a Unicode mode.

The question is: does the regex engine currently have an ASCII (non-Unicode) mode or a UTF-8 (Unicode) mode?

But IMHO the answer does not matter: a UTF-8 mode is what is needed.

So then the only question is: UTF-8 mode: add or fix?

Alexey-T commented 11 months ago

> So then the only question is: UTF-8 mode: add or fix?

Add.