andgineer / TRegExpr

Regular expressions (regex), Pascal.
https://regex.sorokin.engineer/en/latest/
MIT License

Upgrade regular expression engine to support features of major flavours like PCRE2, ECMAScript etc. #285

Closed: mcarans closed this issue 1 year ago

mcarans commented 2 years ago

Regular expressions that work in online tools like https://regex101.com/ do not work in TRegExpr, which is rather frustrating.

For example, I wanted to match a string that does not start with something. I looked for examples and found this 2009 Stack Overflow answer. I tried the first two options in the answer (one using negative lookahead and one using negative lookbehind) via CudaText and neither works. Both give the error: "lookaround brackets must be at the very beginning/ending".
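To make the use case concrete, here is a minimal sketch of the "does not start with something" check using a negative lookahead. It is written for Python's `re` module, not TRegExpr. The prefix `foo` and the function name are hypothetical, since the issue does not name the actual prefix or quote the Stack Overflow patterns.

```python
import re

# Hypothetical prefix "foo"; the issue does not say which prefix was needed.
# (?!foo) is a negative lookahead: it succeeds only if "foo" does NOT follow.
NOT_STARTING_WITH_FOO = re.compile(r"^(?!foo).*")

def does_not_start_with_foo(text: str) -> bool:
    """Return True when text does not begin with the literal 'foo'."""
    return NOT_STARTING_WITH_FOO.match(text) is not None

print(does_not_start_with_foo("barfoo"))
print(does_not_start_with_foo("foobar"))
```

Note that the lookahead here sits at the start of the expression, which is exactly the kind of placement the quoted error message rejects.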

I checked the documentation and TRegExpr seems to have some major limitations:

Limitations:

Brackets for lookahead must be at the very ending of expression, and brackets for lookbehind must be at the very beginning. So assertions between choices |, or inside groups, are not supported.
For lookbehind (?<!foo)bar, regex “foo” must be of fixed length, i.e. it contains only operations with fixed-length matches. Quantifiers are not allowed, except braces with a fixed repeat count, {n} or {n,n}. Char-classes are allowed here, dot is allowed, \b and \B are allowed. Groups and choices are not allowed.
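A fixed-length lookbehind restriction of this kind is not unique to TRegExpr. As a rough illustration, Python's `re` module also rejects variable-width lookbehind at compile time, so it can be used to see which of the patterns above are "fixed length". This sketch shows Python's behavior only; TRegExpr's exact rules (for example, its ban on groups and choices) differ. The helper name `compiles` is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-length lookbehind bodies are accepted.
print(compiles(r"(?<!foo)bar"))     # literal of length 3
print(compiles(r"(?<!fo{2})bar"))   # {n} keeps the length fixed

# An unbounded quantifier makes the length variable, so compilation fails.
print(compiles(r"(?<!fo*)bar"))
```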

I would like to be able to construct a regular expression in an online regex tester and have it work in TRegExpr. Is it possible to fix these limitations and bring the regular expression engine up to date so that it supports the full range of features of major flavours like PCRE2, ECMAScript etc.?

Help is needed to resolve this issue, which relates to one on CudaText.

User4martin commented 1 year ago

@Alexey-T

I am currently working on part of this issue: improved support for look-around, i.e. unlimited and nested positive and negative look-ahead and look-behind, including variable-length look-behind. Well, almost.

On that part, the question is what is acceptable.

In look-behind (as far as I can deduce from some tests), loops (greedy or non-greedy) are evaluated backwards.

https://regex101.com/r/h0TzgP/1: (?<=(A??a*))X applied to AaaaaX

If you match that left-to-right, there is no (reasonable) way to find that.

https://regex101.com/r/oldaRx/1: (?<=(a*)(a*))X applied to aaaaX. It is group 2 that gets all the "a" characters (matching forward, it would have been group 1).

To match those particular patterns, one would need to compile the regex sub-expression backwards, and apply it backwards.
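The effect of direction on captures can be sketched in Python for the (a*)(a*) example. This is only an emulation, not how TRegExpr or any engine implements it: "backward" matching is imitated by reversing the text and hand-reversing the sub-expression, so the group order swaps. Both function names are hypothetical, and the sketch assumes the text before X consists only of "a" characters.

```python
import re

def forward_captures(run: str) -> tuple[str, str]:
    """Match (a*)(a*) left to right over the run of text preceding X."""
    m = re.fullmatch(r"(a*)(a*)", run)
    if m is None:
        raise ValueError("sketch assumes the run contains only 'a'")
    return m.group(1), m.group(2)

def backward_captures(run: str) -> tuple[str, str]:
    """Emulate right-to-left evaluation of (a*)(a*) ending at X.

    The reversed sub-expression meets group 2 first, so it is written
    first here and applied to the reversed text.
    """
    m = re.match(r"(?P<g2>a*)(?P<g1>a*)", run[::-1])
    # Un-reverse the captured text to restore the original order.
    return m.group("g1")[::-1], m.group("g2")[::-1]

run = "aaaaX"[: "aaaaX".index("X")]   # the text before X
print(forward_captures(run))   # group 1 takes every "a"
print(backward_captures(run))  # group 2 takes every "a"
```

The overall look-behind succeeds either way; only the contents of the capture groups depend on the direction.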

In any case, we currently match from left to right.

In that case, the content of the capture may not be what it should be. Actually:

But

So, is that acceptable?

Or should look-behind be restricted to be any of

And fail to compile for (having all of):

EDIT: Actually, multiple non-greedy loops can also put a capture at the wrong position.

So "capture" and "variable len" don't work together.

User4martin commented 1 year ago

Current preview: https://github.com/User4martin/TRegExpr/tree/lookaround (does not yet detect fixed length)

Alexey-T commented 1 year ago

Or should look-behind be restricted to be any of

If your talent and time allow you to run matches backward, that would be good. If not, I can live with restrictions: failing to compile certain operators in the lookbehind.

User4martin commented 1 year ago

I already got them running (variable length, and backward). The only problem is that if captures are used together with variable length, the capture could match at the wrong position.

So that may need to be disabled, or at least be made an option (for many patterns they will still work correctly).

Alexey-T commented 1 year ago

So that may need to be disabled

right...

Alexey-T commented 1 year ago

@CaptainFlint You can suggest new version 1.163 to Ghisler.

Alexey-T commented 1 year ago

@andgineer Let's close this? Look-around can now be anywhere.