andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 63 forks source link

Another invalid regex which is working #155

Closed SlMaker closed 4 years ago

SlMaker commented 4 years ago

from test 29: .*?\b(https?|ftp)://(?:\w+)\.(?:\w+)\.(\w+)

https://regex101.com/ says: / An unescaped delimiter must be escaped with a backslash (\)

Alexey-T commented 4 years ago

it is false error. / slash must not be escaped (it's optional for it).

andgineer commented 4 years ago

@SlMaker just select for example 'Python' flavour on https://regex101.com/ and the expression will work.

May be we have a bug in https://regex.sorokin.engineer/en/latest/regular_expressions.html ? Do you see someplace were it said "you should escape slash (not backslash)"?

SlMaker commented 4 years ago

Maybe you should note which flavour you're following. Thought it is PCRE which is the most used one...

Alexey-T commented 4 years ago

@andgineer decide what to do with dot inside [].

andgineer commented 4 years ago

@Alexey-T this question is about / what do you mean about dot inside [] ?

andgineer commented 4 years ago

Maybe you should note which flavour you're following. Thought it is PCRE which is the most used one...

Well you know I wrote the lib more than 20 years ago.. So this is not clear am I follow or vice versa :)

I have documentation exactly about this implementation on regex.sorokin.engineer and in fact I have never compared that with some specific flavour. May be I should do that, you are right - a lot of things changed in so many years..

SlMaker commented 4 years ago

@andgineer see PR #157 where the regex with dot inside a set give wrong and totally unexpected results.

SlMaker commented 4 years ago

It didn't behaved like that with dots inside sets before which breaks some regexes for me

andgineer commented 4 years ago

could you give exact example?

SlMaker commented 4 years ago

gave them in #157: RunTest43 Failed: Replace failed expected: <- . -> but was: <Pictures-2018.Spain-Madrid> RunTest44 Failed: Replace failed expected: <.Test.> but was: <This.Is.A.Test.1234_Test_abc> see https://travis-ci.org/github/andgineer/TRegExpr/jobs/688913370, don't know why it says success if two tests failed! regex ([.-]Test[.-]) shouldn't return This.Is.A.Test.1234_Test_abc, it should only match .Test. as every other regex on regex101 does and also the older versions of TRegExpr.

andgineer commented 4 years ago

@SlMaker really strange - test failes but exit code 0.. I am not familiar with the test framework but it did report fails as non-zero code before..

@Alexey-T could you help with debugging the failed testcases? RE pseudo-code looks ok but TRegExpr just do not match anything so it returns initial string - Replace changes only parts of the input string that match re

SlMaker commented 4 years ago

For Delphi it returns an exit code but the FPC tests doesn't. But uses different test frameworks so don't know how to fix

Alexey-T commented 4 years ago

finder is fixed in my PR today. now last RE test fails. Test44. why? it calls finder, then calls Replace. finder works ok. Replace (it calls Substitute) gets whole string as result - it is some bug. I didnt find it. yet @andgineer

Alexey-T commented 4 years ago

maybe ALL replace-tests are wrong. see this:

    // 2
    (
    expression: '(\w*)';
    inputText: 'name.ext';
    substitutionText: '$1.new';
    expectedResult: 'name.new.new.ext.new.new';
    MatchStart: 0
    ),

Why? such result ok? it's not my work (regex find code work).

SlMaker commented 4 years ago

Testing it on https://regex101.com/r/HblhQF/1 gives also 4 matches: name, empty, ext, empty

name match: name.new empty match: .new ext match: ext.new empty match: .new but the dot between new and ext shouldn't be there?

andgineer commented 4 years ago

yes Replace replaces only matched part an leave all others untouched..

andgineer commented 4 years ago

And all the tests was green before..

andgineer commented 4 years ago

And last test still does not work - just do not see any matches

Test: This.Is.A.Test.1234_Test_abc
  Modifiers "gsr-imx"
  Regular expression: ([.-]Test[.-]) ,
  compiled into p-code:
   1:BRANCH (88)
10:OPEN[1] (19)
19:BRANCH (79)
28:ANYOF (45) Ch(.-)
45:EXACTLY (62) Test
62:ANYOF (79) Ch(.-)
79:CLOSE[1] (88)
88:END (0)

First charset: -.

  Expression found at: -1 
andgineer commented 4 years ago

BTW this is wrong - . without escaping should be metachar any char and not just dot itself as in opcode

Alexey-T commented 4 years ago

@andgineer you tested wrong somehow, see my test with last TRegexpr Screenshot from 2020-05-22 20-35-52

Alexey-T commented 4 years ago

last test fails? yes, and I cannot get why. some bug in Replace? delete Test44 please.

SlMaker commented 4 years ago

Why deleting the test? The test is correct. Better fix the bug!

SlMaker commented 4 years ago

So it is intended that / does not to be escaped? If yes, please document and close this issue

andgineer commented 4 years ago

Sorry I am confused - where to document this? In the documentation now: https://regex.sorokin.engineer/en/latest/regular_expressions.html#escaping

you can prefix (or escape) with \ any character that has special meaning in regular expressions.

but slash / does not have any special meaning in re..

SlMaker commented 4 years ago

If you use the regex .*?\b(https?|ftp)://(?:\w+)\.(?:\w+)\.(\w+) from above it'll result in an error for PCRE and Javascript flavour on regex101. Probably because you can do: \stest\s/i for case-insensitive. See also https://stackoverflow.com/questions/6076229/escaping-a-forward-slash-in-a-regular-expression where they say it's for pattern delimiter and need escaping depending on flavour.

andgineer commented 4 years ago

yes you are right - they use /i and so on in TRegExpr we use (?i) https://regex.sorokin.engineer/en/latest/regular_expressions.html#imsgxr-imsgxr

and slash does not have any special meaning so you doe not HAVE TO escape it. but if you want regex compatible with other regex flavours you CAN escape it and TRegExpr return the same result as without escaping.