andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 62 forks source link

First charset #358

Closed User4martin closed 10 months ago

User4martin commented 10 months ago

For a regex (?=abc) the "a" from the look ahead is a valid first char.

Further (?=[abc])[ced] can only have "c" as first char (otherwise either the lookahead, or the main expression will fail)


I did forgot some code when I optimized the branches. Since dummy (non-choice) branches are now removed, there no longer is a need to check for them.

User4martin commented 10 months ago

Added an improvement to the loop searching for the first char.

Moving some conditions outside the loop, so they don't need to be checked on each iteration.

Running 8 repeats for each benchmark. Use -c <n> to change

                                                        Time        Before |      After | Match count
========================================================================================================
                                                  /Twain/ :          41 ms |      31 ms |        2388
                                              /(?i)Twain/ :          56 ms |      46 ms |        2657
                                             /[a-z]shing/ :         212 ms |     207 ms |        1877
                             /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :          41 ms |      27 ms |         396
                                                      /./ :         408 ms |     408 ms |    18905427
                                                    /(.)/ :         634 ms |     623 ms |    18905427
                                                      /e/ :          84 ms |      74 ms |     1781425
                                                    /(e)/ :         103 ms |      93 ms |     1781425
                                           /(?s).{1,45} / :          35 ms |      35 ms |      475715
                                         /(?s)\G.{1,45} / :          15 ms |      17 ms |       10616
                                         /\G(?s).{1,45} / :          15 ms |      15 ms |       10616
                                          /(?s).{1,45}? / :         148 ms |     154 ms |     3241534
                                        /(?s)\G.{1,45}? / :          17 ms |      17 ms |       69431
                                        /\G(?s).{1,45}? / :          19 ms |      17 ms |       69431
                                              /\b\w+nn\b/ :         207 ms |     201 ms |         359
                                       /[a-q][^u-z]{13}x/ :         615 ms |     617 ms |        4929
                            /Tom|Sawyer|Huckleberry|Finn/ :          41 ms |      31 ms |        3015
                        /(?i)Tom|Sawyer|Huckleberry|Finn/ :         146 ms |     132 ms |        4820
                    /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :        1636 ms |    1621 ms |        3015
                    /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :        1826 ms |    1845 ms |        2220
                      /Tom.{10,25}river|river.{10,25}Tom/ :          56 ms |      41 ms |           2
                                           /[a-zA-Z]+ing/ :         544 ms |     546 ms |       95863
                                  /\s[a-zA-Z]{0,12}ing\s/ :         261 ms |     244 ms |       67810
                          /([A-Za-z]awyer|[A-Za-z]inn)\s/ :         525 ms |     501 ms |         313
                              /["'][^"']{0,30}[?!\.]["']/ :          83 ms |      70 ms |        9857
                                /Tom(.{3,3}|.{5,5})*Finn/ :        2828 ms |    2750 ms |          11
                                    /Tom(...|.....)*Finn/ :        1242 ms |    1357 ms |          11
                                   /Tom(...|.....)*?Finn/ :        1345 ms |    1332 ms |          11
                      /Tom((...|.....){2,9}\s){1,5}?Finn/ :        3613 ms |    3586 ms |          11
                      /Tom((...|.....){2,9}?\s){1,5}Finn/ :        3578 ms |    3568 ms |          11
                    /Tom((?:...|.....){2,9}\s){1,5}?Finn/ :        2808 ms |    2847 ms |          11
                    /Tom((?:...|.....){2,9}?\s){1,5}Finn/ :        2820 ms |    2822 ms |          11
                    /Tom((?>...|.....){2,9}\s){1,5}?Finn/ :          41 ms |      31 ms |           2
                    /Tom((?>...|.....){2,9}?\s){1,5}Finn/ :          41 ms |      29 ms |           2
                                        /\G(?is).(?=.*$)/ :         824 ms |     830 ms |    20045118
                                /\G(?is).(?=(.){1,5}?$)?/ :        3103 ms |    3076 ms |    20045118
                                       /\G(?is).(?=.*+$)/ :         830 ms |     849 ms |    20045118
                /\G(?is).{10,10}(?=(e|y|on|fin|.){0,20})/ :        1404 ms |    1408 ms |     2004511
              /\G(?is).{10,10}(?=(?>e|y|on|fin|.){0,20})/ :        1408 ms |    1402 ms |     2004511
                                          /Tom(?!.*Finn)/ :          41 ms |      29 ms |        2441
     /(?i)(?>[aeoui][bcdfghjklmnpqrstvwxyz \r\n]*){1,40}/ :         611 ms |     607 ms |      579843
                          /(?i)[ bdlm][abdegij][mnoprst]/ :         226 ms |     216 ms |      788955
Total:                                                            34546 ms |   34367 ms |
User4martin commented 10 months ago

Hold it => I forgot to run the test Now, tests pass.

Alexey-T commented 10 months ago

Hm. https://www.rexegg.com/regex-lookarounds.html

(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo

and you write "For a regex (?=abc) the "a" from the look ahead is a valid first char." this (?=abc) is full regex? nonsense. it must be part only. after some other part.

User4martin commented 10 months ago

Hm. https://www.rexegg.com/regex-lookarounds.html

(?=foo) Lookahead Asserts that what immediately follows the current position in the string is foo

and you write "For a regex (?=abc) the "a" from the look ahead is a valid first char." this (?=abc) is full regex? nonsense. it must be part only. after some other part.

(?=[abc]) can be a full regex https://regex101.com/r/v8Ejfp/1 On the text 123a56 it matches a 0 len string at pos 3 (TRegEx pos=4 / 1 based). But the "a" must be present in the next char.

In ExexPrim the Ptr starts at fInputString Ptr^='1' => when it reaches Ptr^='a' then it starts the match. Then the first (and at that time also next) char is "a" => and the regex will match.

Mind that "the current position" can either be read as


Using a regex like (?=[abc]) can make sense. If you want to find all occurrences of a pattern.

As brought up recently the "global" modifier (match all == ExecNext) means that in case of a 0 len match, the pos increments by 1. Btw ExecNext has

  // prevent infinite looping if empty string matches r.e.
  if PtrBegin = PtrEnd then
    Inc(Offset);

So if you have (?=(?:abba|a)) then an the text "abc abba abc`, it will match at all 4 "a", including the "a" at the end of abba. Without the look ahead, It would only match 3 times.

Alexey-T commented 10 months ago

Okay, I got it, thanks.

Alexey-T commented 10 months ago
    // Dig out information for optimizations.
    IsFixedLengthEx(op, FMinMatchLen, MaxMatchLen);

Here 2 or 3 uninited vars are passed to function

Alexey-T commented 10 months ago
     OP_LOOKAHEAD:
        begin
          opnd := PRegExprChar(AlignToPtr(Next + 1)) + RENextOffSz;
          Next := regNextQuick(Next);
          FillFirstCharSet(Next);
          if opnd^ = OP_LOOKAROUND_OPTIONAL then
            Exit;

seems the last "if" is redundant.

User4martin commented 10 months ago
    // Dig out information for optimizations.
    IsFixedLengthEx(op, FMinMatchLen, MaxMatchLen);

Here 2 or 3 uninited vars are passed to function

Those params are "out param" => but IIRC some older compiler does not support this => therefore they are declared var param.


seems the last "if" is redundant.

(?=[abc])? The expression matches, even if the look-ahead fails. So we can't use the [abc] as first chars.

At the point of the "if opnd" the first-char array has been filled for whatever is after the look ahead. E.g. (?=[abc])?[bcd] => first-char-array will have [bcd]. And we can't limit that by doing [abc] AND [bcd].

If the look ahead is NOT optional (?=[abc])[bcd], then the first char must match both => so first char is only "bc"

User4martin commented 10 months ago

seems the last "if" is redundant.

Or did you mean opnd vs next?

next can point to a later op. Eg. a|(?=b)|c => next will point to the end of the branch (behind the last alternative). opnd - in this case - is always the statement following in the stream.


The problem is caused by how the parser works. The optional info needs to be appended, after the entire look-ahead has been written. That is why a dummy-opcode OP_LOOKAROUND_OPTIONAL is inserted.

Well, truth to be told, with some effort, it may be possible to rewrite that and update the original op. But at the moment, the dummy op is inserted.

Alexey-T commented 10 months ago

"if" is redundant

sorry, you're ok.