andgineer / TRegExpr

Regular expressions (regex), pascal.
https://regex.sorokin.engineer/en/latest/
MIT License
174 stars 62 forks source link

various optimizations #357

Closed User4martin closed 10 months ago

User4martin commented 10 months ago

Most savings are from "Don't omit OP_OPEN/CLOSE for (?: " (see the +++ in the table) Yet their is a bit of speed gain through the other commits too.

Running 2 repeats for each benchmark. Use -c <n> to change
                                                           Time     before |      after    | Match count
=======================================================================================================
                                                  /Twain/ :          31 ms |     39 ms     |        2388
                                              /(?i)Twain/ :          54 ms |     55 ms     |        2657
                                             /[a-z]shing/ :         226 ms |    210 ms     |        1877
                             /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :          31 ms |     31 ms     |         396
                                                      /./ :         445 ms |    398 ms     |    18905427
                                                    /(.)/ :         672 ms |    617 ms     |    18905427
                                                      /e/ :          78 ms |     70 ms     |     1781425
                                                    /(e)/ :         101 ms |     93 ms     |     1781425
                                           /(?s).{1,45} / :          31 ms |     39 ms     |      475715
                                         /(?s)\G.{1,45} / :         140 ms |     16 ms     |       10616
                                         /\G(?s).{1,45} / :           8 ms |     15 ms     |       10616
                                          /(?s).{1,45}? / :         156 ms |    140 ms     |     3241534
                                        /(?s)\G.{1,45}? / :         141 ms |     15 ms     |       69431
                                        /\G(?s).{1,45}? / :          15 ms |     15 ms     |       69431
                                              /\b\w+nn\b/ :         218 ms |    211 ms     |         359
                                       /[a-q][^u-z]{13}x/ :         609 ms |    593 ms     |        4929
                            /Tom|Sawyer|Huckleberry|Finn/ :          31 ms |     39 ms     |        3015
                        /(?i)Tom|Sawyer|Huckleberry|Finn/ :         140 ms |    141 ms     |        4820
                    /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :        1601 ms |   1617 ms     |        3015
                    /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :        1797 ms |   1820 ms     |        2220
                      /Tom.{10,25}river|river.{10,25}Tom/ :          47 ms |     47 ms     |           2
                                           /[a-zA-Z]+ing/ :         570 ms |    547 ms     |       95863
                                  /\s[a-zA-Z]{0,12}ing\s/ :         265 ms |    257 ms     |       67810
                          /([A-Za-z]awyer|[A-Za-z]inn)\s/ :         515 ms |    555 ms     |         313
                              /["'][^"']{0,30}[?!\.]["']/ :          70 ms |     70 ms     |        9857
                                /Tom(.{3,3}|.{5,5})*Finn/ :        2711 ms |   2726 ms     |          11
                                    /Tom(...|.....)*Finn/ :        1226 ms |   1219 ms     |          11
                                   /Tom(...|.....)*?Finn/ :        1352 ms |   1343 ms     |          11
                      /Tom((...|.....){2,9}\s){1,5}?Finn/ :        3585 ms |   3570 ms     |          11
                      /Tom((...|.....){2,9}?\s){1,5}Finn/ :        3664 ms |   3570 ms     |          11
                    /Tom((?:...|.....){2,9}\s){1,5}?Finn/ :        3601 ms |   2812 ms +++ |          11
                    /Tom((?:...|.....){2,9}?\s){1,5}Finn/ :        3671 ms |   2812 ms +++ |          11
                    /Tom((?>...|.....){2,9}\s){1,5}?Finn/ :          39 ms |     39 ms     |           2
                    /Tom((?>...|.....){2,9}?\s){1,5}Finn/ :          31 ms |     31 ms     |           2
                                        /\G(?is).(?=.*$)/ :         945 ms |    828 ms     |    20045118
                                /\G(?is).(?=(.){1,5}?$)?/ :        3320 ms |   3211 ms     |    20045118
                                       /\G(?is).(?=.*+$)/ :         922 ms |    843 ms     |    20045118
                /\G(?is).{10,10}(?=(e|y|on|fin|.){0,20})/ :        1382 ms |   1390 ms     |     2004511
              /\G(?is).{10,10}(?=(?>e|y|on|fin|.){0,20})/ :        1391 ms |   1391 ms     |     2004511
                                          /Tom(?!.*Finn)/ :          39 ms |     39 ms     |        2441
     /(?i)(?>[aeoui][bcdfghjklmnpqrstvwxyz \r\n]*){1,40}/ :         570 ms |    554 ms     |      579843
                          /(?i)[ bdlm][abdegij][mnoprst]/ :         219 ms |    219 ms     |      788955
Total:                                                            36672 ms |  34257 ms     |

A (?: has only an effect on either encapsulating branchs | or loops => in both cases - once parsed - the op codes of those correctly represent start and end of the sub-pattern. Of course the output of Dump is less human readable. But that must be expected, if an optimizer makes changes.

(?:a|b)|(?:\d|\w)\s|(?:12)+

 1: BRANCH (53)                        1: G_BRANCH (43)    
 8: OPEN (15)  [1]                                         no recursion needed, the (?: is ended by OP_COMMENT
15:   G_BRANCH_EX (31)                 8: G_BRANCH_EX (23) 
22:   EXACTLY (46) :a                 15: EXACTLY (38) a   
31:   G_BRANCH_EX (46)                23: G_BRANCH_EX (38) 
38:   EXACTLY (46) b                  30: EXACTLY (38) b   
46: CLOSE (165)  [1]                  38: COMMENT (125)     => OP_END (end of sub-expr)
53: BRANCH (103)                      43: BRANCH (84)      
60: OPEN (67)  [2]                                         no recursion needed, the (?: is ended by OP_COMMENT
67:   BRANCH (79)                     50: BRANCH (62)      
74:   ANYDIGIT (91)                   57: ANYDIGIT (74)    
79:   BRANCH (91)                     62: BRANCH (74)      
86:   ANYLETTER (91)                  69: ANYLETTER (74)   
91: CLOSE (98)  [2]                   74: COMMENT (79)     
98: ANYSPACE (165)                    79: ANYSPACE (125)    => OP_END (end of sub-expr)
103: BRANCH (165)                     84: G_BRANCH_EX (125) => OP_END (begin last sub-expr in outer branch)
110: OPEN (117)  [3]                                       no recursion needed
117:   BRANCH (134)                                        
124:   EXACTLY (134) 12               91: EXACTLY (101) 12
134: CLOSE (141)  [3]                                      
141: BRANCH (153)                     101: BRANCH (113)    
148: BACK (110)                       108: BACK (91)       
153: BRANCH (160)                     113: BRANCH (120)    
160: NOTHING (165)                    120: NOTHING (125)    => OP_END (end of sub-expr)
165: END (0)                          125: END (0)         

The OP_COMMENT is needed, if a branch has several sub-expr. They all need to point to an ender before returning to the parent (removing this is possible, but currently to complicated) The last segment has no OP_COMMENT, as there is no OP_BRANCH either. The (?: is handled by the loop.

In case of 1(?:ab)2 the brackets have no meaning. So the code is just inlined:

 1: EXACTLY (9) 1
 9: EXACTLY (18) ab
18: EXACTLY (26) 2
26: END (0)
Alexey-T commented 10 months ago

@User4martin do we still need:

  OP_OPEN = TREOp(50); // Opening of group
  OP_CLOSE = TREOp(51); // Closing of group
  OP_OPEN_ATOMIC = TREOp(52); // Opening of group
  OP_CLOSE_ATOMIC = TREOp(53); // Closing of group   
User4martin commented 10 months ago

Yes.

Only (?: was changed.

A normal ( must still omit an opcode => so the capture can be saved (and it must support backtracking (in 99% of cases) so it needs recursion).

(?> still produces OP_OPEN_ATOMIC. And that is also still needed (Though there is a limited subset of cases, where it could be optimized...)

EDIT: scratch the last one. This is only the case for {1,1}+.

Alexey-T commented 10 months ago

A normal ( must still omit an opcode

Omit VS Emit. you mean 'emit'.