Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.94k stars 553 forks source link

RegEx pointless backtracking between non-overlapping sub-patterns #22636

Open MasterInQuestion opened 2 weeks ago

MasterInQuestion commented 2 weeks ago

    https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b

    Test case: [[

> perl -e "'deEfficiency' =~ /d?(?:eficiency|[eE]fficiency)/;" -M"re"="debug"
Compiling REx "d?(?:eficiency|[eE]fficiency)"
Final program:
   1: CURLY{0,1} (7)
   5:   EXACT <d> (0)
   7: BRANCH (buf:0/0) (13)
   9:   EXACT <eficiency> (22)
  13: BRANCH (buf:0/0) (FAIL)
  15:   ANYOFM[Ee] (17)
  17:   EXACT <fficiency> (22)
  21: TAIL (22)
  22: END (0)
minlen 9
Matching REx "d?(?:eficiency|[eE]fficiency)" against "deEfficiency"
   0 <> <deEfficien>         |   0| 1:CURLY{0,1}(7)
                             |   0| EXACT <d> can match 1 times out of 1...
   1 <d> <eEfficienc>        |   1|  7:BRANCH (buf:0/0)(13)
   1 <d> <eEfficienc>        |   2|   9:EXACT <eficiency>(22)
                             |   2|   failed...
   1 <d> <eEfficienc>        |   1|  13:BRANCH (buf:0/0)(21)
   1 <d> <eEfficienc>        |   2|   15:ANYOFM[Ee](17)
   2 <de> <Efficiency>       |   2|   17:EXACT <fficiency>(22)
                             |   2|   failed...
                             |   1|  BRANCH failed...
   0 <> <deEfficien>         |   1|  7:BRANCH (buf:0/0)(13)
   0 <> <deEfficien>         |   2|   9:EXACT <eficiency>(22)
                             |   2|   failed...
   0 <> <deEfficien>         |   1|  13:BRANCH (buf:0/0)(21)
   0 <> <deEfficien>         |   2|   15:ANYOFM[Ee](17)
                             |   2|   failed...
                             |   1|  BRANCH failed...
                             |   0| failed...
   1 <d> <eEfficienc>        |   0| 1:CURLY{0,1}(7)
                             |   0| EXACT <d> can match 0 times out of 1...
   1 <d> <eEfficienc>        |   1|  7:BRANCH (buf:0/0)(13)
   1 <d> <eEfficienc>        |   2|   9:EXACT <eficiency>(22)
                             |   2|   failed...
   1 <d> <eEfficienc>        |   1|  13:BRANCH (buf:0/0)(21)
   1 <d> <eEfficienc>        |   2|   15:ANYOFM[Ee](17)
   2 <de> <Efficiency>       |   2|   17:EXACT <fficiency>(22)
                             |   2|   failed...
                             |   1|  BRANCH failed...
                             |   0| failed...
   2 <de> <Efficiency>       |   0| 1:CURLY{0,1}(7)
                             |   0| EXACT <d> can match 0 times out of 1...
   2 <de> <Efficiency>       |   1|  7:BRANCH (buf:0/0)(13)
   2 <de> <Efficiency>       |   2|   9:EXACT <eficiency>(22)
                             |   2|   failed...
   2 <de> <Efficiency>       |   1|  13:BRANCH (buf:0/0)(21)
   2 <de> <Efficiency>       |   2|   15:ANYOFM[Ee](17)
   3 <deE> <fficiency>       |   2|   17:EXACT <fficiency>(22)
  12 <deEfficiency> <>       |   2|   22:END(0)
Match successful!
Freeing REx: "d?(?:eficiency|[eE]fficiency)"

> perl -e "'deEfficiency' =~ /d?+(?:eficiency|[eE]fficiency)/;" -M"re"="debug"
Compiling REx "d?+(?:eficiency|[eE]fficiency)"
Final program:
   1: SUSPEND (11)
   3:   CURLY{0,1} (9)
   7:     EXACT <d> (0)
   9:   SUCCEED (0)
  10: TAIL (11)
  11: BRANCH (buf:0/0) (17)
  13:   EXACT <eficiency> (26)
  17: BRANCH (buf:0/0) (FAIL)
  19:   ANYOFM[Ee] (21)
  21:   EXACT <fficiency> (26)
  25: TAIL (26)
  26: END (0)
minlen 9
Matching REx "d?+(?:eficiency|[eE]fficiency)" against "deEfficiency"
   0 <> <deEfficien>         |   0| 1:SUSPEND(11)
   0 <> <deEfficien>         |   1|  3:CURLY{0,1}(9)
                             |   1|  EXACT <d> can match 1 times out of 1...
   1 <d> <eEfficienc>        |   2|   9:SUCCEED(0)
                             |   2|   SUCCEED: subpattern success...
   1 <d> <eEfficienc>        |   0| 11:BRANCH (buf:0/0)(17)
   1 <d> <eEfficienc>        |   1|  13:EXACT <eficiency>(26)
                             |   1|  failed...
   1 <d> <eEfficienc>        |   0| 17:BRANCH (buf:0/0)(25)
   1 <d> <eEfficienc>        |   1|  19:ANYOFM[Ee](21)
   2 <de> <Efficiency>       |   1|  21:EXACT <fficiency>(26)
                             |   1|  failed...
                             |   0| BRANCH failed...
   1 <d> <eEfficienc>        |   0| 1:SUSPEND(11)
   1 <d> <eEfficienc>        |   1|  3:CURLY{0,1}(9)
                             |   1|  EXACT <d> can match 0 times out of 1...
   1 <d> <eEfficienc>        |   2|   9:SUCCEED(0)
                             |   2|   SUCCEED: subpattern success...
   1 <d> <eEfficienc>        |   0| 11:BRANCH (buf:0/0)(17)
   1 <d> <eEfficienc>        |   1|  13:EXACT <eficiency>(26)
                             |   1|  failed...
   1 <d> <eEfficienc>        |   0| 17:BRANCH (buf:0/0)(25)
   1 <d> <eEfficienc>        |   1|  19:ANYOFM[Ee](21)
   2 <de> <Efficiency>       |   1|  21:EXACT <fficiency>(26)
                             |   1|  failed...
                             |   0| BRANCH failed...
   2 <de> <Efficiency>       |   0| 1:SUSPEND(11)
   2 <de> <Efficiency>       |   1|  3:CURLY{0,1}(9)
                             |   1|  EXACT <d> can match 0 times out of 1...
   2 <de> <Efficiency>       |   2|   9:SUCCEED(0)
                             |   2|   SUCCEED: subpattern success...
   2 <de> <Efficiency>       |   0| 11:BRANCH (buf:0/0)(17)
   2 <de> <Efficiency>       |   1|  13:EXACT <eficiency>(26)
                             |   1|  failed...
   2 <de> <Efficiency>       |   0| 17:BRANCH (buf:0/0)(25)
   2 <de> <Efficiency>       |   1|  19:ANYOFM[Ee](21)
   3 <deE> <fficiency>       |   1|  21:EXACT <fficiency>(26)
  12 <deEfficiency> <>       |   1|  26:END(0)
Match successful!
Freeing REx: "d?+(?:eficiency|[eE]fficiency)"

]]

    Tested against Perl 5.40.     Been like this years ago.

iabyn commented 2 weeks ago

On Tue, Oct 01, 2024 at 01:06:26AM -0700, MasterInQuestion wrote:

    https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b

    Test case: [[


> perl -e "'deEfficiency' =~ /d?(?:eficiency|[eE]fficiency)/;" -M"re"="debug"
Compiling REx "d?(?:eficiency|[eE]fficiency)"
Final program:
   1: CURLY{0,1} (7)
   5:   EXACT <d> (0)
   7: BRANCH (buf:0/0) (13)
   9:   EXACT <eficiency> (22)
  13: BRANCH (buf:0/0) (FAIL)
  15:   ANYOFM[Ee] (17)
  17:   EXACT <fficiency> (22)
  21: TAIL (22)
  22: END (0)
minlen 9
Matching REx "d?(?:eficiency|[eE]fficiency)" against "deEfficiency"
   0 <> <deEfficien>         |   0| 1:CURLY{0,1}(7)

[snip 2 sets of debugging output]

So what exact behaviour do you consider sub-optimal, and what exact behaviour would you like to see in its place?

-- That he said that that that that is is is debatable, is debatable.

demerphq commented 2 weeks ago

This code is a bit contrived. By starting the alternation with a character class you disable some of the optimizations that the current regex is capable of making. Currently perl's regex engine doesn't know how to unroll character-classes in alternations. If you manually unroll the char-class then you end up with this:

 perl -Mre=debug -e'"deEfficiency" =~ /d?(?:eficiency|efficiency|Efficiency)/;'
Compiling REx "d?(?:eficiency|efficiency|Efficiency)"
Final program:
   1: CURLY{0,1} (5)
   3:   EXACT <d> (0)
   5: TRIEC-EXACT[Ee] (21)
      <eficiency> 
      <efficiency> 
      <Efficiency> 
  21: END (0)
minlen 9 
Matching REx "d?(?:eficiency|efficiency|Efficiency)" against "deEfficiency"
   0 <> <deEfficien>         |   0| 1:CURLY{0,1}(5)
                             |   0| EXACT <d> can match 1 times out of 1...
   1 <d> <eEfficienc>        |   1|  5:TRIEC-EXACT[Ee](21)
   1 <d> <eEfficienc>        |   1|  TRIE: State:    1 Accepted: N TRIE: Charid:  1 CP:  65 After State:    2
   2 <de> <Efficiency>       |   1|  TRIE: State:    2 Accepted: N TRIE: Charid:  7 CP:  45 After State:    0
                             |   1|  failed...
   0 <> <deEfficien>         |   1|  5:TRIEC-EXACT[Ee](21)
                             |   1|  TRIE: failed to match trie start class...
                             |   0| failed...
   1 <d> <eEfficienc>        |   0| 1:CURLY{0,1}(5)
                             |   0| EXACT <d> can match 0 times out of 1...
   1 <d> <eEfficienc>        |   1|  5:TRIEC-EXACT[Ee](21)
   1 <d> <eEfficienc>        |   1|  TRIE: State:    1 Accepted: N TRIE: Charid:  1 CP:  65 After State:    2
   2 <de> <Efficiency>       |   1|  TRIE: State:    2 Accepted: N TRIE: Charid:  7 CP:  45 After State:    0
                             |   1|  failed...
                             |   0| failed...
   2 <de> <Efficiency>       |   0| 1:CURLY{0,1}(5)
                             |   0| EXACT <d> can match 0 times out of 1...
   2 <de> <Efficiency>       |   1|  5:TRIEC-EXACT[Ee](21)
   2 <de> <Efficiency>       |   1|  TRIE: State:    1 Accepted: N TRIE: Charid:  7 CP:  45 After State:   13
   3 <deE> <fficiency>       |   1|  TRIE: State:   13 Accepted: N TRIE: Charid:  2 CP:  66 After State:   14
   4 <deEf> <ficiency>       |   1|  TRIE: State:   14 Accepted: N TRIE: Charid:  2 CP:  66 After State:   15
   5 <deEff> <iciency>       |   1|  TRIE: State:   15 Accepted: N TRIE: Charid:  3 CP:  69 After State:   16
   6 <deEffi> <ciency>       |   1|  TRIE: State:   16 Accepted: N TRIE: Charid:  4 CP:  63 After State:   17
   7 <deEffic> <iency>       |   1|  TRIE: State:   17 Accepted: N TRIE: Charid:  3 CP:  69 After State:   18
   8 <deEffici> <ency>       |   1|  TRIE: State:   18 Accepted: N TRIE: Charid:  1 CP:  65 After State:   19
   9 <deEfficie> <ncy>       |   1|  TRIE: State:   19 Accepted: N TRIE: Charid:  5 CP:  6e After State:   1a
  10 <deEfficien> <cy>       |   1|  TRIE: State:   1a Accepted: N TRIE: Charid:  4 CP:  63 After State:   1b
  11 <deEfficienc> <y>       |   1|  TRIE: State:   1b Accepted: N TRIE: Charid:  6 CP:  79 After State:   1c
  12 <deEfficiency> <>       |   1|  TRIE: State:   1c Accepted: Y TRIE: Charid:  0 CP:   0 After State:    0
                             |   1|  TRIE: got 1 possible matches
                             |   1|  TRIE matched word #3, continuing
                             |   1|  TRIE: only one match left, short-circuiting: #3 <Efficiency>
  12 <deEfficiency> <>       |   1|  21:END(0)
Match successful!
Freeing REx: "d?(?:eficiency|efficiency|Efficiency)"

which does not backtrack and uses a DFA like approach. (A TRIE is just a special case of a DFA in this context.)

MasterInQuestion commented 2 weeks ago

    Backtracking past "d" for "e" alike would be entirely pointless: for non-overlapping impossible to match.     Alike in linked past "-" (or ".") for "\d" etc.

    Off-Topic:     What does the number in "re" debug output mean? [[ Final program:    1: ... (7)    5: ... ]]     I noted some the numbers changed between the old version's output and current.