Open MasterInQuestion opened 2 weeks ago
On Tue, Oct 01, 2024 at 01:06:26AM -0700, MasterInQuestion wrote:
https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b
Test case: [[
> perl -e "'deEfficiency' =~ /d?(?:eficiency|[eE]fficiency)/;" -M"re"="debug" Compiling REx "d?(?:eficiency|[eE]fficiency)" Final program: 1: CURLY{0,1} (7) 5: EXACT <d> (0) 7: BRANCH (buf:0/0) (13) 9: EXACT <eficiency> (22) 13: BRANCH (buf:0/0) (FAIL) 15: ANYOFM[Ee] (17) 17: EXACT <fficiency> (22) 21: TAIL (22) 22: END (0) minlen 9 Matching REx "d?(?:eficiency|[eE]fficiency)" against "deEfficiency" 0 <> <deEfficien> | 0| 1:CURLY{0,1}(7)
[snip 2 sets of debugging output]
So what exact behaviour do you consider sub-optimal, and what exact behaviour would you like to see in its place?
-- That he said that that that that is is is debatable, is debatable.
This code is a bit contrived. By starting the alternation with a character class you disable some of the optimizations that the current regex is capable of making. Currently perl's regex engine doesn't know how to unroll character-classes in alternations. If you manually unroll the char-class then you end up with this:
perl -Mre=debug -e'"deEfficiency" =~ /d?(?:eficiency|efficiency|Efficiency)/;'
Compiling REx "d?(?:eficiency|efficiency|Efficiency)"
Final program:
1: CURLY{0,1} (5)
3: EXACT <d> (0)
5: TRIEC-EXACT[Ee] (21)
<eficiency>
<efficiency>
<Efficiency>
21: END (0)
minlen 9
Matching REx "d?(?:eficiency|efficiency|Efficiency)" against "deEfficiency"
0 <> <deEfficien> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 1 times out of 1...
1 <d> <eEfficienc> | 1| 5:TRIEC-EXACT[Ee](21)
1 <d> <eEfficienc> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 1 CP: 65 After State: 2
2 <de> <Efficiency> | 1| TRIE: State: 2 Accepted: N TRIE: Charid: 7 CP: 45 After State: 0
| 1| failed...
0 <> <deEfficien> | 1| 5:TRIEC-EXACT[Ee](21)
| 1| TRIE: failed to match trie start class...
| 0| failed...
1 <d> <eEfficienc> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 0 times out of 1...
1 <d> <eEfficienc> | 1| 5:TRIEC-EXACT[Ee](21)
1 <d> <eEfficienc> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 1 CP: 65 After State: 2
2 <de> <Efficiency> | 1| TRIE: State: 2 Accepted: N TRIE: Charid: 7 CP: 45 After State: 0
| 1| failed...
| 0| failed...
2 <de> <Efficiency> | 0| 1:CURLY{0,1}(5)
| 0| EXACT <d> can match 0 times out of 1...
2 <de> <Efficiency> | 1| 5:TRIEC-EXACT[Ee](21)
2 <de> <Efficiency> | 1| TRIE: State: 1 Accepted: N TRIE: Charid: 7 CP: 45 After State: 13
3 <deE> <fficiency> | 1| TRIE: State: 13 Accepted: N TRIE: Charid: 2 CP: 66 After State: 14
4 <deEf> <ficiency> | 1| TRIE: State: 14 Accepted: N TRIE: Charid: 2 CP: 66 After State: 15
5 <deEff> <iciency> | 1| TRIE: State: 15 Accepted: N TRIE: Charid: 3 CP: 69 After State: 16
6 <deEffi> <ciency> | 1| TRIE: State: 16 Accepted: N TRIE: Charid: 4 CP: 63 After State: 17
7 <deEffic> <iency> | 1| TRIE: State: 17 Accepted: N TRIE: Charid: 3 CP: 69 After State: 18
8 <deEffici> <ency> | 1| TRIE: State: 18 Accepted: N TRIE: Charid: 1 CP: 65 After State: 19
9 <deEfficie> <ncy> | 1| TRIE: State: 19 Accepted: N TRIE: Charid: 5 CP: 6e After State: 1a
10 <deEfficien> <cy> | 1| TRIE: State: 1a Accepted: N TRIE: Charid: 4 CP: 63 After State: 1b
11 <deEfficienc> <y> | 1| TRIE: State: 1b Accepted: N TRIE: Charid: 6 CP: 79 After State: 1c
12 <deEfficiency> <> | 1| TRIE: State: 1c Accepted: Y TRIE: Charid: 0 CP: 0 After State: 0
| 1| TRIE: got 1 possible matches
| 1| TRIE matched word #3, continuing
| 1| TRIE: only one match left, short-circuiting: #3 <Efficiency>
12 <deEfficiency> <> | 1| 21:END(0)
Match successful!
Freeing REx: "d?(?:eficiency|efficiency|Efficiency)"
which does not backtrack and uses a DFA like approach. (A TRIE is just a special case of a DFA in this context.)
Backtracking past "d" for "e" alike would be entirely pointless: for non-overlapping impossible to match. Alike in linked past "-" (or ".") for "\d" etc.
Off-Topic: What does the number in "re" debug output mean? [[ Final program: 1: ... (7) 5: ... ]] I noted some the numbers changed between the old version's output and current.
https://github.com/MasterInQuestion/coordTransform/commit/dabc782b3e200d81210032f4dc6222b344e13d0b
Test case: [[
]]
Tested against Perl 5.40. Been like this years ago.