Closed NightMachinery closed 3 years ago
If you don't use Unicode, then `\w` and `\W` are efficient and compact. But with Unicode (the default), try to avoid using multiple `\w` and `\W`. PCRE may not accept the long patterns produced with several `\w` and `\W`, as your example shows.
The current regex engine also has significant memory requirements in this case: it is not only the pattern that is big, but the DFA built from it as well.
I have been thinking about a change to the regex engine to represent Unicode characters in UTF-16 instead of UTF-8. With UTF-8, the size requirements for `\w` and `\W` are huge and the DFA construction takes time. This is a (minor) drawback of the current regex engine. I consider it minor because it only happens when several `\w` and `\W` are used in one pattern, as in your example.
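To give a rough feel for why a Unicode `\w` is so much bigger than an ASCII one, here is a minimal sketch in Python. It counts how many code points match `\w` and groups them by the length of their UTF-8 encoding. This uses Python's own `re` module and its definition of `\w`, which is an assumption for illustration only: it is not the engine discussed here and its `\w` set may differ from PCRE's. The function name `count_word_chars` is made up for this example.

```python
import re
import sys
from collections import Counter

# Python's \w for str patterns is Unicode-aware by default;
# re.ASCII restricts it to [a-zA-Z0-9_].
WORD = re.compile(r"\w")
ASCII_WORD = re.compile(r"\w", re.ASCII)


def count_word_chars():
    """Count \\w code points, grouped by UTF-8 encoded length (hypothetical helper)."""
    ascii_count = 0
    by_utf8_len = Counter()
    for cp in range(sys.maxunicode + 1):
        # Surrogates cannot be encoded as UTF-8, so skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        if ASCII_WORD.match(ch):
            ascii_count += 1
        if WORD.match(ch):
            by_utf8_len[len(ch.encode("utf-8"))] += 1
    return ascii_count, by_utf8_len


ascii_count, by_utf8_len = count_word_chars()
print("ASCII \\w code points:  ", ascii_count)
print("Unicode \\w code points:", sum(by_utf8_len.values()))
for length in sorted(by_utf8_len):
    print(f"  encoded as {length} UTF-8 byte(s): {by_utf8_len[length]}")
```

The exact Unicode totals depend on the Unicode version bundled with the Python build, so no expected output is given. The point is only that the Unicode set is several orders of magnitude larger than the ASCII set and is spread over 1- to 4-byte encodings, each of which needs its own byte-level ranges when a pattern is expanded for UTF-8 input.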
So I will mark this as an enhancement.
Using the regex
\Wntl.\W\s*\(\)|:\s*alias[^:=]*\s*\Wntl.\W|:\s*alifn[^:=]*\s*\Wntl.\W
results in:

This is in PCRE mode; I haven't tested it with the other mode.