regexes with many \w and \W are too large for PCRE

Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more

BSD 3-Clause "New" or "Revised" License


If you don't use Unicode then \w and \W are efficient and compact. But with Unicode (default), try to avoid using multiple \w and \W. PCRE may not accept the long patterns produced with several \w and \W as your example shows.

There is also the significant memory requirement of the current regex engine to consider. It is not only the pattern that is big; the DFA constructed from it is big as well.

I have been thinking about changing the regex engine to represent Unicode characters in UTF-16 instead of UTF-8. With UTF-8, the size requirements for \w and \W are huge and DFA construction takes time. This is a (minor) drawback of the current regex engine. I consider it minor because it only happens when several \w and \W are used in one pattern, as in your example.
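To give a rough feel for why Unicode \w is so much larger than ASCII \w, here is a small Python sketch. It uses Python's own `re` definition of \w as a stand-in, which is an assumption: ugrep's Unicode tables and its actual pattern expansion are likely to differ in detail. The helper names `word_ranges` and `count_by_utf8_length` are made up for this illustration. The sketch counts the contiguous code point ranges that \w covers and groups them by UTF-8 encoded length, since each range has to be expressed as byte sequences in a UTF-8 based engine.

```python
import re
import sys


def word_ranges(flags=0):
    """Return the contiguous (lo, hi) code point ranges matched by \\w.

    Uses Python's re module as a stand-in for a Unicode-aware engine;
    pass re.ASCII to get the non-Unicode definition of \\w instead.
    """
    pattern = re.compile(r"\w", flags)
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if pattern.fullmatch(chr(cp)):
            if start is None:
                start = cp          # a new range begins here
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


def count_by_utf8_length(ranges):
    """Count how many ranges overlap each UTF-8 encoded length (1 to 4 bytes)."""
    bounds = [
        (0x0000, 0x007F, 1),
        (0x0080, 0x07FF, 2),
        (0x0800, 0xFFFF, 3),
        (0x10000, 0x10FFFF, 4),
    ]
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for lo, hi in ranges:
        for band_lo, band_hi, nbytes in bounds:
            if lo <= band_hi and hi >= band_lo:
                counts[nbytes] += 1
    return counts


if __name__ == "__main__":
    ascii_ranges = word_ranges(re.ASCII)
    unicode_ranges = word_ranges()
    print("ASCII \\w ranges:  ", len(ascii_ranges))
    print("Unicode \\w ranges:", len(unicode_ranges))
    print("Unicode ranges by UTF-8 length:", count_by_utf8_length(unicode_ranges))
```

Every multi-byte range must be spelled out as alternations of byte sequences, and each additional \w or \W in a pattern repeats that expansion, which fits the pattern and DFA growth described above.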

So I will mark this as an enhancement.

regexes with many \w and \W are too large for PCRE #114