Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

regexes with many \w and \W are too large for PCRE #114

Closed NightMachinery closed 3 years ago

NightMachinery commented 3 years ago

Using the regex \Wntl.\W\s*\(\)|:\s*alias[^:=]*\s*\Wntl.\W|:\s*alifn[^:=]*\s*\Wntl.\W results in:

ugrep: error: error in regex at position 95506
0-\xbf]|\xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf])
           regular expression is too large___/

This is in PCRE mode; I haven't tested it in the other mode.

genivia-inc commented 3 years ago

If you don't use Unicode, then \w and \W are efficient and compact. With Unicode (the default), however, try to avoid using multiple \w and \W. As your example shows, PCRE may not accept the long patterns that several \w and \W produce.
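
To give a rough sense of why Unicode \w is so much larger than ASCII \w, the sketch below counts the contiguous code point ranges that \w matches in each mode. It uses Python's re module purely as an illustration: Python's definition of \w is not necessarily the same as ugrep's, and word_ranges is a hypothetical helper name, not part of any tool discussed here.

```python
import re
import sys


def word_ranges(flags=0):
    # Collect the contiguous code point ranges that \w matches
    # under the given re flags (illustrative helper, not ugrep code).
    word = re.compile(r"\w", flags)
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if word.fullmatch(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


ascii_ranges = word_ranges(re.ASCII)  # 0-9, A-Z, _, a-z
unicode_ranges = word_ranges()        # every alphanumeric script plus _

print("ASCII \\w ranges:  ", len(ascii_ranges))
print("Unicode \\w ranges:", len(unicode_ranges))
```

The ASCII case has only four ranges, while the Unicode case has hundreds (the exact number depends on the Unicode version of the Python build). Each of those ranges must be spelled out again for every \w and \W in a pattern once the class is expanded.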

The current regex engine also has significant memory requirements in this case: not only is the pattern big, but so is the DFA built from it.

I have been thinking about a change to the regex engine to represent Unicode characters in UTF-16 instead of UTF-8. With UTF-8, the size requirements for \w and \W are huge and the DFA construction takes time. This is a (minor) drawback of the current regex engine. I consider it minor because it only happens when several \w and \W are used in a pattern, as in your example.
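
The error message in the report ends in \xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf], which is what the top of the Unicode range looks like once it is written as UTF-8 byte ranges. The sketch below shows a common range-splitting technique for turning one code point range into such byte-range sequences. It is an illustration under assumptions, not ugrep's actual code: utf8_ranges and to_regex are hypothetical helpers, and surrogates are not excluded, to keep it short.

```python
def utf8_ranges(lo, hi):
    # Yield lists of (low_byte, high_byte) pairs that together match
    # exactly the UTF-8 encodings of code points lo..hi.
    # First split where the encoded length changes (1, 2, 3, 4 bytes).
    for limit in (0x7F, 0x7FF, 0xFFFF):
        if lo <= limit < hi:
            yield from utf8_ranges(lo, limit)
            yield from utf8_ranges(limit + 1, hi)
            return
    if hi <= 0x7F:
        yield [(lo, hi)]
        return
    # Then split until every continuation byte covers a full 6-bit block
    # or is identical in lo and hi.
    for i in (1, 2, 3):
        m = (1 << (6 * i)) - 1
        if lo & ~m != hi & ~m:
            if lo & m != 0:
                yield from utf8_ranges(lo, lo | m)
                yield from utf8_ranges((lo | m) + 1, hi)
                return
            if hi & m != m:
                yield from utf8_ranges(lo, (hi & ~m) - 1)
                yield from utf8_ranges(hi & ~m, hi)
                return
    # lo and hi now differ only in aligned byte positions.
    a = chr(lo).encode("utf-8", "surrogatepass")
    b = chr(hi).encode("utf-8", "surrogatepass")
    yield list(zip(a, b))


def to_regex(seq):
    # Render one byte-range sequence in the style of the error message.
    return "".join(
        f"\\x{a:02x}" if a == b else f"[\\x{a:02x}-\\x{b:02x}]"
        for a, b in seq
    )


supplementary = list(utf8_ranges(0x10000, 0x10FFFF))
for seq in supplementary:
    print(to_regex(seq))
# \xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]
# [\xf1-\xf3][\x80-\xbf][\x80-\xbf][\x80-\xbf]
# \xf4[\x80-\x8f][\x80-\xbf][\x80-\xbf]
```

Even this single clean range needs three alternatives in UTF-8, and ranges with ragged boundaries need more. In UTF-16 the same supplementary range is one sequence, a high surrogate in D800-DBFF followed by a low surrogate in DC00-DFFF, which hints at where the savings from a UTF-16 representation could come from.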

So I will mark this as an enhancement.