Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.63k stars 110 forks source link

Ongoing work on ugrep 3.7 to increase speed further by removing performance bottlenecks #188

Closed genivia-inc closed 2 years ago

genivia-inc commented 2 years ago

I am working on important improvements to the ugrep regex pattern engine for the upcoming 3.7 release to address two performance bottlenecks, both of which are edge cases:

  1. RAM usage for very large -F patterns is high.
  2. some type of regex patterns involving Unicode classes take too much time (in my opinion) to parse and construct as DFAs.

Both are fixed in the 3.7 dev version. Let me explain:

The first issue is due to the implementation's use of a tree structure for "string-like" regex patterns and option -F. This structure can be significantly reduced in size. This improves performance significantly (at least 4x) for a set of 32MB string patterns with options -F and -f.

The second issue is caused by the significant dynamic memory allocation cost of std::set<int>, which is larger than I had anticipated when I implemented the so-called subset construction algorithm. The algorithm starts producing DFA states with hundreds to thousands of NFA positions, thus the cost "explodes" at some point. I'm replacing this part of the regex analysis and subset construction with std::vector<int> and then sort them only when needed at the very end. The dynamic memory allocation overhead with this approach is very low. This yields significant speedups for such "exploding" cases.

Update

To demonstrate the first ugrep speed improvement with a 2MB 1K1.txt pattern file with strings to search a 100MB file enwik8.

GNU grep takes 2.81 seconds:

/usr/bin/time ggrep -F -c -f 1K1.txt enwik8
0
        2.81 real         2.71 user         0.06 sys

Ugrep v3.7 takes 1.56 seconds:

/usr/bin/time ugrep -F -c -f 1K1.txt enwik8
0
        1.56 real         1.33 user         0.21 sys

To give some insights into the second improvement, Unicode character classes such as \X that matches any valid Unicode character (range) produce large and complex DFAs. For example, \X+ looks like this:

image

For other \p{name} classes the results can be a lot more complex. Constructing the DFAs is now done in a few milliseconds, where previous ugrep versions took longer, sometimes much longer.

genivia-inc commented 2 years ago

Ugrep v3.7.0 is released. This version runs several times faster for long and complex regex patterns and for -F string patterns.