Closed genivia-inc closed 10 months ago
π
Let me say, you continue to keep us impressed... sure, there might be one or two cases where ugrep
, in some scenarios, is ever-so-slightly slower (one wonders what's going on with gz compression...), but in this 'game of benchmarks', you get extra points for consistency of overall performance under several very different scenarios.
I was particularly impressed by the speed of your regexp processing! Surely the pcre2jit library is clearly the winner here. One wonders how Boost.regexp would fare in this scenario? In any case, if pcre2jit is that good, why doesn't everybody use it? :-) (The results, after all, speak for themselves...)
Thank you for reading my post and for commenting. I don't expect many will read all these details and find it all that interesting, except maybe those who love this field of work ;-)
I'm not 100% happy with the performance yet and have been thinking about it over 2 years. Since I started implementing new features that people wanted and I also wanted to add, I had put additional performance enhancements on the back burner until later. Some of these regex cases that are left to optimize are a bit "artificial", but nevertheless important enough to optimize as it should.
On your comment about PCRE2/Boost.Regex: note that ugrep's default regex engine is RE/flex, not PCRE2 jit. PCRE2 jit is only used with option -P
for Perl-compatible regex. Boost.Regex is an option if PCRE2 is not installed. Boost.Regex is pretty good too, but there is not a means to perfectly implement searching streaming input since it doesn't fully support partial matching like PCRE2 does (I long time ago I discussed this with the Boost.Regex author, but the full partial matching requirements is still not completed). The default regex engine is purely DFA-based and pretty quick, see also the performance table for matching/scanning (not searching) on the RE/flex project page: https://github.com/Genivia/RE-flex
On your comment about gz: there is no significant difference in performance on the arm64, but the x64 gz-compressed search appears to be slower than expected. It might be the libz version difference or the decompression buffer size is suboptimal or some cache effect. I will look into that.
Edit: needed to clarify "artificial", I mean the cases left to optimize, not the cases benchmarked and added a note about gz.
Oh... I admit that you lost me with some of the multitude of acronyms related to CPU microarchitecture :) I'm from the days when SSE2 was the new kid in the block :) β which means that I'm completely out of the loop and merely browsed very quickly across your post. Nevertheless, I can certainly appreciate the amount of testing data that you have accumulated. Obviously, I haven't reproduced your results :) and therefore take your word on it :)
But I think that's irrelevant; you can always skew benchmarks and statistics to show whatever you wish to show. What matters most here is the order of magnitude in many of the different tests β i.e. measuring 10x the speed difference will be perceptible, no matter what system you're running it on β while a 5-10% difference may just be too close to the overall error margin to be significative. I don't know β for a research paper, I suppose you'd be asked to provide much more than that (and, then again, I might be wrong on that assumption!).
While everybody loves extra featuresΒ β and often these make a difference when choosing one option over another! β in my case I'm much more curious on how far you can push all those optimisations, what kind of limits you'll reach, and if you'll have to start trading off memory vs. CPU (possibly switched via a CLI option...?) in order to squeeze a bit more performance out of it...
Thanks for the clarification about the regex engine you're using (it shows that I really didn't bother to read the source code, doesn't it? π ). In fact, I had never heard of RE/flex before, and I wonder β is it your own invention? If so, your results are certainly so impressive that I cannot but wonder why it hasn't been universally adopted everywhere :) Faster than RE2? Ouchie β ok, so that is really fast (because, well, RE2 cheats in order to get good results β by limiting the kind of regexs you can construct, focusing on the essential only... which always surprises me because I'm 'expecting' a larger syntax). Google must hate you :) In fact, I was going to ask if you did test RE2, but from the RE/flex README, it's clear that you did, and found that RE/flex is even faster than that.
I'm curious why you need to point out that RE/flex is "purely DFA-based". Never having to implement a regex parser (gosh, I barely manage to get some expressions working on regex101.com...), I erroneously thought that all were DFAs (wasn't that what Larry used originally?)... well, possibly with the exception of RE2, where it's clear they did a lot of cheating in order to boost performance. I guess that β again β my ignorance in such matters is legendary :) And to think that I used to be a pretty reasonable student attending the compilation classes, and liked them a lot β as well as playing around with yacc and lex, before even Stallmann wrote bison and flex. Gosh, I guess that I'm just too old and too outdated; but I thank you for your thorough explanations, I certainly learned quite a lot today!
As for gz
's performance difference... well, it's certainly intriguing and worth looking into. I always assumed that whoever did the library for x64 just cross-compiled it to arm64, and never bothered much with the details, leaving the optimisation to the compiler. But I now realise that, when talking about achieving maximum performance, it might require some real knowledge of the underlying architecture in order to squeeze every bit of performance at the machine code level... and that might be lacking on the x64 version (while the arm64 version might take into account certain characteristics of the ARM processor and therefore work much faster).
The latest ugrep v4.1 benchmark is now online: https://github.com/Genivia/ugrep-benchmarks
@GwynethLlewelyn I can certainly appreciate the amount of testing data that you have accumulated. Obviously, I haven't reproduced your results :) and therefore take your word on it :)
Thank you for your feedback and comments. I cannot possibly test on every possible contemporary computer architecture, so I picked MacBook Pro x64 with AVX2 and LPDDR4 and M1 Pro 10 core AArch64 with LPDDR5 as representative for common popular x64 and ARM64 machines. This will give a fair performance comparison of grep tools. Rest assured, these numbers are not skewed and are reported "as is". Others should be able to reproduce these on similar machines within a reasonably small error margin.
But I think that's irrelevant; you can always skew benchmarks and statistics to show whatever you wish to show. What matters most here is the order of magnitude in many of the different tests β i.e. measuring 10x the speed difference will be perceptible, no matter what system you're running it on β while a 5-10% difference may just be too close to the overall error margin to be significative. I don't know β for a research paper, I suppose you'd be asked to provide much more than that (and, then again, I might be wrong on that assumption!).
I agree that 10% faster or slower is not significant and usually falls well within the margin of system variance anyway. I didn't pick specific patterns for which ugrep specifically runs faster. For the time being I did avoid patterns similar to \w+foo
that I know are slow because of backtracking, see #288 for work in progress.
Benchmarking grep tools can never be exhaustive. The space is just too large to even contemplate that. There will always be some regex patterns and use cases for which one tool is faster than another due to the differences in matching algorithms and because heuristics are used. Recursive searches with threads may also differ due to cashing effects and the chance when a thread picks up a job and completes it to report results. Load balancing is tricky, which I've done with weighted round-robin using "job stealing" strategy when worker threads become idle. Nevertheless, perfect load balancing is an NP-complete problem, so heuristics must be used, which have their own peculiarities.
Ugrep 4.1 is most likely (pretty much always...?) faster than other fast grep tools, when #284, #287, #288 are applied and a few additional performance tweaks are made, such as one worker thread less to better balance the workers in recursive searches Note that
-ABC
is not as fast yet (unoptimized, see notes below) and leading wildcard patterns such as\w+foo
are not optimized at all yet, see #288.In addition to the OpenSSL 3.1.2 source code repo to compare recursive search speeds, a larger source code repo was added to the updated benchmark tests (swift-swift-5.8.1-RELEASE).
Since I am on a break, I don't have the means and time to release an update until about 3 weeks. Then you can try this yourself by downloading and running the ugrep-benchmarks scripts.
performance reports
Updated benchmarks are automatically generated and published when a new version of ugrep is released Last updated: 2023-08-29
performance report x64
performance report arm64
Intel machine:
ARM64 machine:
the
install.sh
script requires the following compression utilities:WARNING performance results are meaningless when the host machine executes other tasks that load the CPU; quit all running applications first and check for running background processes (with e.g.
top
) before running./bench.sh
important notes:
[][a-z]
in URL pattern testing,[\[\]a-z]
works but this is not compliant so GNU grep fails)-I
to skip binary files; we include option-I
in recursive searches for a fair performance comparison-c
, whereas grep and ugrep output 0 matches as expected (note: ugrep option-m1,
skips zero matches but is not used in this benchmark)-z
is more powerful than just decompressing a single file to search, it searches nested archives up to nesting depth--zmax
(1 by default) by spawning one or more decompression theads; none of the other grep tools can search compressed tar files, nested archives and compressed files within archives-ABC
are a bit sub-optimal because of double buffering (input buffer and context buffer), which will be optimized soon in a future releaseperformance report x64
found ugrep 1172944 byte executable located at /usr/local/bin/ugrep
found rg 6075312 byte executable located at /opt/local/bin/rg
found ag 84764 byte executable located at /usr/local/bin/ag
found ggrep 263184 byte executable located at /usr/local/bin/ggrep
results for large text file search
grepping
rol
elapsed real time (s)grepping
the
elapsed real time (s)grepping
cycles|semigroups
elapsed real time (s)grepping
ab(cd?)?
elapsed real time (s)grepping
ro[a-z]*ds
elapsed real time (s)grepping
(19|20)[0-9]{2}/(0[1-9]|1[012])|(0[1-9]|1[012])/(19|20)[0-9]{2}
elapsed real time (s)grepping
(https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{1,253}\.[-a-zA-Z0-9]{2,}\.[][a-zA-Z0-9()@:%_+.~#?&/=\-]+
elapsed real time (s)grepping
^={2,4}[^=].*
elapsed real time (s)grepping
''
elapsed real time (s)grepping
^$
elapsed real time (s)results for large text file search for words from files
grepping
-fwords/1.txt
elapsed real time (s)grepping
-fwords/2.txt
elapsed real time (s)grepping
-fwords/3.txt
elapsed real time (s)grepping
-fwords/4.txt
elapsed real time (s)results for large text file search with formatted output
grepping
Sherlock|Holmes
elapsed real time (s)results for large text file search with replaced output
grepping
flop
elapsed real time (s)results for large text file search with context
grepping
^$
elapsed real time (s)grepping
Sherlock|Holmes
elapsed real time (s)results for OpenSSL source code repo directory search
grepping
FIXME|TODO
elapsed real time (s)grepping
char|int|long|size_t|void
elapsed real time (s)grepping
ssl-?3(\.[0-9]+)?
elapsed real time (s)results for Swift source code repo directory search
grepping
_(RUN|LIB|NAM)[A-Z_]+
elapsed real time (s)grepping
String|Int|Double|Array|Dictionary
elapsed real time (s)grepping
(class|struct)\sS[a-z]+T
elapsed real time (s)grepping
for\s[a-z]+\sin
elapsed real time (s)results for bz2 compressed large text file search
grepping
landsnail
elapsed real time (s)results for gz compressed large text file search
grepping
landsnail
elapsed real time (s)results for lz4 compressed large text file search
grepping
landsnail
elapsed real time (s)results for xz compressed large text file search
grepping
landsnail
elapsed real time (s)results for zstd compressed large text file search
grepping
landsnail
elapsed real time (s)results for zip archived repo search
grepping
FIXME|TODO
elapsed real time (s)results for tar archived repo search
grepping
FIXME|TODO
elapsed real time (s)results for compressed tarball search
grepping
FIXME|TODO
elapsed real time (s)performance report arm64
found ugrep 1091554 byte executable located at /usr/local/bin/ugrep
found rg 5571088 byte executable located at /opt/local/bin/rg
found ag 111344 byte executable located at /opt/homebrew/bin/ag
found ggrep 266352 byte executable located at /opt/homebrew/bin/ggrep
results for large text file search
grepping
rol
elapsed real time (s)grepping
the
elapsed real time (s)grepping
cycles|semigroups
elapsed real time (s)grepping
ab(cd?)?
elapsed real time (s)grepping
ro[a-z]*ds
elapsed real time (s)grepping
(19|20)[0-9]{2}/(0[1-9]|1[012])|(0[1-9]|1[012])/(19|20)[0-9]{2}
elapsed real time (s)grepping
(https?://|www\.)[-a-zA-Z0-9@:%._+~#=]{1,253}\.[-a-zA-Z0-9]{2,}\.[][a-zA-Z0-9()@:%_+.~#?&/=\-]+
elapsed real time (s)grepping
^={2,4}[^=].*
elapsed real time (s)grepping
''
elapsed real time (s)grepping
^$
elapsed real time (s)results for large text file search for words from files
grepping
-fwords/1.txt
elapsed real time (s)grepping
-fwords/2.txt
elapsed real time (s)grepping
-fwords/3.txt
elapsed real time (s)grepping
-fwords/4.txt
elapsed real time (s)results for large text file search with formatted output
grepping
Sherlock|Holmes
elapsed real time (s)results for large text file search with replaced output
grepping
flop
elapsed real time (s)results for large text file search with context
grepping
^$
elapsed real time (s)grepping
Sherlock|Holmes
elapsed real time (s)results for OpenSSL source code repo directory search
grepping
FIXME|TODO
elapsed real time (s)grepping
char|int|long|size_t|void
elapsed real time (s)grepping
ssl-?3(\.[0-9]+)?
elapsed real time (s)results for Swift source code repo directory search
grepping
_(RUN|LIB|NAM)[A-Z_]+
elapsed real time (s)grepping
String|Int|Double|Array|Dictionary
elapsed real time (s)grepping
(class|struct)\sS[a-z]+T
elapsed real time (s)grepping
for\s[a-z]+\sin
elapsed real time (s)results for bz2 compressed large text file search
grepping
landsnail
elapsed real time (s)results for gz compressed large text file search
grepping
landsnail
elapsed real time (s)results for lz4 compressed large text file search
grepping
landsnail
elapsed real time (s)results for xz compressed large text file search
grepping
landsnail
elapsed real time (s)results for zstd compressed large text file search
grepping
landsnail
elapsed real time (s)results for zip archived repo search
grepping
FIXME|TODO
elapsed real time (s)results for tar archived repo search
grepping
FIXME|TODO
elapsed real time (s)results for compressed tarball search
grepping
FIXME|TODO
elapsed real time (s)