Genivia / ugrep

NEW ugrep 6.4: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

-J 1 gives the same result as -J 16 #4

Closed. rbnor closed this issue 4 years ago

rbnor commented 4 years ago

Ubuntu 16.04. Installed as per your latest guide.

time rg -j 1 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[1,2].log.gz | wc -l
1000000

real    0m0.523s
user    0m0.536s
sys     0m0.428s

time rg -j 16 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[1,2].log.gz | wc -l
1000000

real    0m0.364s
user    0m0.480s
sys     0m0.520s

time ugrep -J 16 -r -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[1,2].log.gz | wc -l
1000000

real    0m2.267s
user    0m2.112s
sys     0m0.492s

time ugrep -J 1 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[1,2].log.gz | wc -l
1000000

real    0m2.308s
user    0m2.128s
sys     0m0.560s

rbnor commented 4 years ago

Performance counter stats for 'ugrep -J 16 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.1.log.gz ./2019-01-1/2019-01-1.conn.2.log.gz':

   2251.037773      task-clock (msec)         #    1.000 CPUs utilized
             5      context-switches          #    0.002 K/sec
             2      cpu-migrations            #    0.001 K/sec
           366      page-faults               #    0.163 K/sec
 5,471,050,122      cycles                    #    2.430 GHz                      (50.04%)
             0      stalled-cycles-frontend   #    0.00% frontend cycles idle     (49.89%)
             0      stalled-cycles-backend    #    0.00% backend  cycles idle     (50.17%)
             0      instructions              #    0.00  insns per cycle          (50.44%)
             0      branches                  #    0.000 K/sec                    (50.37%)
             0      branch-misses             #    0.00% of all branches          (50.28%)

   2.251656870 seconds time elapsed
rbnor commented 4 years ago

It does spawn more threads, but it never uses above 100% accumulated over all the threads, so it seems artificially capped? So utilization of the available resources is very low. Now testing on the same unzipped files, and the case is the same. Testing with -J 12 it's clear that it spreads the load across the first 7 threads, totaling 100%, with the last 5 being absolutely idle. Tested with 1000 unzipped test files of the same type.

perf stat ugrep -J 12 -F 10.0.0.1 ./rbergh/dir/* | wc -l

  258711.305195      task-clock (msec)         #    1.004 CPUs utilized
           2,486      context-switches          #    0.010 K/sec
             155      cpu-migrations            #    0.001 K/sec
         261,697      page-faults               #    0.001 M/sec
 599,064,715,787      cycles                    #    2.316 GHz                      (49.99%)
               0      stalled-cycles-frontend   #    0.00% frontend cycles idle     (49.99%)
               0      stalled-cycles-backend    #    0.00% backend  cycles idle     (49.99%)
               0      instructions              #    0.00  insns per cycle          (50.03%)
               0      branches                  #    0.000 K/sec                    (50.02%)

             0      branch-misses             #    0.00% of all branches          (50.02%)

 257.796857184 seconds time elapsed

For comparison:

  181422.011667      task-clock (msec)         #    2.722 CPUs utilized
         810,572      context-switches          #    0.004 M/sec
           1,645      cpu-migrations            #    0.009 K/sec
       1,050,283      page-faults               #    0.006 M/sec
 433,555,039,396      cycles                    #    2.390 GHz                      (50.02%)
               0      stalled-cycles-frontend   #    0.00% frontend cycles idle     (50.03%)
               0      stalled-cycles-backend    #    0.00% backend  cycles idle     (50.03%)
               0      instructions              #    0.00  insns per cycle          (49.99%)
               0      branches                  #    0.000 K/sec                    (50.00%)
               0      branch-misses             #    0.00% of all branches          (50.00%)

  66.653933530 seconds time elapsed

500500000

find ./rbergh/dir/ -name '' -print0 | perf stat xargs -P12 -0 -I {} sh -c 'cat {} | grep -F 10.0.0.1' | wc -l

Performance counter stats for 'xargs -P12 -0 -I {} sh -c cat {}| grep -F 10.0.0.1':

 557066.101261      task-clock (msec)         #    8.845 CPUs utilized
     5,545,246      context-switches          #    0.010 M/sec
       302,980      cpu-migrations            #    0.544 K/sec
       308,185      page-faults               #    0.553 K/sec

 1,263,417,643,081      cycles                    #    2.268 GHz                      (50.81%)
                 0      stalled-cycles-frontend   #    0.00% frontend cycles idle     (50.70%)
                 0      stalled-cycles-backend    #    0.00% backend  cycles idle     (50.78%)
                 0      instructions              #    0.00  insns per cycle          (50.76%)
                 0      branches                  #    0.000 K/sec                    (50.83%)
                 0      branch-misses             #    0.00% of all branches          (50.79%)

  62.978082644 seconds time elapsed

500500000

genivia-inc commented 4 years ago

Firstly, the timings with ugrep -z were bad, because ugrep 1.5.1 wasn't yet optimized for decompression and used a rudimentary and slow streambuf implementation that reads the decompressed stream byte-by-byte. This is fixed in 1.5.2.

Secondly, options such as -o (only matching), -c (count), -l (list files), -q (quiet) are much faster than the default line-by-line reading in the older ugrep 1.5.1. Optimizing the line-by-line matching will be addressed soon to ensure that it is as fast as -o. This enhancement is work in progress.

In fact, your example searches just two files with a lot of hits (all lines are a hit!).

Now, with the updated ugrep 1.5.3 I get the expected linear speedups for -z with -o and also when counting matches with -c:

time ugrep -z -Fc 10.0.0.1 ./2019-01-1/2019-01-1.conn.1.log.gz ./2019-01-1/2019-01-1.conn.2.log.gz | wc
        0.07 real         0.15 user         0.01 sys
       2       2      86
time ugrep -J1 -z -Fc 10.0.0.1 ./2019-01-1/2019-01-1.conn.1.log.gz ./2019-01-1/2019-01-1.conn.2.log.gz | wc
        0.14 real         0.14 user         0.01 sys
       2       2      86
time zgrep -Fc 10.0.0.1 ./2019-01-1/2019-01-1.conn.1.log.gz ./2019-01-1/2019-01-1.conn.2.log.gz | wc
        0.52 real         0.51 user         0.00 sys
       2       2      86

With the older 1.5.1, when you used ugrep to output all matches in (almost) all files as in your example, the results were more or less sequentialized due to threads taking turns to output results. The updated ugrep 1.5.3 output buffering strategy squeezes more performance out of parallel searches.

Comparing this to ripgrep there isn't much of a difference:

time ripgrep -z -Fc 10.0.0.1 ./2019-01-1/2019-01-1.conn.1.log.gz ./2019-01-1/2019-01-1.conn.2.log.gz | wc
        0.06 real         0.15 user         0.02 sys
       2       2      86

As I've stated in the README, the algorithm in ugrep to match a single word or a few words in some cases is not yet fully optimized (e.g. no SIMD/AVX), which accounts for the small difference.

Note: edited to include improvements in ugrep 1.5.3

genivia-inc commented 4 years ago

The updated ugrep 1.5.4 is much faster due to fixing unbuffered reads from pipes and other improvements:

time ugrep -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.29 real         0.49 user         0.04 sys
 1000000

With one thread (-J1) it takes 0.49s instead of 0.29s, showing a good speedup for two concurrent searches:

time ugrep -J1 -z -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.49 real         0.48 user         0.02 sys
 1000000

Using xargs -P2 is about the same as running ugrep with two threads (2 for 2 files plus 2 for decompressing the files using task parallelism):

time ./findxargs.sh | wc -l
        0.31 real         0.50 user         0.05 sys
 1000000

where findxargs.sh contains the line:

find 2019-01-1 -name '2019-01-1.conn.[12].log.gz' -print0 | xargs -P2 -0 ugrep -z -F 10.0.0.1

Note: at this time, ugrep options -o, -c, -l, -q, and -N are much faster than the default output, because line-by-line reading and matching is not yet fully optimized (work in progress). For example, with -o:

time ugrep -zo -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.16 real         0.26 user         0.03 sys
 1000000

and with -c:

time ugrep -zc -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.07 real         0.15 user         0.01 sys
       2

Now the ugrep -zo time of 0.16s is faster than ripgrep's -zo time of 0.20s while both use threads:

time ripgrep -zo -F 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l
        0.20 real         0.27 user         0.09 sys
 1000000

Because all lines of the input files match, your suggested tests are a significant outlier compared to the usual queries that return more sparse results.

genivia-inc commented 4 years ago

I will close this issue since ugrep option -J was improved by speeding up asynchronous output, which is particularly important when a lot of hits are reported, as in your test. I also used task parallelism to unzip files in parallel with searching, further improving the speed of ugrep and making it faster than piping the output of gunzip to ugrep.

I have more updates in the pipeline (available soon) that further improve the speed of ugrep, including searching with the default options (default is line-by-line output), making it about as fast as -o (only-matching).

rbnor commented 4 years ago

@genivia-inc What's the status on making line-by-line matching as fast as multiline search with -o? Because -o is insanely fast, but I need the entire matching line.

genivia-inc commented 4 years ago

Will have this soon.

My thoughts on this: I already updated RE/flex to version 1.5.0. That version returns the line of a match with the new line() matcher method. I've also added a new method span() that is more efficient than line(), but span() changes the buffer (to add a terminating \0), meaning we can't use it for mmap-ed files. With the new method(s), I expect ugrep's line matching to be as fast as -o and also to allow matches to span multiple lines like -o can, which is very nice to have as a bonus.

As a small caveat though, we may not be able to use the new line matching methods for options -A, -B, and -C, because the context (after/before) requires a history of lines read (especially -B, i.e. before). It could be done with mmap-ed files of course, since we can look back in the file, but not with piped input or decompressed streams, since we don't want to buffer all input for optimal search efficiency.
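
For illustration only, here is a minimal sketch of what line matching with the new RE/flex 1.5.0 matcher method could look like, assuming the line() API described above (this is not the actual ugrep source):

  #include <reflex/matcher.h>
  #include <iostream>

  int main()
  {
    // Match a fixed string read from stdin. line() returns the whole line
    // containing the current match, so no separate line-by-line reading is
    // needed. span() would be faster, but it writes a terminating \0 into
    // the buffer, so it cannot be used on read-only mmap-ed input.
    reflex::Matcher matcher("10\\.0\\.0\\.1", stdin);
    while (matcher.find())
      std::cout << matcher.line() << '\n';
    return 0;
  }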

rbnor commented 4 years ago

Awesome, looking forward to that. The terminating \0 isn't really an issue for my main use case, as mmapping files isn't that important for me with a larger number of smaller decompressed files. But it's something to be aware of for sure.

Very interesting. And agree on all points.

I have to say, -o is insanely effective, an easy 4x on performance on my system with smaller sets, and somehow it manages a higher consistent throughput on the IO, at least it looks that way. Testing on a larger dataset now, I see lower CPU utilization, a bit weird since it's about half the usage, but I think that could be because ugrep is able to use the data so efficiently, and as it's IO bound it's currently limited to my IO bandwidth. It's only using about 1000% as opposed to 2500% without -o.

rbnor commented 4 years ago

Now it spiked up to the expected range; I think it just has something to do with the files in a certain part of my data, not an error or anything.

genivia-inc commented 4 years ago

Thanks for the feedback, that's good news.

There is a trick that may get you even better performance: use --format='%o%~' instead of -o to output matches "as is", which is OK when the pattern matches text that can be displayed. This trick sends the match to the output immediately, bypassing some logic including the binary content check. The upcoming release has --format='%O%~' to display the matching line fast (the current 1.5.7 version is slower due to a memory allocation that will be optimized away).
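
For example, on the same two files as above the trick looks like this (the %O variant assumes the upcoming release mentioned above):

time ugrep -z -F --format='%o%~' 10.0.0.1 ./2019-01-1/2019-01-1.conn.[12].log.gz | wc -l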

genivia-inc commented 4 years ago

Here are some initial results of the line-matching improvements on my machine with a prototype implementation, just to get an idea of what the upcoming version 1.5.8 can achieve (or better).

With option -o to output matches only:

time ugrep -rzo -F 10.0.0.1 2019-01-1 | wc -l
       31.53 real       174.00 user         7.64 sys
 500000000

Without option -o to output matching lines:

time ugrep -rz -F 10.0.0.1 2019-01-1 | wc -l
       48.03 real       250.56 user        12.92 sys
 500000000

With option --format='%O%~' to output matching lines:

time ugrep -rz --format='%O%~' -F 10.0.0.1 2019-01-1 | wc -l
       25.44 real       144.50 user         4.74 sys
 500000000

As you can see, the --format option is super fast (third run), with -o coming in second (first run). The second run (without -o) is slower due to the logic required to check for multiple matches on the same line, which are all combined on a single line of output. By contrast, --format='%O%~' simply outputs a line for each match, which may produce multiple output lines when there are multiple matches on the same line. That isn't the case for your test data, where each line has just one match. But beware of this difference.

I am not yet happy with the results for the second run. The performance without option -o could be better. I'm looking into improvements. But it is still a lot better than 60s real time for this run with ugrep 1.5.5 that reads the input line-by-line by default to output matching lines, i.e. without option -o. Right now, we gained a 20% speed improvement making it 65% as fast as -o for my prototype that is still a bit crude. I will update this post on my progress with further improvements.

The way the matching works in 1.5.7 and later differs from previous versions (a RE/flex 1.5.0 update). Instead of shifting the input buffer up to the match when more input is read, the buffer shifts up to the start of the line. The input buffer is used when mmap cannot be used, e.g. with compressed files. The input buffer is essentially a sliding window over the input, while synchronizing on line starts, i.e. \n. I analyzed the performance difference between the buffer shift versions and there is no discernible performance penalty, which is good.

The advantage of this approach is that we always get the line with the match in the input buffer, so we no longer need to read the input line by line. However, we still need an extra buffer rest_line to store the rest of the line after the match, which requires a memcpy (for the std::string::assign used as a buffer in my prototype). If the next match is not on the same line but after it, then we output the rest of the line in rest_line before displaying the next match. But if the next match is on the same line, then we update this extra rest_line buffer, which is cheap and takes just a pointer update instead of a memcpy.

It would be advantageous to have a look-ahead mechanism that would tell us if the next match is on the same line. This would remove the need to keep the rest of the line in the rest_line buffer. Unfortunately, to try to achieve this by just matching the pattern up to the end of the line is not good enough, because that would invalidate multi-line pattern matches that overlap the current line with the next.

The cost to keep the rest_line buffer is really only due to multiline matching of patterns that include \n and could be removed otherwise.
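
To make this concrete, here is a toy sketch of the rest_line bookkeeping described above, using plain std::string scanning rather than the actual ugrep/RE/flex code:

  #include <iostream>
  #include <string>

  int main()
  {
    const std::string buf = "x 10.0.0.1 y 10.0.0.1 z\nq 10.0.0.1 r\n";
    const std::string pat = "10.0.0.1";
    std::string rest_line;               // tail of the line holding the previous match
    size_t last_eol = std::string::npos; // end of that line in buf

    for (size_t pos = buf.find(pat); pos != std::string::npos;
         pos = buf.find(pat, pos + pat.size()))
    {
      size_t bol = buf.rfind('\n', pos);
      bol = (bol == std::string::npos) ? 0 : bol + 1;
      size_t eol = buf.find('\n', pos);  // the buffer ends in \n, so this is found
      size_t end = pos + pat.size();
      if (eol == last_eol)
      {
        // next match is on the same line as the previous one: nothing to
        // flush, emit from the end of the previous match up to this match end
        size_t prev_end = eol + 1 - rest_line.size();
        std::cout << buf.substr(prev_end, end - prev_end);
      }
      else
      {
        // next match is on a later line: flush the remembered tail first
        std::cout << rest_line;
        std::cout << buf.substr(bol, end - bol); // start of line up to the match end
      }
      rest_line.assign(buf, end, eol + 1 - end); // remember the tail after this match
      last_eol = eol;
    }
    std::cout << rest_line;                      // flush the final tail
    return 0;
  }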

rbnor commented 4 years ago

Very very good, and an easy 70% performance increase here! Which makes it over 10x as fast as find, xargs and zgrep. Well done sir! Is it possible to output the filename at the start of each line as well, or would that significantly slow it down because it reintroduces some of the output checks that were removed to make it faster? Or would it slow it down because of the need to store the current file in the buffer? It shouldn't really be an issue, as I suppose it keeps track of which file it's at, but it's more overhead the way I see it, because it cannot simply push the current line to the output; it has to fetch the filename from the buffer as well. But how much would it matter? Not sure.

Not a very big issue, but it makes going through logs easier, knowing which logfile the hit came from.

genivia-inc commented 4 years ago

Is it possible to output the filename at the start of each line as well or would that significantly slow it down because of introducing some of the output checks etc that were removed to make it faster?

I assume you're referring to the --format option? To output the filename, use field %f in --format='%f:%O%~'. There is not much of a performance penalty to produce the file name this way, just a slight increase in the running time to write the extra data.

The --format option is not only fast, it offers customization of the output in any way you want. In fact, the --json, --xml, --csv, and --cpp options use it so there is no hardcoding to produce these outputs. See the ugrep man page for the --format fields.
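
For example, to prefix each matching line with its file name in the benchmark above:

ugrep -rz -F --format='%f:%O%~' 10.0.0.1 2019-01-1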

rbnor commented 4 years ago

Thanks a ton, that's perfect, I wasn't aware of how powerful that option was!

genivia-inc commented 4 years ago

I've updated support for compression with task parallelism. I also added support for other compression formats (if the libbzip2 and liblzma libs are installed). This 1.6.0 update speeds up single file search (and searching a few files) by searching and decompressing a file concurrently with option -z.

For option -z I also added compressed .tar.gz and uncompressed .tar file search, currently supporting tar format v7, ustar, gnu tar, and pax extended headers. Archived files are searched and reported.
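
For example (the archive name here is only illustrative), a compressed tar archive can be searched directly with -z:

ugrep -z -c serialize tests/big.tar.gz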

For example, the processor utilization is 120% when searching a single medium-large compressed file (uncompressed 34MB):

time ugrep -zc serialize tests/big.cpp.gz
20743
0.053u 0.013s 0:00.05 120.0%    0+0k 0+0io 0pf+0w

Actually, the search engine runs faster than the decompression thread. If we have more search hits the processor utilization goes up but the wall clock execution time stays the same:

time ugrep -zc void tests/big.cpp.gz
50525
0.060u 0.013s 0:00.05 140.0%    0+0k 0+0io 0pf+0w

This means that the search time spent is completely hidden as decompression dominates the parallel execution time.

We see this even more clearly with the following experiment that uses bzip2 instead of gzip (libz):

time ugrep -zc serialize tests/big.cpp.bz2
20743
0.900u 0.014s 0:00.89 102.2%    0+0k 0+0io 0pf+0w

The bzip2 format is expensive to decompress.

The xz (lzma) format is not too bad compared to bzip2 but 2x to 3x slower than gzip decompression for this experiment:

time ugrep -zc serialize tests/big.cpp.xz
20743
0.148u 0.015s 0:00.14 107.1%    0+0k 0+0io 0pf+0w

Because option -z also reads uncompressed files in parallel, in some cases, depending on the machine and file system, we get a speed up for uncompressed file searches, giving 150% processor utilization on my machine:

time ugrep -zc serialize tests/big.cpp
20743
0.021u 0.019s 0:00.02 150.0%    0+0k 0+0io 0pf+0w

I did not expect this to be this fast. It is a nice bonus when using option -z.

When many files are concurrently searched with option -z, there is no difference from the previous ugrep versions, due to processor resource limitations. At least there should not be a difference that impacts your use case. But feel free to shoot me down!

Concurrent decompression and (tar) search is enabled with WITH_DECOMPRESSION_THREAD in the ugrep.cpp source code. This #define can be commented out to revert to serial decompression and search, but this also disables tar file search.
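
Concretely, this is a compile-time switch in ugrep.cpp:

  // comment out to revert to serial decompression and search
  // (this also disables tar file search)
  #define WITH_DECOMPRESSION_THREAD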

rbnor commented 4 years ago

Impressive work! And thanks for an amazing writeup, again! This is already my go-to search tool, and it just keeps getting better. Well done sir.

genivia-inc commented 4 years ago

Good news! A quick update on ugrep's new performance enhancements such as AVX/SSE2 (v1.7.8), making ugrep even faster (on my machines) for your benchmark:

Matching 500 million lines in 1000 compressed files with -z and -o (down from 31.5s to 22.5s):

$ time ugrep -rzoF 10.0.0.1 rbergh/2019-01-1 | wc -l
       22.55 real        94.42 user        13.05 sys
 500000000

Matching 500 million lines in 1000 compressed files with -z and --format='%o%~' (down from 25.4s to 10.5s):

$ time ugrep -rzF --format='%o%~' 10.0.0.1 rbergh/2019-01-1 | wc -l
       10.49 real        57.14 user         8.23 sys
 500000000

Counting 500 million lines in 1000 compressed files with -z and -c:

$ time ugrep -rzcF 10.0.0.1 rbergh/2019-01-1 | wc -l
        6.51 real        44.73 user         6.85 sys
    1000

CPU utilization shows almost perfect speedup (8 cores):

43.656u 6.927s 0:06.94 728.6%   0+0k 0+0io 0pf+0w

rbnor commented 4 years ago

Awesome! That's well done, I will test this as well ASAP!

genivia-inc commented 4 years ago

Note that 1.7.8 was updated moments ago as there was a late-breaking issue detected in the download package after running tests. Sorry about that.