Genivia / ugrep

NEW ugrep 6.2: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

Very slow with certain JSON files containing long lines #169

Closed: mohd-akram closed this issue 2 years ago

mohd-akram commented 2 years ago

I'm using ugrep to tokenize JSON and I noticed it was quite slow on a certain large file. I recreated a similar looking file using this Node.js script:

// generate-json.js
// Writes one large, randomly nested JSON object per line to stdout.

function randomString() {
  // Concatenate 0-4 random base-36 fragments into a short random string.
  let a = '';
  for (let i = 0; i < 5 * Math.random(); i++)
    a += Math.random().toString(36).substring(2);
  return a;
}

function createObj(prob = 0.96) {
  // Build an object with objCount entries; each entry is either a flat
  // sub-object of keyCount random keys or, with probability 1 - prob,
  // another nested object created recursively.
  const objCount = 20;
  const keyCount = 5;
  const obj = {};
  for (let i = 0; i < objCount; i++) {
    const sub = {};
    for (let j = 0; j < keyCount; j++) {
      sub[randomString()] = Math.random() > 0.5 ? Math.random() * 100 : randomString();
    }
    obj[randomString()] = Math.random() > prob ? createObj() : sub;
  }
  return obj;
}

function main() {
  // Emit `count` JSON objects, one per (very long) line.
  const count = process.argv[2] || 900;
  for (let i = 0; i < count; i++) {
    process.stdout.write(JSON.stringify(createObj()));
    process.stdout.write('\n');
  }
}

main();

Run the above via node generate-json.js > large.json.

I used this script to test:

#!/bin/sh
# grep-test.sh
# Tokenize JSON with an ERE: strings, numbers, keywords, and punctuation.
set -eu
ESCAPE='(\\[^u[:cntrl:]]|\\u[0-9a-fA-F]{4})'
CHAR='[^[:cntrl:]"\\]'
STRING="\"$CHAR*($ESCAPE$CHAR*)*\""
NUMBER='-?(0|[1-9][0-9]*)([.][0-9]*)?([eE][+-]?[0-9]*)?'
KEYWORD='null|false|true'
SPACE='[[:space:]]+'
# Use grep -E unless EGREP is set (e.g. EGREP=ugrep); discard the output.
${EGREP-grep -E} -o "$STRING|$NUMBER|$KEYWORD|[][{}:,]" "$@" | cat >/dev/null

Results:

$ EGREP="grep -E" time ./grep-test.sh large.json
        9.23 real         8.86 user         0.12 sys
$ EGREP=ugrep time ./grep-test.sh large.json
       55.83 real        51.57 user         0.43 sys

I've attached a smaller version of the file (n=500 instead of 900), but it should demonstrate the same disparity.

large2.json.gz

genivia-inc commented 2 years ago

I will test this, but it is very likely due to Unicode regex construction, which in some cases can be slow when character classes include Unicode, like your negated class [^[:cntrl:]"\\], which covers a huge Unicode range up to U+10FFFF.

When the goal is to match ASCII input, use option -U to disable Unicode. That's also compatible with GNU grep.
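
For example, the same test script can be rerun with Unicode matching disabled (timings will vary by machine):

$ EGREP="ugrep -U" time ./grep-test.sh large.json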

mohd-akram commented 2 years ago

Adding -U does not seem to make much of a difference in this case. This does not happen with other large JSON files, seemingly only with ones that have long lines and are somewhat nested.

genivia-inc commented 2 years ago

The other likely reason is that the pattern matches newlines, which makes a match potentially much longer than the already long JSON lines. If a partial match is made but the last part does not match, then backtracking will happen with ugrep (and with grep too, but grep does not match newlines, which keeps matches within a single line).

Check whether you need to exclude \n from your patterns. For this reason, \n is automatically excluded from negated classes [^...] and from \s (Unicode space), but it is included in [[:space:]], even though GNU grep does not match \n with [[:space:]]. \s behaves the same with ugrep -U, so just use \s+ instead of [[:space:]]+. These details are mentioned in the README under "POSIX and Unicode character categories", but understandably the README is not the first thing to look into. Perhaps I should add some wording explaining why \s and the old [[:space:]] differ.
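
A small illustration of this difference on a made-up two-line input (exact output formatting may differ):

$ printf 'a ,\nb\n' | ugrep -o ',[[:space:]]+b'     # [[:space:]] includes \n in ugrep, so the match can span both lines
$ printf 'a ,\nb\n' | grep -E -o ',[[:space:]]+b'   # GNU grep never matches across \n, so this prints nothing
$ printf 'a ,\nb\n' | ugrep -o ',\s+b'              # \s excludes \n in ugrep, so this also prints nothing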

All of this helps make ugrep more powerful than grep, but there are minor details like this to keep in mind, since ugrep's regex patterns have virtually no restrictions on what can be matched. These more expressive patterns are used by all of the programming-language pattern definitions in the patterns directory, which for this reason will not work with grep.

Another option to consider is -P to switch to Perl regex matching, with patterns that also match newlines. By contrast, grep -P does not match newlines.
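
For example, the same test script can be pointed at Perl matching (not timed here; the POSIX classes used in the script are also accepted by Perl regex):

$ EGREP="ugrep -P" time ./grep-test.sh large.json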

mohd-akram commented 2 years ago

It seems the culprit is single-character matches: running ugrep -o "[{}:,]" large.json takes 23 seconds with ugrep versus 1.5 seconds with grep. It might have to do with output buffering; perhaps ugrep is flushing the output too often (e.g. with every match).

genivia-inc commented 2 years ago

The JSON lines are indeed quite long and there are a lot of matches per line.

Short-pattern matching is currently done with a bitap-like algorithm combined with hashing. This combination has been shown to perform very well against other greps for patterns longer than one character and when there are more than just a few possible characters to match. In your example there are only four possible characters, for which the current algorithm does not perform as well; compare this with ugrep -o "[{}]", for example. Also, ugrep -c "[{}:,]" runs as long as the -o option, but this can be improved in an update, because we can skip to the next line since -c only reports the number of lines matched (ugrep also offers -co to count matches rather than lines).
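
For example, the difference between the two counts on the same file looks like this (the actual numbers depend on the input):

$ ugrep -c  "[{}:,]" large.json    # number of lines with at least one match
$ ugrep -co "[{}:,]" large.json    # total number of matches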

It should not be too difficult to improve this case, because the algorithms are selected based on the pattern's characteristics, which in this case is a minimum and maximum match length of one. In this particular case the current algorithm does not use SIMD to find a match, but checks whether a character has an entry in a 256-byte table of all possible matching characters. That is fast enough for most cases, but when only a few characters can possibly match there are better matching algorithms.

PS. Output flushing does not happen for every match but in chunks once the output buffer is full. By contrast, line buffered mode flushes constantly, but that is not enabled by default.

genivia-inc commented 2 years ago

OK, I found that the slower speed is caused by input buffer (re)allocation when lines are very long. It appears that the input buffer keeps growing to accommodate long lines (even though the line doesn't need to be stored for options -o and -c). But what is more unexpected to me is that a small buffer of 256 bytes that keeps growing outperforms a large initial buffer of >64K by more than 10x!! Something is not working as expected, or not working correctly, in these later versions. I will investigate.

The good news is that we know what the problem is and that it is correctable, so the speed can be brought back up to what it should be.

genivia-inc commented 2 years ago

My initial ugrep experiments to correct input buffering issues are very promising (time=0.266s):

time src/ugrep -o "[{}:,]" > /tmp/junk.txt ~/Downloads/large2.json
0.266u 0.012s 0:00.28 96.4% 0+0k 0+0io 0pf+0w

which is more than 2x faster than grep (time=0.601s):

time grep -o "[{}:,]" > /tmp/junk.txt ~/Downloads/large2.json
0.601u 0.008s 0:00.61 98.3% 0+0k 0+0io 2pf+0w

With option --format='%o%~' for ugrep instead of -o, it's REALLY fast (time=0.049s):

time src/ugrep --format="%o%~" "[{}:,]" > /tmp/junk.txt ~/Downloads/large2.json
0.049u 0.012s 0:00.06 83.3% 0+0k 0+0io 0pf+0w

The reason for the speed difference is the line number checking overhead, because just counting the number of matches without counting lines or checking for line boundaries is the fastest (time=0.040s):

time src/ugrep -co "[{}:,]" > /tmp/junk.txt ~/Downloads/large2.json
0.040u 0.009s 0:00.05 80.0% 0+0k 0+0io 0pf+0w

There is nothing wrong with the pattern matching algorithms, which are blazingly fast. The problem is twofold: 1) binary-file checking took way too long because it checked the entire first line, which in this case is very long; it suffices to check the first 4K or so for binary content (zeros and invalid UTF-8); 2) the input buffering mechanism has suffered performance degradation over the last few updates. Will be fixed ASAP.

mohd-akram commented 2 years ago

Ah, so it's not the output buffering but the input one! Great news!

genivia-inc commented 2 years ago

The 3.3.10 update is ready. With this update, ugrep is 3x faster than GNU grep for your ugrep -o "[{}:,]" test case, compiled with clang 12.0.0 -O2 and tested on a 2.9 GHz Intel Core i7 with 16 GB 2133 MHz LPDDR3 RAM running macOS 10.15.7. Other tests show that performance is back to what it should be for files with very long lines. The problem was caused by a later SIMD optimization applied to the line counting method and was exacerbated by an increase in the buffer size. As a result, line counting was repeated for every match rather than on demand as intended, because the wrong pointer value was used. The problem was not noticeable with "normal" files with lines of average length. Your case was helpful to pinpoint the problem and fix it. Thanks for reporting it!

One thing that I could not further optimize to outperform GNU grep on files with very long lines is option -c. The reason is that ugrep may match multiple lines if the pattern includes newlines. This means that GNU/BSD grep's trick to speed up -c, simply advancing to the next line without matching anything beyond the first match on the line, is not trivial to replicate in ugrep, at least not without losing the ability to reliably match multiple lines with patterns that match newlines. If this trick were used in ugrep, the count produced by -c would be incorrect whenever we advance to the next line without matching the remainder of the line, which may cross a newline. Having said that, we could check whether a pattern can match any newlines, but that is not trivial either. My concern is that we don't want to get it wrong and introduce a bug. So for the time being, option -c can be slower with ugrep than with GNU grep for files with very long lines, and we know why. In general that should not be the case for "normal" files with line lengths in a reasonable range of hundreds of characters, not thousands.
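
As a tiny sketch of why that trick is unsafe, consider a made-up two-line input and a pattern that contains a newline:

$ printf 'a,\nb\n' | ugrep -o 'a,\nb'    # the single -o match spans the newline between the two lines
$ printf 'a,\nb\n' | ugrep -c 'a,\nb'    # counting this correctly requires scanning past the newline, so grep's skip-to-next-line trick does not apply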

mohd-akram commented 2 years ago

Tried it out and it works super fast now! It's down to 0.7 seconds, that's crazy. Thank you!