Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

Problem with options -v -An for large files #319

Closed by genivia-inc 9 months ago

genivia-inc commented 9 months ago

Observed when searching the 100,000,000-byte enwik8 benchmark file for the word "the", printing inverted matches (-v) with one line of "after context" (-A1):

ugrep -vA1 -n the enwik8 | wc
 1114216 12665271 100352570

The correct output should be:

ugrep -vA1 -n the enwik8 | wc
 1114310 12671469 100396462

The problem may happen with very large files that have a high match count for the specified patterns, such as the word "the" in the large enwik8 Wikipedia file. An internal buffer shift adds 1 to a line number counter in function begin_before(), which is called by the InvertContextGrepHandler() functor that is triggered at the buffer shift. As a result, lineno is incremented one time too many when InvertContextGrepHandler() is also used to output context at the same time. This causes a missed line in the output.
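
This issue does not show the ugrep code involved, so the following is only a toy model of this class of off-by-one error, not the actual implementation. It assumes a simplified scanner in which the line counter also serves as the read position, and the names emit_non_matching, chunk_size and double_count are hypothetical. It shows how one extra increment at each simulated buffer shift can drop a line from the output:

```python
def emit_non_matching(lines, pattern, chunk_size, double_count=False):
    """Toy model: return (lineno, text) for lines that do not contain pattern.

    Lines are processed in chunks of chunk_size to imitate a buffer that is
    shifted when it is full. With double_count=True the simulated shift bumps
    the counter once more, although the handler already advanced it, so the
    first line after each shift is skipped.
    """
    out = []
    lineno = 0  # 0-based index of the next line to process
    while lineno < len(lines):
        end = min(lineno + chunk_size, len(lines))
        # "Handler": flush everything up to the end of the current buffer.
        while lineno < end:
            if pattern not in lines[lineno]:
                out.append((lineno + 1, lines[lineno]))
            lineno += 1
        # "Buffer shift": the hypothetical bug counts one line too many here.
        if double_count and lineno < len(lines):
            lineno += 1
    return out


sample = ["a", "the", "b", "c", "the d", "e"]
good = emit_non_matching(sample, "the", chunk_size=2)
bad = emit_non_matching(sample, "the", chunk_size=2, double_count=True)
print(good)  # [(1, 'a'), (3, 'b'), (4, 'c'), (6, 'e')]
print(bad)   # [(1, 'a'), (4, 'c')]
```

In this model every shift can lose one line, which would be consistent with a difference that only shows up on large inputs that trigger many buffer shifts.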

Note: Fixed in the latest commit of v4.3.3-1. The output is now exactly the same byte-for-byte as GNU grep 3.11.
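
For cross-checking on a small input without relying on either tool, the -v -A N -n output can be sketched in a few lines of Python. This is an independent sketch, not code from ugrep or GNU grep. The function name invert_after_context is hypothetical, and the sketch assumes plain substring matching and the usual grep output conventions: ":" after the line number for selected lines, "-" for context lines, and "--" between non-adjacent groups.

```python
def invert_after_context(lines, pattern, after=1):
    """Return output lines in the style of `grep -v -A<after> -n pattern`.

    Assumes substring matching; a line is selected when it does NOT
    contain pattern, and up to `after` following lines are printed as context.
    """
    out = []
    last_printed = 0  # 1-based number of the last line written, 0 if none
    remaining = 0     # context lines still owed after the last selected line
    for lineno, text in enumerate(lines, start=1):
        selected = pattern not in text
        if selected or remaining > 0:
            # Separate groups of output lines that are not adjacent in the input.
            if last_printed and lineno > last_printed + 1:
                out.append("--")
            sep = ":" if selected else "-"
            out.append(f"{lineno}{sep}{text}")
            last_printed = lineno
            remaining = after if selected else remaining - 1
    return out


sample = ["alpha", "the one", "the two", "beta", "the three"]
print("\n".join(invert_after_context(sample, "the", after=1)))
```

For this sample the sketch prints lines 1 and 2, a "--" separator, then lines 4 and 5. Line 3 is omitted because it matches "the" and falls outside the one-line context window.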