Bug? Before line not returning for me on specific file

Genivia / ugrep

Bug? Before line not returning for me on specific file #355

Closed: JJenkx closed this issue 6 months ago

JJenkx commented 6 months ago

Edit: After converting the file from LF to CRLF line endings, ug prints the before line correctly:

ug -B2 -A1 -i -Z+-~1 "super quickly" awk.output.txt
  8802-
  8803- https://www.youtube.com/watch?v=DnEJrgc1BCk&t=2817s
  8804: its way super quickly into your
  8805-

Line 8803 should contain text. When I ran the same test on a file with just a few lines of the same data, the desired before line was returned properly. See the attached file used in my test case, awk.output.txt. Output with the original file, where line 8803 comes back empty:

ug -B2 -A1 -i -Z+-~1 "super quickly" awk.output.txt
  8802-
  8803-
  8804: its way super quickly into your
  8805-

ug --version
ugrep 4.5.2 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,brotli
License: BSD-3-Clause; ugrep user manual:  https://ugrep.com
Written by Robert van Engelen and others:  https://github.com/Genivia/ugrep
Ugrep utilizes the RE/flex regex library:  https://github.com/Genivia/RE-flex

genivia-inc commented 6 months ago

Thank you for your feedback and for the awk script. I will take a look at this asap in the next day(s). Looks like an issue that I have to fix. The CRLF pair can be tricky to work with sometimes in text processing code for various reasons. I suppose a CR ends the line and cuts the output of the next line for some reason. It's a bit of a surprise, since I have many tests to check ugrep, beyond the ones included with the installation scripts.

JJenkx commented 6 months ago

If this issue is somehow related to this specifically formatted data, you can use this script to get more similar data. https://github.com/JJenkx/scripts/blob/main/subs

Write the variable "$formatted_transcript_file" to an output file inside the function "read_and_format_transcript_file".

genivia-inc commented 6 months ago

I found the problem. It is caused by an input buffer shift that occurs when consuming large files, a case I've tested in the past. But here there is a hiccup in the logic that produces the "before lines context", something like an off-by-one error. That will be fixed and released with version 5.0, which I've worked on daily for the last few weeks to increase performance; testing and tuning (rinse and repeat) has consumed most of my time.
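For readers unfamiliar with what "before lines context" means, here is a minimal sketch of the `-B` semantics. This is not how ugrep implements it (ugrep works on a shifting input buffer, as described above); the `OutputLine` struct and `search_with_before` function are hypothetical names used only for illustration, and matching is a plain substring search.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical output record: one printed line, flagged as match or context.
struct OutputLine {
  std::size_t lineno;   // 1-based line number
  bool is_match;        // true for a matching line, false for a before-context line
  std::string text;
};

// Hypothetical helper: emit every line containing `needle`, preceded by up to
// `before_context` non-matching lines that come directly before it.
std::vector<OutputLine> search_with_before(const std::vector<std::string>& lines,
                                           const std::string& needle,
                                           std::size_t before_context)
{
  std::vector<OutputLine> out;
  // holds at most `before_context` of the most recent non-matching lines
  std::deque<std::pair<std::size_t, std::string>> pending;
  for (std::size_t i = 0; i < lines.size(); ++i)
  {
    const std::size_t lineno = i + 1;
    if (lines[i].find(needle) != std::string::npos)
    {
      // flush the remembered lines as before context, then the match itself
      for (const auto& p : pending)
        out.push_back({p.first, false, p.second});
      pending.clear();
      out.push_back({lineno, true, lines[i]});
    }
    else
    {
      pending.emplace_back(lineno, lines[i]);
      if (pending.size() > before_context)
        pending.pop_front();
    }
  }
  return out;
}
```

With `before_context` set to 2, a match on line 4 of a five-line input yields lines 2 and 3 as context followed by line 4 as the match. The bug reported here is the case where one of those context lines came back empty.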

genivia-inc commented 6 months ago

Problem fixed. The patch is to change two lines in ugrep.cpp in ContextGrepHandler::operator() to adjust the logic as follows:

    // functor invoked by the reflex::AbstractMatcher when the buffer contents are shifted out, also called explicitly in grep::search
    virtual void operator()(reflex::AbstractMatcher& matcher, const char *buf, size_t len, size_t num) override
[...]
      // if we only need the before context, then look for it right before the current lineno
      if (state.after_length >= flag_after_context)
      {
        size_t current = matcher.lineno();
        if (lineno + flag_before_context + 1 < current) // FIX
          lineno = current - flag_before_context - 1; // FIX
      }

The fix will be included in 5.0.
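The arithmetic of the two patched lines can be tried in isolation. The sketch below is a standalone illustration, not ugrep code: `clamp_before_start` is a hypothetical helper whose parameters stand in for `lineno`, `matcher.lineno()` and `flag_before_context`, and it assumes `lineno` is the line from which the handler resumes looking for before context.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the two FIX lines: if the tracked line number
// lags more than before_context + 1 lines behind the current line, move it
// forward so that it sits exactly before_context + 1 lines behind.
std::size_t clamp_before_start(std::size_t lineno,
                               std::size_t current,
                               std::size_t before_context)
{
  if (lineno + before_context + 1 < current)
    lineno = current - before_context - 1;
  // invariant after the adjustment: never further behind than before_context + 1
  assert(lineno + before_context + 1 >= current);
  return lineno;
}
```

Using numbers modeled on the report (`-B2`, match on line 8804): a stale `lineno` of 100 is moved forward to 8801, while a `lineno` of 8803 is already close enough and stays unchanged. The subtraction cannot underflow because it only runs when `current` is larger than `before_context + 1`.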