Genivia / ugrep

NEW ugrep 7.1: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.66k stars 111 forks source link

Strange behavior in version 7 #434

Closed atarom closed 1 month ago

atarom commented 1 month ago

test.txt Hi, thanks for this excellent tool, I have been using it for some time without problems, but today when updating to version 7 I encountered this problem with some files like the attached one

$ ugrep '^addr:housenumber=' test.txt | ugrep -v '^addr:housenumber='
=286C
enumber=895

In previous versions I haven't had that problem, I have tried egrep and it works correctly with the same file

$ ugrep --version
ugrep 7.0.0 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,7z,tar/pax/cpio/zip
License: BSD-3-Clause; ugrep user manual: <https://ugrep.com>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
Ugrep utilizes the RE/flex regex library: <https://github.com/Genivia/RE-flex>

I only have one machine and my knowledge is very limited, can someone tell me if with version 7 and with the attached file, get the same wrong result as me?

Cheers

genivia-inc commented 1 month ago

Found the problem in the buffer window shifting in v7, after checking your example (thanks for sharing!). It's not the regex engine, but the alignment of the shifted window is off in the buffer internally. This was changed to accommodate hex line contexts, including after buffer shifting. Will release an update soon.

atarom commented 4 weeks ago

test2.txt.zip Hi, thank you very much for solving the previous bug so quickly. Today, with new data, I got this similar error with the current version 7.01

$ ugrep -v '^b5m:urlOrto=' test2.txt | ugrep '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD
$ ugrep --version
ugrep 7.0.1 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,7z,tar/pax/cpio/zip
License: BSD-3-Clause; ugrep user manual: <https://ugrep.com>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
Ugrep utilizes the RE/flex regex library: <https://github.com/Genivia/RE-flex>

and I don't have any line that start with "drid"

genivia-inc commented 4 weeks ago

GNU grep hits ^drid too:

$ ugrep -v '^b5m:urlOrto=' test2.txt | ggrep '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD

Also with PCRE2 matching the same happens:

$ ugrep -vP '^b5m:urlOrto=' test2.txt | ggrep -E '^drid'
drid.es/vgn-ext-templating/v/index.jsp?vgnextchannel=9e4c43db40317010VgnVCM100000dc0ca8c0RCRD&vgnextoid=924b7b05a8350410VgnVCM2000000c205a0aRCRD

but it does not return a match with ugrep 6.5, so something seems still off with the -v and ^ combination in the first regex with v7. I was sure was fixed after more testing on our end to verify the updated logic on the position of the begin-of-line assertion. Apparently not.

genivia-inc commented 4 weeks ago

Looks like a problem with the buffer-shift-event-handler logic, which depends on the buffer's shifted location at the begin-of-line. Because we keep a bit more data in the buffer to accommodate context hexlines, the handler may not always get a fix on the correct line to use, This affects option -v and likely also options -ABC when files are large.

The solution would be to revert back to the old logic, except when we have hex context lines because we don't use -v and -ABC in that case anyway.

genivia-inc commented 4 weeks ago

I've tested the change over back to the old logic and added the hex context lines update mentioned above. Both work like a charm. To make sure, I tested with your examples but also with different internal buffer sizes and diff'ed the results of them to spot any deviations that can be attributed to problems with the buffer shift logic and associated event handlers or the hex context lines logic for --hexdump.

Will release an update very soon.