Hexdump stops prematurely at LF

Genivia / ugrep

NEW ugrep 7.1: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more

https://ugrep.com

BSD 3-Clause "New" or "Revised" License


Hexdump stops prematurely at LF #428

Closed by misutoneko 1 month ago

misutoneko commented 2 months ago

New user here, hi :D

I've only tested ugrep for a short while, but I think I've found a bug. To cut to the chase here's an example:

echo -e "Hello World\nThis part left unseen" |ugrep -U --hexdump=1A1 "\x57\x6f\x72"

Which will give you this:

00000000  48 65 6c 6c 6f 20 57 6f  |Hello Wo|
00000008  72 6c 64 0a -- -- -- --  |rldJ----|

So the problem is that the output doesn't continue all the way to the context boundary, it stops at the first sight of LF (0x0a). OK sure, greps are usually line-oriented, but imho the byte values in context shouldn't matter when dealing with the binary stuff? Tested with v3.7.2 and v6.5.0.

Btw hexdumping is THE main attraction of ugrep for me :D I have high hopes that ugrep will help me to replace my silly old 600+ line bash/awk script that I'm now using for dumping...

misutoneko commented 2 months ago

Here's a workaround:

echo -e "Hello World\nThis part left unseen" \
|ugrep -y -U --color=always --hexdump=1A1 "\x57\x6f\x72" \
|grep -A1 --color=never 31m

The -y in the first ugrep is passthru. The second (u)grep finds the lines based on the color code and prints them with the desired context added. Note that the search pattern 31m is ok for demo purposes, but it should be refined for real-life usage.

EDIT: Actually that -y might eat all performance so it's better to restrict the first ugrep:

echo -e "Hello World\nThis part left unseen" \
|ugrep -U --color=always -A1 --hexdump=1A1 "\x57\x6f\x72" \
|ugrep --color=never -A1 31m
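
On refining the 31m pattern: a rough sketch of one way to do it, assuming the matched bytes are wrapped in an SGR color escape that ends in 31m (as the workaround above relies on). Anchoring the pattern to the ESC byte keeps it from firing on hex lines that merely contain the text "31m". The sample lines are a hypothetical stand-in for colored hexdump output, not real ugrep output, and the exact color codes depend on the color settings in use.

```shell
# Assumption: matches are colored with a sequence such as ESC[31m or ESC[1;31m.
esc=$(printf '\033')
sgr_red="${esc}\\[[0-9;]*31m"   # BRE: ESC, literal '[', optional parameters, then 31m

# Hypothetical colored hexdump lines: line 1 holds a colored match, line 2 is
# its context, line 3 only contains the plain text "31m" in its ASCII column.
sample=$(printf '%s\n' \
  "00000000  ${esc}[1;31m57 6f${esc}[0m  |Wo|" \
  "00000008  72 6c 64 0a  |rld.|" \
  "00000010  33 31 6d 0a  |31m.|" \
  "00000018  41 42 43 0a  |ABC.|")

# Selects only the colored line plus one line of context after it.
printf '%s\n' "$sample" | grep -A1 --color=never -e "$sgr_red"
```

With the bare pattern 31m, the third sample line would also be selected and would drag its own context line along.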

genivia-inc commented 1 month ago

As you point out, the hexdump feature is line-oriented, just as grep in general is a line-oriented tool. The reason is the efficient buffering with a window that shifts out at line boundaries, which makes it possible to quickly search several GB of input without impacting memory resources.

A way to avoid this is to match the newline byte \n as part of the pattern, but there are also reasons not to do that.

Perhaps this should be moved to the ugrep discussion so we can close this eventually.

misutoneko commented 1 month ago

Yeah I had a hunch it might be something like that. And I agree, if this is ever fixed it should be done in a way that doesn't compromise speed. The workaround, well, works for me well enough (it could break at the next version update though).

I guess only the repo owner can move to Discussions? Didn't spot any way to do that...

genivia-inc commented 1 month ago

I agree that there are opportunities to improve the hexdump feature with better context control. I haven't done that, because I didn't want to mess with the search engine.

What I would like to do is improve the hexdump context to always show the given number of hex lines before and after a match, regardless of the presence of newlines. That needs a change to the search engine so that the before hex lines (i.e. bytes) are not shifted out of the window when searching large files.

I will put that on the TODO list.
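
To illustrate the byte-oriented context described above, here is a minimal sketch that does not use ugrep at all. It assumes GNU grep and GNU od (the -w option is a GNU extension), a fixed, single-line ASCII pattern, and a regular file as input. The function name hexctx and its parameters are hypothetical. It looks up the byte offset of each match, then dumps whole hex lines around it, so a 0x0a byte in the context has no effect on where the dump stops.

```shell
# hexctx PATTERN FILE [WIDTH] [BEFORE] [AFTER]   (hypothetical helper)
# Dumps WIDTH-byte hex lines covering each match of the fixed string PATTERN,
# plus BEFORE/AFTER extra hex lines, regardless of newlines in the data.
hexctx() {
  pattern=$1 file=$2 width=${3:-8} before=${4:-0} after=${5:-1}
  len=${#pattern}                      # match length (ASCII pattern assumed)
  # grep -F -o -b prints "OFFSET:MATCH" for every occurrence of the pattern.
  LC_ALL=C grep -F -obUa -e "$pattern" "$file" | cut -d: -f1 |
  while read -r off; do
    first=$(( off / width - before ))  # first hex line to show
    if [ "$first" -lt 0 ]; then first=0; fi
    last=$(( (off + len - 1) / width + after ))  # last hex line to show
    start=$(( first * width ))
    count=$(( (last - first + 1) * width ))
    # -j skips to the start offset, -N limits the byte count, -v keeps repeats.
    od -A x -t x1 -v -w"$width" -j "$start" -N "$count" "$file"
  done
}

# Usage with the input from the report: 8 bytes per line, 1 hex line after.
tmp=$(mktemp)
printf 'Hello World\nThis part left unseen\n' > "$tmp"
hexctx Wor "$tmp" 8 0 1
rm -f "$tmp"
```

This sketch reads the file once per match and prints overlapping context twice when matches are close together, so it only shows the intended output shape and is not a substitute for doing this inside the search engine's window.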

genivia-inc commented 1 month ago

The upcoming v7 release will produce hex before and after context lines regardless of LF boundaries. The before hex context is guaranteed to display, as long as the before hex size specified is not crazy large like hundreds of hex lines (in that case, the hex before context may get truncated by chance if the file is several MB).

misutoneko commented 1 month ago

Thank you! Works like a charm from what I can see :D

genivia-inc commented 1 month ago

The v7.0.2 update improves hex context lines when these run into each other. I want to avoid overlap, which can be confusing. So it looks a lot cleaner in the update, at least it does to me :)