Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

Issue with `^`-anchor Pattern Matching in (u)grep #346

Closed by stdedos 6 months ago

stdedos commented 7 months ago

Disclaimer: This is probably not related to ugrep, since grep exhibits the same behavior, but I don't understand why. I hope it's okay to ask 🙏 I would have tried Discussions, if they were open, to avoid polluting the issues.

I'm trying to filter `ps faux` with `ps faux | grep -P '^(/home/u/.sdkman/candidates/java/11.0.21-tem/bin/java|prog)' --width`.

However, the command never finishes.

In contrast, both `ps faux | grep x` and `ps faux | grep -P '(/home/u/.sdkman/candidates/java/11.0.21-tem/bin/java|prog)' --width` finish within <<1s.

It seems that adding the `^` affects the result. "Most probably" (given the `ps faux` output) adding `^` would return 0 results, and I am fine with that. My issue is that the command does not finish and stays up indefinitely.

What's up? 😕 What am I doing wrong?
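
For context on why the anchored pattern is expected to return 0 results: each `ps faux` line starts with the USER column rather than the command path, so `^(...)` can only match a line that literally begins with the path or with `prog`. Below is a minimal sketch of that difference using Python's standard `re` module. It is not PCRE2, so it only illustrates the matching semantics and says nothing about the hang. The sample lines and the `matching_lines` helper are made up for illustration.

```python
import re

# Hypothetical ps-faux-style lines; real output differs per system.
SAMPLE = [
    "USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND",
    "u           4242  0.3  1.2 123456 54321 ?        Sl   10:00   0:07 "
    "/home/u/.sdkman/candidates/java/11.0.21-tem/bin/java -jar app.jar",
    "u           4300  0.0  0.0   9876  1234 pts/0    S+   10:05   0:00  \\_ prog --flag",
]

# The alternation from the report, without the leading anchor.
ALTERNATION = r"(/home/u/.sdkman/candidates/java/11.0.21-tem/bin/java|prog)"


def matching_lines(pattern, lines):
    """Return the lines in which `pattern` matches anywhere (grep-like search)."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


unanchored = matching_lines(ALTERNATION, SAMPLE)
anchored = matching_lines("^" + ALTERNATION, SAMPLE)

print(len(unanchored), "lines match without ^")
print(len(anchored), "lines match with ^")
```

With these sample lines the unanchored pattern selects the two process lines, while the anchored one selects nothing, because every line begins with a user name or the header.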

genivia-inc commented 7 months ago

I checked this out, but I don't see the problem you're describing. I assume you're using the latest ugrep. Perhaps it depends on the machine and/or PCRE2 version? What machine and OS are you using? What PCRE2 version?

stdedos commented 7 months ago

I'm using:

$ ugrep -v
ugrep 4.5.2 x86_64-pc-linux-gnu +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd
License: BSD-3-Clause; ugrep user manual:  https://ugrep.com
Written by Robert van Engelen and others:  https://github.com/Genivia/ugrep
Ugrep utilizes the RE/flex regex library:  https://github.com/Genivia/RE-flex

built today https://github.com/stdedos/ugrep/actions/runs/7567927185

genivia-inc commented 7 months ago

Perhaps it depends on the input file to search? I can't replicate this problem. If you are able to share the input that would help. Send it by email to me if it is sensitive.

stdedos commented 7 months ago

The input is not necessarily "sensitive", but it is dynamic: ps faux |

So ... try to find a process that has its full path listed in your system's ps faux (it will require a Linux system 😕), and then pick another "basic" executable, like bash.

You can then rebuild the command like so: | grep -P '^(/abs/path/to/soft/bin/executable|bash)'

Like I said, "somehow" that ^ changes the behavior of the command: whether it finishes or not.

genivia-inc commented 7 months ago

I tried on Debian, but I have no issues executing this. Perhaps gdb/lldb --pid <ugrep-PID> shows what's going on with ugrep? Also, what happens when you don't use option -P but -E instead?

Send me the ps faux output with which you're seeing this problem, which I can then use with your query to try to replicate this problem on my end.
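
One way to make the -P versus -E comparison repeatable is to save the `ps faux` output to a file and run each variant against it under a timeout. The following is a rough sketch, not something from ugrep itself: `finishes_within` and `compare_modes` are hypothetical helper names, and it assumes `ugrep` is on the PATH and reads standard input when no file argument is given, as in the piped commands above.

```python
import subprocess
import sys


def finishes_within(argv, input_bytes, seconds):
    """Run argv with input_bytes on stdin; True if it exits within `seconds`."""
    try:
        subprocess.run(
            argv,
            input=input_bytes,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,  # the child is killed when the timeout expires
        )
        return True
    except subprocess.TimeoutExpired:
        return False


def compare_modes(input_bytes, pattern, tool="ugrep", seconds=5.0):
    """Check the same pattern under -P and -E (options taken from the thread)."""
    return {
        opt: finishes_within([tool, opt, pattern], input_bytes, seconds)
        for opt in ("-P", "-E")
    }


# Self-check with the Python interpreter standing in for the search tool,
# so the helper can be exercised without ugrep installed.
quick = finishes_within([sys.executable, "-c", "pass"], b"", 10.0)
slow = finishes_within([sys.executable, "-c", "import time; time.sleep(5)"], b"", 0.5)
print("quick command finished:", quick)
print("slow command finished:", slow)
```

After saving the process list once (for example to a file named `ps-faux.txt`, a name chosen here only for illustration), passing its bytes and the anchored pattern to `compare_modes` would show whether only the -P run times out. If the tool is not installed, `subprocess.run` raises `FileNotFoundError` instead.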

stdedos commented 7 months ago

Okay, -E vs -P seems to make a difference.

Regarding the trace:

(gdb) bt
#0  0x00007facb66d6c95 in ?? ()
#1  0x000000000001fffe in ?? ()
#2  0x00007facb618645b in __GI__IO_file_xsgetn (fp=0x55d1e9d76000, data=<optimized out>, n=1) at fileops.c:1296
#3  0x000055d1e9d74108 in ?? ()
#4  0x000055d1e9d75000 in ?? ()
#5  0x0000000000000002 in ?? ()
#6  0x000055d1e9d482d0 in ?? ()
#7  0x000055d1e9d482e0 in ?? ()
#8  0x000055d1e9d746c0 in ?? ()
#9  0x00007facb653007b in pcre2_jit_match_8 () from /lib/x86_64-linux-gnu/libpcre2-8.so.0
#10 0x000055d1e7cd3d0d in ?? ()
#11 0x000055d1e7cc46a9 in ?? ()
#12 0x000055d1e7cb4dda in ?? ()
#13 0x000055d1e7ccebd5 in ?? ()
#14 0x000055d1e7cae717 in ?? ()
#15 0x00007facb611b083 in __libc_start_main (main=0x55d1e7cae640, argc=3, argv=0x7ffe27586b48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe27586b38)
    at ../csu/libc-start.c:308
#16 0x000055d1e7cafbee in ?? ()

However, at the moment, I don't have ddebs to "enhance" my stacktrace 😕

It seems indeed that the ps faux output is to blame and NOT the piping. Here's the output: t.t.7z.zip

stdedos commented 7 months ago

A bit more detail now that I have my symbols, though some things are still missing:

(gdb) bt
#0  0x00007f37a2987222 in ?? ()
#1  0x000000000001fffe in ?? ()
#2  0x00007f37a243645b in __GI__IO_file_xsgetn (fp=0x56287425e000, data=<optimized out>, n=1) at fileops.c:1296
#3  0x000056287425c028 in ?? ()
#4  0x000056287425d000 in ?? ()
#5  0x0000000000000002 in ?? ()
#6  0x000056287425c560 in ?? ()
#7  0x000056287425c570 in ?? ()
#8  0x000056287425c5c0 in ?? ()
#9  0x00007f37a27e007b in pcre2_jit_match_8 () from /lib/x86_64-linux-gnu/libpcre2-8.so.0
#10 0x00005628735d1d0d in reflex::PCRE2Matcher::next_match (method=1, this=0x56287425bff0) at ../include/reflex/pcre2matcher.h:387
#11 reflex::PCRE2Matcher::match (this=0x56287425bff0, method=1) at ../include/reflex/pcre2matcher.h:318
#12 0x00005628735c26a9 in reflex::AbstractMatcher::Operation::operator() (this=<optimized out>, this=<optimized out>) at ../include/reflex/absmatcher.h:276
#13 Grep::search (this=0x7ffc81b922c0, pathname=<optimized out>, cost=<optimized out>) at ugrep.cpp:10957
#14 0x00005628735b2dda in Grep::ugrep (this=0x7ffc81b922c0) at ugrep.cpp:8632
#15 0x00005628735ccbd5 in ugrep () at ugrep.cpp:8309
#16 0x00005628735ac717 in main (argc=3, argv=0x7ffc81b96498) at ugrep.cpp:4641

genivia-inc commented 7 months ago

Strange. It seems to suggest PCRE2 is hanging somewhere on file input. However, it should take input from the buffer I'm supplying. There isn't much info on this in the trace though.

You can rebuild ugrep from source with ./build.sh CXXFLAGS=-g CFLAGS=-g to include symbols and turn optimization off. It will run slow, but this helps debugging.

BTW. I have no credentials to access the zip you've provided.

stdedos commented 7 months ago

> You can rebuild ugrep from source with ./build.sh CXXFLAGS=-g CFLAGS=-g to include symbols [...]

That is what I did, yes.

> [...] and turn optimization off

Yeah, "traditional" .ddebs don't handle -O0 😅 Just -g (and later stripping)

> BTW. I have no credentials to access the zip you've provided.

They were sent to contact@genivia.com, around the time the comment was posted. Maybe check your filters or spam folder?

genivia-inc commented 6 months ago

Will look at this soon. I'm swamped right now. I'm also working on completing a new algorithm and its implementation to optimize matching of "leading wildcards". It is a generic method, no hacks, which is great. But testing is critical.

genivia-inc commented 6 months ago

I've now run several tests without being able to replicate the problem you described, with the t.t file redirected as standard input to ugrep -P or to grep -P linked to ugrep, i.e. cat t.t | grep -P <pattern>.

I've used the latest ugrep compiled with PCRE2 10.36, 10.38 and 10.42 on Debian and macOS, on x64 and ARM64 devices. I tried ugrep -P, ugrep -G -P, and grep linked to ugrep so that ugrep emulates grep (essentially sets ugrep -G -.).