Genivia / ugrep

NEW ugrep 7.1: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

`(^|x)` causes error "empty expression" #427

Closed: xrat closed this issue 1 month ago

xrat commented 2 months ago

I tried to replace GNU grep with ugrep and found that the pattern (^|x), which I use at times, causes the error "empty expression":

$ ugrep --version | head -n 1
ugrep 6.5.0 x86_64-pc-linux-gnu +sse2; -P:pcre2jit
$ cat /etc/debian_version
11.11
$ grep --version | head -n 1
grep (GNU grep) 3.6
$ grep -E '(^|x)' <<< foobar
foobar
$ ugrep -E '(^|x)' <<< foobar
ugrep: error: error at position 6
(?m)(^|x)
      \___empty expression

genivia-inc commented 1 month ago

Will take a look at this. It works fine with the $ anchor, but ^ is handled differently internally.

genivia-inc commented 1 month ago

Just a quick follow-up note: I'm just theorizing here, but it appears that grep simply outputs all lines with grep -E '(^|x)', so the ^ is treated like any other empty pattern that matches all input lines. It doesn't do anything special, because grep -E -o '(^|x)' only outputs matches of x, nothing else, which means that its internal machinery isn't using ^ at all in this case. Perhaps this is some GNU/BSD grep peculiarity? Will check it out.
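The behavior described above can be reproduced with a small sketch. It assumes GNU grep is installed as `grep`; the `sample`, `count_all`, and `only_matching` helpers and the two input lines are made up for illustration and are not from the report.

```shell
# Hypothetical sample input: one line without "x", one line starting with "x".
sample() { printf 'foobar\nxyz\n'; }

# Count selected lines. Both lines are selected, because the "^"
# alternative matches the empty string at the start of every line.
count_all() { sample | grep -E -c '(^|x)'; }

# Print only the matched text. GNU grep does not print empty matches
# with -o, so only the literal "x" from the second line is shown.
only_matching() { sample | grep -E -o '(^|x)'; }

count_all
only_matching
```

With GNU grep this should print `2` followed by `x`, which is consistent with the observation that `-o` reports only the `x` matches even though every line is selected.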

xrat commented 1 month ago

Please note that I was just providing a minimal example. My use case is (^|x)y.

stephentalley commented 1 month ago

I am also affected by this. IIRC, I was trying to match '(^|\s)x', which GNU grep handles without issue.

There may be more efficient ways to write that expression, but if ugrep is intended to be a drop-in replacement for GNU grep, it seems like it should support these types of expressions.
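One possible way to restructure such an expression is to distribute the alternation, turning `(^|x)y` into `^y|xy`. The sketch below only checks, with GNU grep, that both forms select the same lines. Whether the affected ugrep version accepts the rewritten form is an assumption that this thread does not confirm. The `sample`, `grouped`, and `distributed` helpers and the input lines are hypothetical.

```shell
# Hypothetical sample lines, used only for illustration.
sample() { printf 'y first\nxy middle\nzy no match\na xy b\n'; }

# Grouped form from the report: "y" at line start or directly after "x".
grouped() { sample | grep -E '(^|x)y'; }

# Distributed form: the same alternatives, each written out with "y".
distributed() { sample | grep -E '^y|xy'; }

grouped
distributed
```

Both commands should print the same three lines (`y first`, `xy middle`, `a xy b`). The same rewrite applies to `(^|\s)x`, which becomes `^x|\sx`.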

genivia-inc commented 1 month ago

This limitation of the ^ anchor is no longer present in the upcoming ugrep update:

$ ugrep -c '(^|\s)y' enwik8 --stats
23930
Searched 1 file in 0.087 seconds: 1 matching (100%)

GNU grep is about 10x slower in this case:

$ /usr/bin/time ggrep -c -E '(^|\s)y' enwik8 
23930
        0.84 real         0.82 user         0.01 sys

See also #426