Genivia / ugrep

NEW ugrep 7.1: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

"Empty expression" error when using `\A` inside group #439

Closed alvin55531 closed 1 week ago

alvin55531 commented 2 weeks ago

If I put \A inside a capture group or non-capture group, it'll give an error.

Command:

ugrep -P '(?:\A|.*Testword5)'

(I have a file that contains "Testword5" for this test search)

Output:

ugrep: error: error at position 9
(?m)(?:\A|.*Testword5)
         \___empty expression

I've looked through the man page. The only thing that seems relevant is the --empty flag, but using it gives the same results.

genivia-inc commented 2 weeks ago

Indeed, \A can only be used as an anchor when followed by some non-empty pattern. For example, \A.* will work. Anchors are not boundaries. Boundaries can be used anywhere. Anchors are more restrictive: they anchor one (or, more typically, all) matches.

genivia-inc commented 2 weeks ago

I should add that matching a single \A is not possible. Only the ^ and $ anchors can be used to match without a pattern, but the \A and \Z anchors need a pattern as "context" to assert the begin-of-file match.
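To illustrate the general difference between the line anchor ^ and the begin-of-input anchor \A, here is a minimal sketch using Python's `re` module. This is an assumption-laden stand-in, not ugrep's matcher: Python accepts a bare \A, whereas ugrep rejects it as described above, and the sample text is made up for the demonstration.

```python
import re

# Hypothetical three-line input (no trailing newline).
text = "alpha\nbeta\ngamma"

# With MULTILINE, ^ matches at the start of every line.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]

# \A matches only at the very start of the input, even with MULTILINE.
input_starts = [m.start() for m in re.finditer(r"\A", text, re.MULTILINE)]

# \A followed by a pattern: only the first line can match.
first_line = re.findall(r"\A.*", text, re.MULTILINE)

print(line_starts)   # [0, 6, 11]
print(input_starts)  # [0]
print(first_line)    # ['alpha']
```

In this engine a bare \A yields a single empty match at offset 0, which gives some intuition for why a grep-style tool wants a pattern after it to have something to report.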

alvin55531 commented 2 weeks ago

Thank you for the clarification!

So the non-empty expression must be inside the capture group/non-capture group with the \A? So it has to be (?:\A.*|.*Testword) and not (?:\A|.*Testword)someotherpattern?

Is this a limitation set by ugrep (perhaps for performance?) or a limitation of the regex engine ugrep uses? I tried the original regex (on Regex101) with PCRE2 and it would match successfully.

genivia-inc commented 2 weeks ago

This is to avoid confusion and problems, i.e. cases where \A does not match anything because it isn't followed by a pattern. The check applies only to the regex syntax, not to its meaning. Regexes can be arbitrarily complex. An accurate check can only be done on the DFA, but that is not useful for finding the location in the regex that caused the problem, and it won't work with option -P (PCRE2), which also does not produce the expected output for a sole \A with no pattern following it.

For regex like (\A|aaa)bbb one can also write (\Abbb|aaabbb) which is the same.
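As a quick sanity check of that equivalence, the sketch below compares the match spans of the two forms on a few made-up sample strings. It uses Python's `re` module rather than ugrep's engine, so it only shows that the two regexes agree in one backtracking engine, under the assumption that \A has the usual begin-of-input meaning.

```python
import re

factored = re.compile(r"(\A|aaa)bbb")
expanded = re.compile(r"(\Abbb|aaabbb)")

# Hypothetical sample inputs chosen to exercise both alternatives.
samples = ["bbb", "aaabbb", "xbbb", "x aaabbb y", "bbb aaabbb", "aaa"]

def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in pattern.finditer(text)]

results = {s: (spans(factored, s), spans(expanded, s)) for s in samples}

for s, (f, e) in results.items():
    print(repr(s), f, e, "same" if f == e else "DIFFERENT")
```

Every sample should report `same`; for instance, "xbbb" matches neither form because bbb is neither at the start of the input nor preceded by aaa.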

genivia-inc commented 1 week ago

> Is this a limitation set by ugrep (perhaps for performance?) or a limitation of the regex engine ugrep uses? I tried the original regex (on Regex101) with PCRE2 and it would match successfully.

When I temporarily removed the limitation, a sole \A still did not work with PCRE2 through ugrep option -P. With the regex (\A|test), PCRE2 only matches lines with "test".

Therefore, disallowing a sole \A that is not followed by a pattern prevents problems and confusion. Sure, something like (\A|test)ing should match ing at the start of a file, but this is the same as writing the regex `\Aing|testing`. Any factored regex can be expanded this way and works with ugrep just fine.
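The expansion can be done mechanically by distributing the trailing pattern over each alternative. The helper below, `expand_factored`, is hypothetical and not part of ugrep; it assumes each alternative contains no top-level `|` of its own. The result is checked with Python's `re` module on a made-up string.

```python
import re

def expand_factored(alternatives, suffix):
    """Distribute suffix over each alternative: (a|b)s becomes as|bs.

    Hypothetical helper: assumes no alternative contains a top-level '|'.
    """
    return "|".join(alt + suffix for alt in alternatives)

# (\A|test)ing expands to \Aing|testing
expanded = expand_factored([r"\A", "test"], "ing")
print(expanded)

# Made-up input: "ing" at the start, "testing" later, "string" must not match.
text = "ing at the start, testing later, but not this string"
matches = [(m.start(), m.group()) for m in re.finditer(expanded, text)]
print(matches)
```

After expansion, every \A is directly followed by a non-empty pattern, which is the form ugrep accepts according to the comments above.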