Genivia / ugrep

NEW ugrep 7.0: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License
2.64k stars 112 forks source link

Supporting `kvm-ok` script #269

Closed stdedos closed 1 year ago

stdedos commented 1 year ago

Hello there! I have encountered the following issue, when "masking" ugrep as grep, and running the kvm-ok script (from sudo apt-get install cpu-checker):

$ kvm-ok 
ugrep: error: error at position 23
(?m)^flags[[:blank:]]*:)\>
     mismatched ( )___/

I am not sure how big of a wrench does "ignoring" closing ) (and other grouping symbols) when clearly none was opened throws at your plans.

Is that a specific flavor of regex (dis)allows?

All of the kvm-ok commands are egrep which is

$ which egrep
/usr/bin/egrep
$ ls -lah `!!`
ls -lah `which egrep`
-rwxr-xr-x 1 root root 28 Ιαν  29  2020 /usr/bin/egrep
$ bat -pp /usr/bin/egrep
#!/bin/sh
exec grep -E "$@"
genivia-inc commented 1 year ago

Thank you for your feedback.

By adding syntax error reporting to ugrep for clearly syntactically problematic regex patterns, I hope that the regex pattern error reporting in ugrep helps users to avoid problems and avoid getting confused when nothing matches or the wrong lines match.

I am aware that some regex pattern libraries simply match verbatim text in a regex when the text is syntactically incorrect regex, such as in the example you show with a closing ) that is missing an opening (. But I was not aware of this particular example.

We could make ugrep more lenient when it is aliased to grep perhaps. The tricky bit to do that is to make a list when GNU grep ignores grammatically incorrect regex pattern syntax. Does such as list exist?

stdedos commented 1 year ago

No, I don't think so 😕 I don't think there's a list, or a "prediction" in the spec. It is my opinion that, in simplified BNF terms, the grammar looks like this:

<ERE> ::= <term> | <ERE> "|" <term>

<term> ::= <factor> | <term> <factor>

<factor> ::= <atom> | <atom> <meta-char>

<atom> ::= <CHAR> | "(" <ERE> ")"

<meta-char> ::= "*" | "+" | "?"

<CHAR> ::= any non-meta character

An opening parenthesis makes the next closing parenthesis special, but otherwise "it's just a character".

I think it's just "being nice to the user" to avoid the "unnecessary" ), ], } escape (especially when it's the sole operator on the expression).

genivia-inc commented 1 year ago

Hm, I disagree. It makes no logical sense what you're saying. First, I agree that a proper grammar should define <CHAR> as a non-meta symbol to avoid ambiguity. But a ) is a meta character by the POSIX ERE regex standard. Also Flex and Lex do report regex syntax errors for a lone ) meta character "unbalanced parenthesis". Also sed reports an "RE error: parentheses not balanced". So do most regex libraries and utilities. There is something "special" about GNU grep. I guess it's a choice that (GNU) grep made, not a grammar choice. Nothing in the grep standard dictates that ) is a normal character when not matched with an opening (. Most likely the decisions was made to avoid errors when running scripts, so the number of errors is minimized at a price of potential mistakes in the regex patterns and matches made. However, I disagree with that choice, because it can lead to unwanted/undefined behavior.

stdedos commented 1 year ago

But a ) is a meta character by the POSIX ERE regex standard

Great - so I guess I am definitely wrong, and that is specified by the standard. "Parsers doing users a favor to ignore it" are wrong, apparently.

idk, I guess I am more frustrated when I copy-paste a string snippet that has unbalanced parenthesis, and my search has (activated the) Regex, and "it ruins my day" to track down that unbalanced ).

My train of thought goes like this: catching a hanging ( is much more powerful to "avoid mistakes" in regexp parsing.

If the user really wants to use (...), will definitely start with (, making this an obvious mistake; having e.g. )) caught as mistake is helpful, but easy-to-catch/avoid.

I understand your concerns. I hope it is possible to have a ugrep compatibility mode for GNU grep 🙃

genivia-inc commented 1 year ago

I will put this on my TODO list to work on something to do in the near future. Only allow non-matching closing ), ], and } perhaps when ugrep is aliased to grep or egrep.

genivia-inc commented 1 year ago

Perhaps reusing option -Y is the best way to ignore mismatching closing ), ], and }. It will apply to ugrep-aliased grep and egrep commands (fgrep doesn't use regex).

What do you think? Are other meta symbols also ignorable? I don't believe so, but will check.

stdedos commented 1 year ago

I don't necessarily like it* - but, technically, you are correct: -Y is also undoing one other gnu-grep deviation.

*: empty matches are really not-a-thing. It's a "weird" message to see, but a valid one none-the-less. Insofar, I have had no "empty matches" compatibility issues with ugrep.

Are other meta symbols also ignorable? I don't believe so, but will check.

I don't think so. Maybe some monkey-business with [^[] or equivalent. However, I haven't seen it, and I'll avoid speaking 😛

We can handle it in another issue, if it so happens that we missed something here.

genivia-inc commented 1 year ago

OK, option -Y will permit ), ] and } when no opening form is specified. This will match a ) if the opening ( is not provided in the regex pattern. This will be included in the next release.

I am actually changing the logic slightly to always permit a lone ] and } in patterns, which appears to be OK with other regex libraries. So why not do that too.

genivia-inc commented 1 year ago

Update 3.12.7 includes the proposed solution. Unpaired ) in regex are matched literally when option -Y is used and by the grep and egrep aliases of ugrep.

stdedos commented 1 year ago

Verified as fixed

$ sudo !$
sudo kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
$ ugrep -v
ugrep: no PATTERN specified: specify --match or an empty "" pattern to match all input
For more help on options, try `ugrep --help' or `ugrep --help --match'
$ ugrep -V
ugrep 3.12.7 x86_64-pc-linux-gnu +avx2 +pcre2jit +zlib +bzip2 +lzma +lz4 +zstd
License BSD-3-Clause: <https://opensource.org/licenses/BSD-3-Clause>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
$ which grep
/home/stdedos/.installs/bin/grep
$ ls -lah /home/stdedos/.installs/bin/grep
lrwxrwxrwx 1 stdedos stdedos 14 Ιουλ 15 23:02 /home/stdedos/.installs/bin/grep -> /usr/bin/ugrep
genivia-inc commented 1 year ago

Nice!