Closed stdedos closed 1 year ago
Thank you for your feedback.
By adding syntax error reporting to ugrep for clearly syntactically problematic regex patterns, I hope that the regex pattern error reporting in ugrep helps users to avoid problems and avoid getting confused when nothing matches or the wrong lines match.
I am aware that some regex pattern libraries simply match verbatim text in a regex when the text is syntactically incorrect regex, such as in the example you show with a closing )
that is missing an opening (
. But I was not aware of this particular example.
We could make ugrep more lenient when it is aliased to grep
perhaps. The tricky bit to do that is to make a list when GNU grep ignores grammatically incorrect regex pattern syntax. Does such as list exist?
No, I don't think so 😕 I don't think there's a list, or a "prediction" in the spec. It is my opinion that, in simplified BNF terms, the grammar looks like this:
<ERE> ::= <term> | <ERE> "|" <term>
<term> ::= <factor> | <term> <factor>
<factor> ::= <atom> | <atom> <meta-char>
<atom> ::= <CHAR> | "(" <ERE> ")"
<meta-char> ::= "*" | "+" | "?"
<CHAR> ::= any non-meta character
An opening parenthesis makes the next closing parenthesis special, but otherwise "it's just a character".
I think it's just "being nice to the user" to avoid the "unnecessary" )
, ]
, }
escape (especially when it's the sole operator on the expression).
Hm, I disagree. It makes no logical sense what you're saying. First, I agree that a proper grammar should define <CHAR>
as a non-meta symbol to avoid ambiguity. But a )
is a meta character by the POSIX ERE regex standard. Also Flex and Lex do report regex syntax errors for a lone )
meta character "unbalanced parenthesis". Also sed
reports an "RE error: parentheses not balanced". So do most regex libraries and utilities. There is something "special" about GNU grep. I guess it's a choice that (GNU) grep made, not a grammar choice. Nothing in the grep standard dictates that )
is a normal character when not matched with an opening (
. Most likely the decisions was made to avoid errors when running scripts, so the number of errors is minimized at a price of potential mistakes in the regex patterns and matches made. However, I disagree with that choice, because it can lead to unwanted/undefined behavior.
But a
)
is a meta character by the POSIX ERE regex standard
Great - so I guess I am definitely wrong, and that is specified by the standard. "Parsers doing users a favor to ignore it" are wrong, apparently.
idk, I guess I am more frustrated when I copy-paste a string snippet that has unbalanced parenthesis, and my search has (activated the) Regex, and "it ruins my day" to track down that unbalanced )
.
My train of thought goes like this: catching a hanging (
is much more powerful to "avoid mistakes" in regexp parsing.
If the user really wants to use (...)
, will definitely start with (
, making this an obvious mistake; having e.g. ))
caught as mistake is helpful, but easy-to-catch/avoid.
I understand your concerns. I hope it is possible to have a ugrep
compatibility mode for GNU grep 🙃
I will put this on my TODO list to work on something to do in the near future. Only allow non-matching closing )
, ]
, and }
perhaps when ugrep is aliased to grep or egrep.
Perhaps reusing option -Y
is the best way to ignore mismatching closing )
, ]
, and }
. It will apply to ugrep-aliased grep and egrep commands (fgrep doesn't use regex).
What do you think? Are other meta symbols also ignorable? I don't believe so, but will check.
I don't necessarily like it* - but, technically, you are correct: -Y
is also undoing one other gnu-grep deviation.
*: empty matches are really not-a-thing. It's a "weird" message to see, but a valid one none-the-less. Insofar, I have had no "empty matches" compatibility issues with ugrep
.
Are other meta symbols also ignorable? I don't believe so, but will check.
I don't think so. Maybe some monkey-business with [^[]
or equivalent.
However, I haven't seen it, and I'll avoid speaking 😛
We can handle it in another issue, if it so happens that we missed something here.
OK, option -Y
will permit )
, ]
and }
when no opening form is specified. This will match a )
if the opening (
is not provided in the regex pattern. This will be included in the next release.
I am actually changing the logic slightly to always permit a lone ]
and }
in patterns, which appears to be OK with other regex libraries. So why not do that too.
Update 3.12.7 includes the proposed solution. Unpaired )
in regex are matched literally when option -Y
is used and by the grep and egrep aliases of ugrep.
Verified as fixed
$ sudo !$
sudo kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used
$ ugrep -v
ugrep: no PATTERN specified: specify --match or an empty "" pattern to match all input
For more help on options, try `ugrep --help' or `ugrep --help --match'
$ ugrep -V
ugrep 3.12.7 x86_64-pc-linux-gnu +avx2 +pcre2jit +zlib +bzip2 +lzma +lz4 +zstd
License BSD-3-Clause: <https://opensource.org/licenses/BSD-3-Clause>
Written by Robert van Engelen and others: <https://github.com/Genivia/ugrep>
$ which grep
/home/stdedos/.installs/bin/grep
$ ls -lah /home/stdedos/.installs/bin/grep
lrwxrwxrwx 1 stdedos stdedos 14 Ιουλ 15 23:02 /home/stdedos/.installs/bin/grep -> /usr/bin/ugrep
Nice!
Hello there! I have encountered the following issue, when "masking" ugrep as
grep
, and running thekvm-ok
script (fromsudo apt-get install cpu-checker
):I am not sure how big of a wrench does "ignoring" closing
)
(and other grouping symbols) when clearly none was opened throws at your plans.Is that a specific flavor of regex (dis)allows?
All of the
kvm-ok
commands areegrep
which is