Cannot use unescaped brace in regex

Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more

https://ugrep.com

BSD 3-Clause "New" or "Revised" License


Cannot use unescaped brace in regex #340

Closed mohd-akram closed 8 months ago

mohd-akram commented 8 months ago

The parsing of braces is opportunistic in grep and other regular expression engines (I tried Node.js and Python). If the engine cannot parse the brace as the start of a repetition count, it treats the brace as a literal character:

$ echo { | grep -E {
{
$ echo { | ugrep -E {
ugrep: error: error at position 4
(?m){
    \___empty expression

$ echo a{b} | grep -E a{b}
a{b}
$ echo a{b} | ugrep -E a{b}
ugrep: error: error at position 6
(?m)a{b}
      \___invalid repeat
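The same opportunistic parsing can be checked in Python, one of the engines mentioned above. This is a minimal sketch using only the standard `re` module; the variable names are illustrative. It also shows the bracket-expression spelling `[{]`, which is one way to make the literal brace explicit rather than relying on the fallback.

```python
import re

# A lone brace, or a brace that does not hold a count, is matched literally.
lone = re.search(r"{", "{")
pseudo_count = re.search(r"a{b}", "a{b}")

# A well-formed count is still parsed as a quantifier.
real_count = re.fullmatch(r"a{2}", "aa")

# Writing the brace inside a bracket expression avoids the ambiguity.
bracketed = re.fullmatch(r"a[{]b}", "a{b}")

print(lone.group(), pseudo_count.group(), real_count.group(), bracketed.group())
```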

mohd-akram commented 8 months ago

It turns out these regular expressions result in undefined behavior according to POSIX:

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>

If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

Note that JavaScript's Unicode regular expression mode (the u flag) also rejects these regular expressions.
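To find patterns that depend on this undefined behavior before running them through a stricter engine, a small checker can flag left braces that do not begin a valid interval expression. The sketch below is hypothetical and not part of grep or ugrep. It assumes the interval forms `{m}`, `{m,}` and `{m,n}`, covers only the second rule quoted above (it does not flag a valid interval that appears first in the ERE), and does not handle bracket expressions such as `[{]`.

```python
import re

# Interval forms assumed here: {m}, {m,} and {m,n}.
_INTERVAL = re.compile(r"\{\d+(,\d*)?\}")

def unportable_braces(pattern: str) -> list[int]:
    """Return offsets of unescaped '{' that do not begin a valid interval.

    Simplification: bracket expressions such as [{] are not handled.
    """
    offsets = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "{":
            m = _INTERVAL.match(pattern, i)
            if m:
                i = m.end()  # a well-formed interval; step past it
                continue
            offsets.append(i)  # brace that is not part of an interval
        i += 1
    return offsets
```

For the two patterns from the report, `unportable_braces("{")` and `unportable_braces("a{b}")` each return one offset, while a pattern such as `a{2,3}` returns an empty list.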

genivia-inc commented 8 months ago

Thank you for your feedback. Indeed, some forms have undefined behavior. I try to allow a few more special cases in "GNU grep" mode, i.e. when ugrep is renamed to grep (or fgrep, egrep), which auto-enables some options that make it a bit more permissive with regex forms. I would still prefer ugrep to produce an error message rather than accept undefined behavior and then decide what to make of it.