[RFE] block matching in ugrep

trantor commented 5 months ago

Hello hello. I've been wondering whether it could be feasible to implement a block-level or paragraph match mode for ugrep. There are a few implementations in the wild doing just that, but they usually are dedicated tools.

By block-level I mean a tool which would match on a block of contiguous lines defined by regex patterns matching one a start line and another an end line of the block. It would be interesting in order to allow, for instance, to match and return only matching blocks or only non-matching blocks against a given pattern.

Thanks in advance.

genivia-inc commented 5 months ago

This works out-of-the-box with ugrep with lazy quantifiers and \n newline matching. The block of text between two matching lines is matched by the lazy pattern (.*\n)*?, which is followed by the ending line pattern to end the lazily-matching block.

Given a BEGIN and an END pattern to match the begin and end line of the block to match:

$ ug 'BEGIN.*\n(.*\n)*?.*END' FILE

No new features are necessary. But perhaps the syntax is a bit off-putting, although it is short and only requires placing the pattern .*\n(.*\n)*?.* between BEGIN and END.

To match a BEGIN and an END pattern also on the same line as well as a block:

$ ug 'BEGIN(.*\n)*?.*END' FILE

genivia-inc commented 5 months ago

For example, to block-match C++ /*-*/ block comments look for regex /\* then match text with lazy regex (.*\n)*? and end with the regex .*\*+\/:

% ug '/\*(.*\n)*?.*\*+\/' file.cpp

trantor commented 5 months ago

Thanks @genivia-inc . Just an additional question at this point. What if I wanted to invert the match and output only the blocks that don't match a specific pattern but do match the BEGIN and END criteria? Would that require a 2-step process, one where I first match blocks using the BEGIN and END regexes as start and end of the matching and a second one where I invert the match and use BEGIN and END patterns with the matching pattern inbetween them? (I hope I was clear enough.)

genivia-inc commented 5 months ago

Option -v inverts matches. But that won't include the begin and end line, which is what you want.

To include begin and end lines, it will be necessary to use option -P with three patterns:

from the start of the file anchor \A to a line that matches BEGIN: \A(.*\n)*?.*BEGIN
from a line that matches END to a line that matches BEGIN, i.e. inverted order: END.*\n(.*\n)*?.*BEGIN
from a line that matches END to the end of the file anchor \Z: END(.*\n)*\Z

$ ug -P -e '\A(.*\n)*?.*BEGIN' -e 'END.*\n(.*\n)*?.*BEGIN' -e 'END(.*\n)*\Z' FILE

where the third pattern doesn't need to use a lazy quantifier since it matches greedily to the end of the file. Because of this lazy/non-lazy "conflict" at a zero-width anchor \Z, it's best to combine the second and third patterns:

$ ug -P -e '\A(.*\n)*?.*BEGIN' -e 'END.*\n(.*\n)*?(.*BEGIN|\Z)' FILE

All of this assumes that BEGIN ... END isn't nested. Nested blocks cannot be handled with regex, because it requires a context-free grammar, not a linear grammar. To do so requires a parser.

genivia-inc commented 5 months ago

I've updated the previous post, because anchors \A and \Z require option -P for Perl regex (with PCRE2) for this to work. I forgot to mention this. Otherwise, with POSIX lazy matching and anchors the matches are too long.

I'll take a look at the anchors with POSIX lazy quantifiers (without option -P), which in principle should work find with POSIX regex matching. But it seems that the greediness of my POSIX regex engine matches too much when anchors are used.

genivia-inc commented 5 months ago

OK. I've made a few minor changes to permit the use of anchors such as \A and \Z with lazy patterns, so -P is not necessary to accomplish the same here.

Will release as update 5.1.1 soon.

genivia-inc commented 5 months ago

Fixed. It appeared to be a minor problem with the DFA construction algorithm that was updated, causing anchors combined with lazy quantifiers to become too greedy. I have a set of unit tests. But tests are not 100% covering combinations like this one.

genivia-inc commented 5 months ago

I'm adding this issue to the Discussions tab, so folks who have the same question can find an answer (that is not closed).

vt-alt commented 4 months ago

Is it possible to print a whole paragraph (text separated by empty lines) if paragraph contains some match?

For example I get 3 matches with awk:

ugrep (master)$ git log | awk '/released 4.5/' RS= ORS='\n\n' | cat -A
    released 4.5.2$
    $
    - PR #344$
    - PR #343$
    - fix 7zip search on 32 bit systems when option -M (or -t setting -M) is used$
$
    released 4.5.1 fix bzip3/7zip configure interference$
    $
    Reported here:  https://github.com/Genivia/ugrep-indexer/issues/9$
$
    released 4.5.0$
    $
    - add 7zip archive search with option -z, fix #185$
    - apply Unicode normalization to combining characters in regex patterns, fix #298$
    - improved TUI TAB directory navigation when searching from the FS root$
    - updated ugrep.exe option -P to use PCRE2 10.42$
$

(Added cat -A so it visible it is actually 3 paragraphs.)

I heard that some old grep implementations had -p option for this.

genivia-inc commented 4 months ago

Never heard of the -p option, so that's an interesting historical note. In BSD grep option -p works as follows: "If -R is specified, no symbolic links are followed. This is the default." so it's definitely not something universal.

Use .*PATTERN(.|\n)*?\n\n to match a bunch of lines that end with an empty line. This uses the lazy quantifier (.\n)*? to match anything, including newlines up to the \n\n that follows the lazy pattern. The second \n will however also trigger a match for the line that follows, which is also shown as a result. This can be suppressed with option -o:

$ ug -o '.*PATTERN(.|\n)*?\n\n' file.txt

vt-alt commented 4 months ago

Thanks for the answer!

I just remembered this -p was from AIX: https://www.ibm.com/docs/fr/aix/7.1?topic=g-grep-command

vt-alt commented 4 months ago

Well, starting with .*PATTERN... it only prints the first matching line of paragraph until its end, not the whole paragraph (like with awk example). Example

$ git log | ug '.*ugrep-indexer/issues(.|\n)*?\n\n' | cat -A
    Reported here:  https://github.com/Genivia/ugrep-indexer/issues/9$
$
commit 7f18bb90c1b548e0749d660d27d1c501ed80f8a4$

Also, this commit line seems extraneous. (With -o it disappears).

genivia-inc commented 4 months ago

Right, I had mentioned option -o. The reason is that the second \n in \n\n no longer separates the match from the "extra" line that is added to the result. That's because \n separates lines. Matching \n also includes what comes after.

$ ug -o '(\n.+)*PATTERN(.|\n)*?\n\n' file.txt

Other patterns are possible, but some may cause backtracking that impacts performance. Starting with an \n gets results quicker (caveat: but it won't include the first line of a file when that paragraph matches)

genivia-inc commented 4 months ago

To avoid the extra line without -o, a lookahead can be used but this is lookahead is not documented because it can only be used at the end of a pattern, not inside like Perl option -P allows:

$ ug '(\n.+)*.*PATTERN(.|\n)*?\n(?=\n)' file.txt

vt-alt commented 4 months ago

Matching \n also includes what comes after.

That's non-intuitive. Also, this approach (as in the example) misses first and last paragraphs (because there could be no \n). So the task becomes kinda complicated.

Just thinking: maybe it would be cool to have additional pattern like -o that will be matched and printed IF second (-e) pattern matches inside of it.

genivia-inc commented 4 months ago

It is important to consider the fact that ugrep like other grep is not a tool like awk that separates input into records and records can be separated into fields.

Rather, ugrep is a (multi)line-oriented search tool. It does this efficiently in a streaming way, i.e. by not storing an entire file in memory. That is critical when searching recursively to not bog down a machine when files are large. In fact, ugrep has no limit on file size to search.

genivia-inc commented 4 months ago

Matching \n also includes what comes after.

That's non-intuitive. Also, this approach (as in the example) misses first and last paragraphs (because there could be no \n). So the task becomes kinda complicated.

No, it is not. Let me explain why. It is how the default matching works, i.e. anything that matches on a line results in outputting that entire line (unless option -o is used.) So when matching an \n we increment the line counter to the next line because from that point on we are on the next line already.

For example, say the input is:

abc\ndef\n

then matching c or c$ gives abc and matching c\nd gives abc\ndef and matching c\n or \nd also gives abc\ndef. The \n is just another character when matched.

You're probably thinking of \n\n as a record separator that is an empty line. But for grepping it isn't. It's just a sequence of characters to match and show in which lines it occurred.

vt-alt commented 4 months ago

I meant "non-intuitive" that \n is (somehow) appearing at the start of the line (so grepping for it also matches a full following line).

This is slightly different from ripgrep:

$ echo -e 'a\nb\nc\n' | ugrep -e 'b\n'
b
c
$ echo -e 'a\nb\nc\n' | rg -U -e 'b\n'
b
$

But when prepending \n they curiously behave the same:

$ echo -e 'a\nb\nc\n' | ugrep -e '\nb'
a
b
$ echo -e 'a\nb\nc\n' | rg -U -e '\nb'
a
b
$

Judging from this, for ripgrep, it seems, that \n is always at the end of the line (which is intuitive), but for ugrep it seems "something-not-at-this-line" and thus matching next/previous lines (or an 'non-existent' empty line for end of file).

I am not criticizing your approach, though, it's just curious observation.

Genivia / ugrep

[RFE] block matching in ugrep #369