--format option for bytes/characters of context with limit

dingus9 commented 3 months ago

I am searching large binary files, such as disc images and ELF binaries, for PII strings. My format string looks like "%[§]$%f%s%b%s%q%s%Q%~". This works well in many situations, except that %Q, %O and %C sometimes output nearly the entire binary file, probably because the file lacks newline characters.

Is there a way to do the equivalent of (%Q, %O, %C) that would do N bytes of context around the match? For example %Q32 would yield a limited C++ quoted escaped string of 32 bytes before and after a match.

Thanks

genivia-inc commented 3 months ago

> This works well in many situations except sometimes the %Q, %O, %C outputs nearly the entire binary file... Probably due to the lack of endline characters.

Yes. The output of %Q, %O and %C is delineated by newlines/endlines.

You could use %o, but this only outputs the match. It may still work for you, because a match can be extended by adding bytes before and after it, like .{0,10}PATTERN.{0,10}, where . (dot) matches any byte when you use option -U (so that . matches plain bytes rather than Unicode characters). The . (dot) pattern excludes newlines, so the extended match stays within the line.
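The pattern-widening trick can be tried outside ugrep. The sketch below uses Python's `re` module on bytes, where `.` also matches any byte except a newline; it only illustrates the idea and is not how ugrep is implemented. The function name `match_with_context` is made up for this example, and the pattern is inserted as a regex without escaping.

```python
import re

def match_with_context(data: bytes, pattern: bytes, n: int = 10):
    """Return each match of `pattern` widened by up to n non-newline bytes per side."""
    # Equivalent of .{0,n}PATTERN.{0,n}; the pattern is used as a regex, unescaped.
    widened = re.compile(b".{0,%d}(?:%s).{0,%d}" % (n, pattern, n))
    return [m.group(0) for m in widened.finditer(data)]

# A long "line" without newlines around the match, as in a binary file.
blob = b"\x00\x01" + b"A" * 50 + b"kube" + b"B" * 50 + b"\nnext line"
hits = match_with_context(blob, b"kube", 10)
print(hits)  # [b'AAAAAAAAAAkubeBBBBBBBBBB']
```

One caveat of this approach: the widened match consumes its context, so a second occurrence of the pattern within n bytes of the first is absorbed into the first match's context rather than reported separately.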

It might be useful to consider updating ugrep to include a format parameter to %o to specify a size of the context to output for the match, for example %[10]o. Or make this parameter limit the size of the pattern output. Don't know if this is something worthwhile.

dingus9 commented 3 months ago

It would be highly useful for us. Our regex is exceptionally long, as we are adding all of the PII info into one giant RE (we had to recompile PCRE2 on our system to update the character limits to a 64-bit word size). It's something like 12k characters, which gets expanded by ugrep's RE optimization into about 34k. ugrep churns through it pretty well, until we hit some weird binary files that don't contain newline delimiters at all.

We also use -F fixed strings with -f fixed-string input files in some situations, with everyone's names. Both would benefit from output format limiting. Now we use the output context to extract the surrounding bytes or string, both to get a visual representation of what the match looks like and to process each matching context for further refinement before displaying it on the CLI. Initially I was using just the byte start location, with dd to extract the context. However, I quickly ran into issues with scans on compressed files (gz, zip, etc.), for which a byte location is a lot harder to extract. So I went back to %Q and post-processed the lines with some bash code to truncate them to 32b{match}32b. Then we ran into the k9s bug: a team member ran our script against a directory that happened to contain k9s, and that user also happened to search for kube in their -f fixed-string file. The context blew up.

Example command to reproduce (grab k9s from https://github.com/derailed/k9s); the 91 MB binary yields over 32 GB of output:

ugrep --format="%[§§]$%f%s%b%s%q%s%Q%~" "kube" ./k9s

We scan a lot of "random" files as a QC check for some internal processes, and our script has to handle unknown text and binary files gracefully. Your %[10]o pattern would make things a lot easier for us. Another option I was considering was somehow detecting that too much text is being received by our script and killing ugrep in that event.
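The "kill it when it produces too much" fallback could look roughly like the Python sketch below. It is a generic subprocess guard, not something ugrep provides; the name `run_with_output_cap` and the 4096-byte limit are arbitrary choices for the example, and a small Python child process stands in for the ugrep command so the sketch runs anywhere.

```python
import subprocess
import sys

def run_with_output_cap(cmd, max_bytes, chunk_size=65536):
    """Run cmd, keep at most max_bytes of its stdout, and kill it beyond that."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    kept = []
    total = 0
    truncated = False
    try:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:              # EOF: the child finished on its own
                break
            room = max_bytes - total
            if len(chunk) > room:      # cap exceeded: keep what fits, stop the child
                kept.append(chunk[:room])
                truncated = True
                proc.kill()
                break
            kept.append(chunk)
            total += len(chunk)
    finally:
        proc.stdout.close()
        proc.wait()
    return b"".join(kept), truncated

# Stand-in for a runaway scan: a child that writes about 1 MB to stdout.
noisy = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1_000_000)"]
output, truncated = run_with_output_cap(noisy, max_bytes=4096)
print(len(output), truncated)
```

In a real script the `cmd` list would be the ugrep invocation; the guard only bounds how much the caller reads and does not make the scan itself cheaper.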

genivia-inc commented 3 months ago

Very interesting use case! Thanks for sharing!

OK, so the large output is happening because we're matching very long lines and each match produces another long output with %O.
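A rough way to see why the output explodes: if every match prints its whole line, a line of L bytes containing k matches contributes about k × L bytes. The Python sketch below estimates this for a fixed string; it is a back-of-the-envelope model only (the helper name is invented here), and real output would be larger still once escaping and the other fields are added.

```python
def estimated_line_output(data: bytes, needle: bytes) -> int:
    """Estimate bytes printed if each match of needle outputs its entire line."""
    total = 0
    for line in data.split(b"\n"):
        # Each non-overlapping occurrence repeats the full line once.
        total += line.count(needle) * len(line)
    return total

# One 1012-byte "line" without newlines that contains three matches.
sample = b"x" * 1000 + b"kube" * 3
print(estimated_line_output(sample, b"kube"))  # 3036
```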

Note that %u can be used as a switch in the format to combine multiple matches on the same line to be output just once as a line. Still, this won't help with your problem to output matches with some byte context.

We can implement an extension of the %o, %q, %x, %c, %j, %v to optionally select context before/after the matching pattern to output. This could be %[n,m]o to output the matching pattern with n bytes/characters before and m bytes/characters after as context. The simpler form %[n] sets both n and m equal.

Would something like that work for you?

And perhaps %[n,k,m]o could output n bytes context before, k max bytes of the match (truncated), and m bytes context after. I don't know if that is also useful, it might help when the matching pattern is very long and we're not interested in seeing all of it.

Caveat: the n and m will be practically limited to the line endings, so the output will not extend beyond the line begin and ending even when n and m are large. We could go beyond, but there is no guarantee that we can go further before the line beginning, because the input buffer may have shifted out that part of the input that is no longer available. Or we have to also implement handlers similar to the way -ABC context works to retain lines before the matching line to show as context.
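The proposed semantics, including the caveat about line boundaries and the optional truncation of the match to k bytes, can be sketched in Python as follows. This models the behavior described in this comment on an in-memory buffer; it is not ugrep code, and `clamp_context` is a name invented for the illustration.

```python
from typing import Optional, Tuple

def clamp_context(data: bytes, start: int, end: int, n: int, m: int,
                  k: Optional[int] = None) -> Tuple[bytes, bytes, bytes]:
    """Return (before, match, after) for the match data[start:end].

    Up to n bytes before and m bytes after are returned, never crossing the
    begin or end of the line that holds the match. If k is given, the match
    itself is truncated to at most k bytes.
    """
    line_begin = data.rfind(b"\n", 0, start) + 1   # 0 when no newline precedes
    line_end = data.find(b"\n", end)
    if line_end == -1:                             # last line has no newline
        line_end = len(data)
    before = data[max(line_begin, start - n):start]
    after = data[end:min(line_end, end + m)]
    match = data[start:end] if k is None else data[start:end][:k]
    return before, match, after

text = b"first\nabcMATCHxyz\nlast"
s = text.index(b"MATCH")
print(clamp_context(text, s, s + 5, 10, 10))  # (b'abc', b'MATCH', b'xyz')
```

Even with n and m set to 10, only the three bytes available on the line are returned on each side, which is the "practically limited to the line endings" behavior.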

genivia-inc commented 3 months ago

I would also like to add the ability to output a group capture in CSV, JSON or XML which is currently not possible. A group capture is output with %[n]# but that's raw output. I want to add these new arguments to fields:

%[n]j       nth capture as JSON
%[n]x       nth capture as XML
%[n]v       nth capture as CSV
%[name]j    named capture as JSON
%[name]x    named capture as XML
%[name]v    named capture as CSV
%[n|...]j   capture n,... as JSON
%[n|...]x   capture n,... as XML
%[n|...]v   capture n,... as CSV

Now, we can also add a new field that specifies a context size to output a match. This can be combined with all of the above fields without adding unnecessary complexity to the arguments of these fields. These assume that option -P is used for group captures with PCRE2.

Note: %[n]o (raw group capture match, i.e. same as %[n]#) and %[n]c (C/C++ match) are also supported as extensions of the %o and %c fields.

genivia-inc commented 3 months ago

> Is there a way to do the equivalent of (%Q, %O, %C) that would do N bytes of context around the match? For example %Q32 would yield a limited C++ quoted escaped string of 32 bytes before and after a match.

How about:

%[-n]O      n chars before the match
%[+n]O      n chars after the match 
%[-n]Q      quoted n chars before
%[+n]Q      quoted n chars after 
%[-n]C      n chars before as C/C++
%[+n]C      n chars after as C/C++ 
%[-n]J      n chars before as JSON
%[+n]J      n chars after as JSON 
%[-n]X      n chars before as XML
%[+n]X      n chars after as XML 
%[-n]V      n chars before as CSV
%[+n]V      n chars after as CSV

Not 100% convinced yet on my end that this is a good approach. It's a bit much with all these listed doing the same thing, essentially. Also, should these be n chars or n bytes? Chars (Unicode) is probably more reasonable. Also, the number of chars before/after the match will be less than the given n when we hit the line's begin or end. We don't want to extend this beyond the line.
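The chars-versus-bytes question matters for UTF-8 input, because a byte count can cut a multi-byte character in half while a character count cannot. A small Python illustration, using an invented sample line and helper names that are not part of ugrep:

```python
def chars_before(line: str, needle: str, n: int) -> str:
    """Up to n Unicode characters preceding the first occurrence of needle."""
    start = line.index(needle)
    return line[max(0, start - n):start]

def bytes_before(line: str, needle: str, n: int) -> bytes:
    """Up to n UTF-8 bytes preceding the first occurrence of needle."""
    raw = line.encode("utf-8")
    start = raw.index(needle.encode("utf-8"))
    return raw[max(0, start - n):start]

line = "prénom: Zoë Müller"
as_chars = chars_before(line, "Müller", 2)   # whole characters: 'ë '
as_bytes = bytes_before(line, "Müller", 2)   # second half of 'ë' plus a space
print(repr(as_chars), as_bytes)
```

Here the two-byte context starts in the middle of the two-byte encoding of `ë`, so it is no longer valid UTF-8 on its own, whereas the two-character context is intact.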

genivia-inc commented 3 months ago

OK. A lot going on here on my end to extend the format fields for options --format and --replace.

I'm getting more comfortable with adding a couple of new fields and a new field argument {width} to control the width of a field and to support outputting the context of a match (see accompanying notes).

 field       output                      field       output
 ----------  --------------------------  ----------  --------------------------
 %%          %                           %[...]<     ... if %m = 1
 %~          newline (LF or CRLF)        %[...]>     ... if %m > 1
 %a          basename of matching file   %,          , if %m > 1, same as %[,]>
 %A          byte range of match in hex  %:          : if %m > 1, same as %[:]>
 %b          byte offset of a match      %;          ; if %m > 1, same as %[;]>
 %B %[...]B  ... + byte offset, if -b    %|          | if %m > 1, same as %[|]>
 %c          matching pattern as C/C++   %[...]$     assign ... to separator
 %C          matching line as C/C++      %[ms]=...%= color of ms ... color off
 %d          byte size of a match        --------------------------------------
 %e          end offset of a match       
 %f          pathname of matching file   Fields that require -P for captures:
 %F %[...]F  ... + pathname, if -H       
 %+          %F as heading/break, if -+  field       output
 %h          quoted "pathname"           ----------  --------------------------
 %H %[...]H  ... + "pathname", if -H     %1 %2...%9  group capture
 %j          matching pattern as JSON    %[n]#       nth group capture
 %J          matching line as JSON       %[n]b       nth capture byte offset
 %k          column number of a match    %[n]d       nth capture byte size
 %K %[...]K  ... + column number, if -k  %[n]e       nth capture end offset
 %l          last line number of match   %[n]j       nth capture as JSON
 %L          number of lines of a match  %[n]q       nth capture quoted
 %m          number of matches           %[n]x       nth capture as XML
 %M          number of matching lines    %[n]y       nth capture as hex
 %n          line number of a match      %[n]v       nth capture as CSV
 %N %[...]N  ... + line number, if -n    %[name]#    named group capture
 %o          matching pattern, also %0   %[name]b    named capture byte offset
 %O          matching line               %[name]d    named capture byte size
 %p          path to matching file       %[name]e    named capture end offset
 %q          quoted matching pattern     %[name]j    named capture as JSON
 %Q          quoted matching line        %[name]q    named capture quoted
 %R          newline, if --break         %[name]x    named capture as XML
 %s          separator (: by default)    %[name]y    named capture as hex
 %S %[...]S  ... + separator, if %m > 1  %[name]v    named capture as CSV
 %t          tab                         %[n|...]#   capture n,... that matched
 %T %[...]T  ... + tab, if -T            %[n|...]b   capture n,... byte offset
 %u          unique lines, unless -u     %[n|...]d   capture n,... byte size
 %[hhhh]U    U+hhhh Unicode code point   %[n|...]e   capture n,... end offset
 %v          matching pattern as CSV     %[n|...]j   capture n,... as JSON
 %V          matching line as CSV        %[n|...]q   capture n,... quoted
 %w          match width in wide chars   %[n|...]x   capture n,... as XML
 %x          matching pattern as XML     %[n|...]y   capture n,... as hex
 %X          matching line as XML        %[n|...]v   capture n,... as CSV
 %y          matching pattern as hex     %g          capture number or name
 %Y          matching line as hex        %G          all capture numbers/names
 %z          path in archive             %[t|...]g   text t indexed by capture
 %Z          edit distance cost, if -Z   %[t|...]G   all t indexed by captures
 --------------------------------------  --------------------------------------

Option -o changes the output of the %O and %Q fields to output the match only.

Options -c, -l and -o change the output of %C, %J, %X and %Y accordingly.

Numeric fields such as %n are padded with spaces when %{width}n is specified.

Matching line fields such as %O are cut to width when %{width}O is specified or
when %{-width}O is specified to cut from the end of the line.

Extra character context on a matching line before or after a match is output
when %{-width}o or %{+width}o is specified for match fields such as %o.

With these new {width} field parameters it is now possible to restrict the output of a matching line to at most 80 characters with %{80}O. We can also output a match with context, for example %{-10}o%o%{+10}o shows a match %o with up to 10 characters of context before and after it. This applies to all match output in C/C++, CSV, JSON, XML and hex (new with %Y and %y).
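For scripts that assemble the --format string programmatically, a small helper can keep the brace syntax straight. The Python sketch below builds a string in the style of the format used earlier in this thread, with the context fields described above; it assumes a ugrep version that includes these {width} parameters, and the helper name, the default JSON field and the set of accepted field letters are choices made for this example only.

```python
def context_format(before: int, after: int, field: str = "j", sep: str = "§") -> str:
    """Build a --format string: file, byte offset, then match with context.

    field is one of the match fields o, c, j, x, v or y named in this thread.
    """
    if before < 0 or after < 0:
        raise ValueError("context widths must be non-negative")
    if field not in ("o", "c", "j", "x", "v", "y"):
        raise ValueError("unsupported match field: " + field)
    return (
        f"%[{sep}]$%f%s%b%s"
        f"%{{-{before}}}{field}%{field}%{{+{after}}}{field}%~"
    )

fmt = context_format(32, 32)
print(fmt)  # %[§]$%f%s%b%s%{-32}j%j%{+32}j%~

# Hypothetical argument list for a subprocess call; the pattern and path
# are the ones from the reproduction example in this thread.
cmd = ["ugrep", "--format=" + fmt, "kube", "./k9s"]
```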

genivia-inc commented 3 months ago

Will this work for you?

I believe this is sufficiently flexible to cover many other use cases.

dingus9 commented 3 months ago

Yes, I think those options, specifically the %[-n]o width params, are exactly what I need. I will also likely make use of the group matching format support at some point as well! Thanks

genivia-inc commented 3 months ago

Ugrep v6.4 is released.

dingus9 commented 3 months ago

Let me go check it out! Thanks

genivia-inc commented 3 months ago

Let me know if you have any questions.

Since you're also searching binary files, you may want to exclude them with -I if that's unwanted. Or use %j field to output matches as JSON strings, which will include codes for non-printable characters. Otherwise, %o will output raw matches so it can be garbled when matching binary files.

dingus9 commented 3 months ago

Hey, I tested the new functionality and it's exactly what I needed. I'm using %o and then just replacing some of the non-printables as needed, which so far is working. Mostly \n, \r, \t, \f etc. if they come up. The k9s binary uses \f as a field delimiter internally, as some kind of string separator in their source, so it's pretty obvious if the replacements aren't being done. Anyhow, we want as much binary output as possible fed to users in context if it's a binary file. They can pump it to a hexdumper or something if they need to see what those bytes are in more detail.
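The replacement step described here might look like the following Python sketch. It is not the poster's actual script: it escapes the named control characters as \n, \r, \t and \f, and, as an added assumption, renders every other byte outside printable ASCII as \xHH and doubles backslashes so the result stays unambiguous.

```python
_NAMED = {0x0A: "\\n", 0x0D: "\\r", 0x09: "\\t", 0x0C: "\\f", 0x5C: "\\\\"}

def make_printable(raw: bytes) -> str:
    """Render raw match bytes as printable ASCII with visible escapes."""
    out = []
    for byte in raw:
        if byte in _NAMED:
            out.append(_NAMED[byte])          # named escapes and backslash
        elif 0x20 <= byte < 0x7F:
            out.append(chr(byte))             # printable ASCII passes through
        else:
            out.append(f"\\x{byte:02x}")      # everything else as hex escape
    return "".join(out)

print(make_printable(b"kube\x0cctl\x00\n"))  # kube\fctl\x00\n
```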

Overall I'm super happy with the new capability.

Thank you

genivia-inc commented 3 months ago

Interesting use case. Thanks for sharing.

One possible addition I thought of that could be useful is to add fields to output text matches/lines or output hex when the match/line is binary. Like option -W. I don't know if that addition would be useful. Perhaps %i and %I could be used for that, which are still unassigned.

genivia-inc commented 3 months ago

I'm stupid. We can just let options -W and -X change the %o and %O field output. No need for more fields.

Genivia / ugrep

--format option for bytes/characters of context with limit #414