benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support
https://benhoyt.com/writings/goawk/
MIT License

[Feature Request] Support case when end range pattern is not distinct from start pattern #164

Closed balki closed 1 year ago

balki commented 1 year ago

Example log file

ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 ERROR Got Exception in module foo
                         1. Traceback (most recent call last):
                         1.   File "/tmp/teste.py", line 9, in <module>
                         1.     run_my_stuff()
                         1. NameError: name 'run_my_stufff' is not defined
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:17:33 ERROR Got Exception in module foo
                         2. Traceback (most recent call last):
                         2.   File "/tmp/teste.py", line 9, in <module>
                         2.     run_my_stuff()
                         2. NameError: name 'run_my_stufff' is not defined
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 INFO Blah Blah Blah

I am trying to extract the error line "Got Exception in module foo" along with the traceback that follows it.

First attempt:

❯ goawk '/Got Excep/,/^ts:/' data.txt
ts:Jan 11 12:16:33 ERROR Got Exception in module foo
ts:Jan 11 12:17:33 ERROR Got Exception in module foo

This does not work because the end pattern /^ts:/ also matches the error line, so the range begins and ends on that single line. There is no easy way to match the last line of the exception or the next log line. I eventually found a working solution, but it is no longer a one-liner and is not straightforward to understand.

Solution:

❯ goawk '
1 { endcond = 0 }
/Got Excep/ , endcond {
    if (/^ts:/ && !/Got Excep/)
        endcond = 1
    else
        print $0
}
' data.txt

ts:Jan 11 12:16:33 ERROR Got Exception in module foo
                         1. Traceback (most recent call last):
                         1.   File "/tmp/teste.py", line 9, in <module>
                         1.     run_my_stuff()
                         1. NameError: name 'run_my_stufff' is not defined
ts:Jan 11 12:17:33 ERROR Got Exception in module foo
                         2. Traceback (most recent call last):
                         2.   File "/tmp/teste.py", line 9, in <module>
                         2.     run_my_stuff()
                         2. NameError: name 'run_my_stufff' is not defined

Can we have a command line flag or special syntax so that the end pattern is not checked on the first line of the range? e.g.

❯ goawk --no-end-check '/Got Excep/,/^ts:/' data.txt

Or use a double comma (,,) to enable this behavior. This is currently a syntax error, so it should be backwards compatible.

❯ goawk '/Got Excep/,,/^ts:/' data.txt
<cmdline>:1:13: expected expression instead of ,
/Got Excep/,,/^ts:/

benhoyt commented 1 year ago

Yes, this is slightly tricky, isn't it? I'd rather not introduce new syntax and range pattern types above and beyond POSIX here, so I'd suggest not using a range pattern for this, but two patterns with a flag. Similar to your endcond solution but a bit simpler (and one line :-).

You have to be careful with the order of the patterns. The /^ts:/ { e=0 } pattern-action must come first, so that /Got Excep/ { e=1 } sets e back to 1 on the error line before the e { print } pattern is evaluated, and the "Got Exception" line is printed:

$ goawk '/^ts:/ { e=0 }  /Got Excep/ { e=1 }  e { print }' data.txt
ts:Jan 11 12:16:33 ERROR Got Exception in module foo
                         1. Traceback (most recent call last):
                         1.   File "/tmp/teste.py", line 9, in <module>
                         1.     run_my_stuff()
                         1. NameError: name 'run_my_stufff' is not defined
ts:Jan 11 12:17:33 ERROR Got Exception in module foo
                         2. Traceback (most recent call last):
                         2.   File "/tmp/teste.py", line 9, in <module>
                         2.     run_my_stuff()
                         2. NameError: name 'run_my_stufff' is not defined

You can even shorten it slightly more by dropping the { print } on the last pattern, as that's the default:

$ goawk '/^ts:/ { e=0 }  /Got Excep/ { e=1 }  e' data.txt
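
To make the ordering dependence concrete, here is a commented multi-line sketch of the same three rules. It runs plain `awk` rather than `goawk`, on the assumption that any POSIX awk behaves the same for these rules. The file `sample.txt` is a made-up, trimmed-down stand-in for `data.txt`, and `in_error` is just a more descriptive name for `e`.

```shell
# Hypothetical trimmed-down sample; not the full data.txt from the issue.
cat > sample.txt <<'EOF'
ts:Jan 11 12:16:33 INFO Blah Blah Blah
ts:Jan 11 12:16:33 ERROR Got Exception in module foo
    Traceback (most recent call last):
    NameError: name run_my_stufff is not defined
ts:Jan 11 12:16:33 INFO Blah Blah Blah
EOF

# Rules run top to bottom for every input line, so the reset must come
# before the set, and the print rule must come last.
out=$(awk '
/^ts:/      { in_error = 0 }   # any new log line ends the current block
/Got Excep/ { in_error = 1 }   # ...unless that line itself starts a block
in_error                       # no action given, so the line is printed
' sample.txt)

printf '%s\n' "$out"
```

On this sample, only the ERROR line and the two indented traceback lines are printed. Swapping the first two rules would clear the flag again on the ERROR line itself, so nothing would be printed.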

The Gawk manual also has a couple of examples for range patterns that might be useful (though they don't quite fit what you're doing here).

Hope that helps!

balki commented 1 year ago

Thanks! Not obvious at first glance, but concise and clear.

$ goawk '/^ts:/ { e=0 }  /Got Excep/ { e=1 }  e' data.txt