benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support
https://benhoyt.com/writings/goawk/
MIT License

Dynamically changing RS #143

Open microo8 opened 2 years ago

microo8 commented 2 years ago

I've got a file where the first few bytes define some of the attributes of the file. The 9th byte is the record separator.

I need to read this file, set RS and then read the file "again" but now separated by this new record separator.

Input file (here the record separator is '):

UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303

This works on GNU awk:

BEGIN { RS=".{9}" }
NR==1 { $0=substr(RT,1,8); RS=substr(RT,9,1) }
{ print $0 }

output:

UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

but not on goawk:

UNA:+,? 

benhoyt commented 2 years ago

Interesting, thanks for the report! This is a tricky one. It seems that Gawk (and other AWKs) allow you to set RS at any time while reading an input file: the new RS takes effect immediately and is used to parse the rest (the unread part) of the file. However, GoAWK uses bufio.Scanner on each input file, which doesn't have an API for changing the split function once scanning has started (and some of the unread data would already be sitting in the scanner's buffer).
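To make the buffering problem concrete, here is a minimal Go sketch (not GoAWK's actual code; the function name leftoverAfterFirstLine and the sample input are made up for illustration). It scans a single line with bufio.Scanner and then checks what the underlying reader still has to offer. For a small input, the scanner has already pulled everything into its own buffer, so a second scanner created on the same reader would see nothing.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// leftoverAfterFirstLine (hypothetical helper) scans one newline-terminated
// token with bufio.Scanner, then returns that token together with whatever
// the underlying reader can still deliver afterwards.
func leftoverAfterFirstLine(input string) (first, leftover string) {
	r := strings.NewReader(input)
	sc := bufio.NewScanner(r)
	if sc.Scan() {
		first = sc.Text()
	}
	// The scanner reads ahead in large chunks, so for a short input the
	// reader is already drained here; the unread records live only in the
	// scanner's private buffer.
	rest, _ := io.ReadAll(r)
	return first, string(rest)
}

func main() {
	first, leftover := leftoverAfterFirstLine("header\nrec1'rec2'rec3\n")
	fmt.Printf("first=%q leftover=%q\n", first, leftover)
}
```

With this short input the leftover comes back empty, which is why simply starting a fresh scanner with a new separator on the same file handle loses data.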

I can reproduce your case if I save your input file to rstest.in and the program to rstest.awk:

$ gawk -f rstest.awk rstest.in 
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ goawk -f rstest.awk rstest.in 
UNA:+,? 

... lots more blank lines ...

303

$

However, that program doesn't work in original-awk or mawk either, I guess because of the use of the Gawk-only RT variable. Here's a more portable program that shows the same "dynamic setting of RS" issue:

$ cat rstest2.awk
NR==1 { RS=substr($0,9,1) }
NR>1  { print $0 }
$ cat rstest.in rstest.in >rstest2.in
$ gawk -f rstest2.awk rstest2.in  # original-awk and mawk have the same output now
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$ goawk -f rstest2.awk rstest2.in 
UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303
$ 

To work around this in GoAWK for now, I'd recommend actually reading (part of) the file twice. Note how rstest.in is specified twice on the command line. This works in GoAWK and other AWKs:

$ cat rstest3.awk 
NR==1   { RS=substr($0,9,1); next }
NR!=FNR { print $0 }
$ goawk -f rstest3.awk rstest.in rstest.in
UNA:+,? 
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303

$ 

That said, I think this is a bug (or at least a quirk) of GoAWK, so I'm going to leave it open. I'm not sure of the best way to fix it without revamping the use of bufio.Scanner. I think I'd need a scanner variant that can transfer the remaining/buffered bytes to a new scanner when dynamically changing RS.

benhoyt commented 2 years ago

@arnoldrobbins, any thoughts on this? Where is this behaviour (that one can change RS part way through a file) documented, or is it just assumed that this will work? I couldn't find it explicitly documented from a scan of RS in the Gawk manual, though I may have missed it.

arnoldrobbins commented 2 years ago

It's just assumed it will work. RS is like any other variable that you can change at any time you like. I agree with your assessment, that this is a bug in GoAWK. In C this is handled fairly naturally; there's a buffer, RS matches the end of the text, and then you start again with whatever is in the current value of RS to find the next end of the buffer (with appropriate buffer management and filling from the file). HTH.
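The buffer-based approach described above can be sketched in Go as follows. This is only an illustration under simplifying assumptions: RS is a single byte (no regex separators, no paragraph mode), and recordReader and splitDynamic are hypothetical names, not part of GoAWK or Gawk. The key point is that the separator is looked up on every call, while the buffered data stays with the reader.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// recordReader is a hypothetical type: one shared buffer plus a separator
// that is consulted afresh for every record.
type recordReader struct {
	r  *bufio.Reader
	RS byte // current record separator; may be changed between calls
}

// next returns the next record, split on the *current* value of RS.
func (rr *recordReader) next() (string, bool) {
	rec, err := rr.r.ReadString(rr.RS)
	if err != nil {
		// EOF (or a read error): hand back a final partial record, if any.
		return rec, rec != ""
	}
	return rec[:len(rec)-1], true // strip the trailing separator
}

// splitDynamic mimics the portable AWK program from this thread: the first
// record is read with RS="\n", then RS becomes that record's 9th byte.
func splitDynamic(input string) []string {
	rr := &recordReader{r: bufio.NewReader(strings.NewReader(input)), RS: '\n'}
	var recs []string
	for {
		rec, ok := rr.next()
		if !ok {
			break
		}
		recs = append(recs, rec)
		if len(recs) == 1 && len(rec) >= 9 {
			rr.RS = rec[8] // takes effect on the very next read
		}
	}
	return recs
}

func main() {
	for _, rec := range splitDynamic("UNA:+,? 'A+1'B+2\nUNA:+,? 'A+1'B+2\n") {
		fmt.Printf("%q\n", rec)
	}
}
```

Because bufio.Reader.ReadString takes the delimiter as an argument on each call, no buffered bytes need to be moved when the separator changes.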

benhoyt commented 2 years ago

I think what I'll do here (at some point) is copy the bufio.Scanner implementation into the GoAWK codebase, add a Buffered() io.Reader method (similar to encoding/json's Decoder.Buffered), and then use that when RS is changed in the middle of reading a file. If Buffered() works out well, I may propose adding Buffered to Go's bufio.Scanner.
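For reference, this is how the existing encoding/json analogue behaves. bufio.Scanner has no Buffered method today, so the sketch below uses json.Decoder.Buffered purely to show the pattern: consume one value, then stitch the decoder's read-ahead bytes back in front of the underlying reader with io.MultiReader. The function name restAfterFirstValue and the sample input are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// restAfterFirstValue (hypothetical helper) decodes one JSON value from the
// input, then recovers every byte that follows it.
func restAfterFirstValue(input string) (map[string]string, string, error) {
	r := strings.NewReader(input)
	dec := json.NewDecoder(r)

	var header map[string]string
	if err := dec.Decode(&header); err != nil {
		return nil, "", err
	}

	// dec.Buffered() exposes the bytes the decoder read ahead but did not
	// consume; chaining it with r yields the complete unread remainder.
	rest, err := io.ReadAll(io.MultiReader(dec.Buffered(), r))
	return header, string(rest), err
}

func main() {
	header, rest, err := restAfterFirstValue(`{"rs":"'"}A+1'B+2'C+3`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("header=%v rest=%q\n", header, rest)
}
```

A scanner with an equivalent method could presumably hand its unread bytes to a new scanner configured with the updated RS in the same way.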

janxkoci commented 3 months ago

I remember this fun example in the Gawk book that uses RS+print to implement sed-like find-and-replace - the RS is updated in every cycle of the implicit loop while reading the input. The idea is credited to Mike Brennan, so probably it's portable to mawk at minimum.

arnoldrobbins commented 3 months ago

Actually, at the moment, only gawk supports RT, which this program uses. Maybe one day RT will find its way into other awks.

janxkoci commented 3 months ago

Oh, I missed that! Also, I just read the page again and RS is only set once (in the BEGIN block, which itself usually implies "once"). So I was wrong on multiple fronts :facepalm: