Open microo8 opened 2 years ago
Interesting, thanks for the report! This is a tricky one. It seems that GNU Gawk (and other AWKs) allow you to set RS at any time when reading from an input file, and they'll dynamically update RS and then read/parse the rest (the unread part) of the file. However, GoAWK uses bufio.Scanner on each input file, which doesn't have an API that allows dynamically updating the record splitting as you read (some of the data already read would still be in its buffer).
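To make the buffering problem concrete, here is a minimal Go sketch (not GoAWK's actual code). The helper name readAhead and the sample input are made up for illustration. It scans a single newline-terminated record and then checks what is still unread in the underlying reader; for a small input the scanner has already pulled everything into its own buffer, so a second scanner created with a different separator would have nothing left to read.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// readAhead is a hypothetical helper: it scans one newline-terminated
// record from input, then reports that record plus whatever is still
// unread in the underlying reader.
func readAhead(input string) (first, leftover string) {
	src := strings.NewReader(input)
	sc := bufio.NewScanner(src) // default split function: bufio.ScanLines
	if sc.Scan() {
		first = sc.Text()
	}
	// Anything the scanner has already buffered is gone from src.
	rest, _ := io.ReadAll(src)
	return first, string(rest)
}

func main() {
	first, leftover := readAhead("UNA:+,? '\nUNB+1'UNH+2'")
	fmt.Printf("first record:   %q\n", first)
	fmt.Printf("left in source: %q\n", leftover)
}
```

The second value comes back empty even though only one record was consumed, which is why simply swapping in a new scanner after RS changes loses data.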
I can reproduce your case if I save your input file to rstest.in and the program to rstest.awk:
$ gawk -f rstest.awk rstest.in
UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$ goawk -f rstest.awk rstest.in
UNA:+,?
... lots more blank lines ...
303
$
However, that program doesn't work in original-awk or mawk either, I guess because of the use of the Gawk-only RT variable. Here's a more portable program that shows the same "dynamic setting of RS" issue:
$ cat rstest2.awk
NR==1 { RS=substr($0,9,1) }
NR>1 { print $0 }
$ cat rstest.in rstest.in >rstest2.in
$ gawk -f rstest2.awk rstest2.in # original-awk and mawk have the same output now
UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$ goawk -f rstest2.awk rstest2.in
UNA:+,? 'UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL'UNH+1+MSCONS:D:04B:UN:2.3'BGM+7+EC33218279A-1+9'DTM+137:202012310206:203'RFF+Z13:13018'NAD+MS+4042805000102::9'NAD+MR+4016001000655::9'UNS+D'NAD+DP'LOC+172+DE00108108359V0000000000000088446'DTM+163:202012300000?+01:303
$
To work around this in GoAWK for now, I'd recommend actually reading (part of) the file twice. Note how rstest.in is specified twice on the command line. This works in GoAWK and other AWKs:
$ cat rstest3.awk
NR==1 { RS=substr($0,9,1); next }
NR!=FNR { print $0 }
$ goawk -f rstest3.awk rstest.in rstest.in
UNA:+,?
UNB+UNOC:3+4042805000102:14+4016001000655:14+201231:0206+EC33218279A++TL
UNH+1+MSCONS:D:04B:UN:2.3
BGM+7+EC33218279A-1+9
DTM+137:202012310206:203
RFF+Z13:13018
NAD+MS+4042805000102::9
NAD+MR+4016001000655::9
UNS+D
NAD+DP
LOC+172+DE00108108359V0000000000000088446
DTM+163:202012300000?+01:303
$
That said, I think this is a bug (or at least a quirk) of GoAWK, so I'm going to leave it open. I'm not sure of the best way to fix it without revamping the use of bufio.Scanner. I think I'd need a scanner variant that can transfer the remaining/buffered bytes to a new scanner when dynamically changing RS.
@arnoldrobbins, any thoughts on this? Where is this behaviour (that one can change RS part way through a file) documented, or is it just assumed that this will work? I couldn't find it explicitly documented from a scan of RS in the Gawk manual, though I may have missed it.
It's just assumed it will work. RS is like any other variable that you can change at any time you like. I agree with your assessment, that this is a bug in GoAWK. In C this is handled fairly naturally; there's a buffer, RS matches the end of the text, and then you start again with whatever is in the current value of RS to find the next end of the buffer (with appropriate buffer management and filling from the file). HTH.
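The buffer-based approach described above could be sketched in Go roughly as follows. This is an illustration under stated assumptions, not Gawk's or GoAWK's implementation: the recordReader type, its Sep field, and the splitDynamic helper are all hypothetical names, the separator is treated as a plain non-empty byte string (no regex RS, no paragraph mode), and any read error is treated as end of input.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// recordReader is a hypothetical type. It owns its buffer and consults
// Sep on every call, so changing Sep affects only the unread remainder.
type recordReader struct {
	src io.Reader
	buf []byte // bytes read from src but not yet returned
	eof bool
	Sep []byte // current separator; must be non-empty
}

// Next returns the next record, split on the current value of Sep.
func (r *recordReader) Next() (string, bool) {
	for {
		if i := bytes.Index(r.buf, r.Sep); i >= 0 {
			rec := string(r.buf[:i])
			r.buf = r.buf[i+len(r.Sep):]
			return rec, true
		}
		if r.eof {
			if len(r.buf) == 0 {
				return "", false
			}
			rec := string(r.buf) // final record without a trailing separator
			r.buf = nil
			return rec, true
		}
		chunk := make([]byte, 4096)
		n, err := r.src.Read(chunk)
		r.buf = append(r.buf, chunk[:n]...)
		if err != nil {
			r.eof = true // simplification: every error ends the input
		}
	}
}

// splitDynamic mimics rstest2.awk: read record 1 using "\n", take the new
// separator from its 9th byte, and collect every later record.
func splitDynamic(input string) []string {
	r := &recordReader{src: strings.NewReader(input), Sep: []byte("\n")}
	var out []string
	for n := 1; ; n++ {
		rec, ok := r.Next()
		if !ok {
			break
		}
		if n == 1 {
			if len(rec) >= 9 {
				r.Sep = []byte{rec[8]}
			}
			continue
		}
		out = append(out, rec)
	}
	return out
}

func main() {
	for _, rec := range splitDynamic("UNA:+,? 'UNB+1'\nUNA:+,? 'UNB+1'UNH+2'\n") {
		fmt.Printf("%q\n", rec)
	}
}
```

Because the unread bytes stay in one buffer owned by the reader, nothing has to be handed over when the separator changes; the next call just searches the same buffer with the new value.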
I think what I'll do here (at some point) is copy the bufio.Scanner implementation into the GoAWK codebase, add a Buffered() io.Reader method (similar to encoding/json's Decoder.Buffered), and then use that if RS is changed in the middle of reading a file. If Buffered() works out well, I'd then propose adding Buffered to Go's bufio.Scanner.
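For reference, this is roughly how the existing encoding/json Decoder.Buffered method is used to recover read-ahead bytes; a Buffered() method on a copied scanner would presumably be used the same way. The sketch below is only an analogy: decodeThenRest is a hypothetical helper, and the input string is made up. It decodes one JSON value, then chains the decoder's buffered bytes in front of the unread source with io.MultiReader so nothing is lost.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// decodeThenRest is a hypothetical helper: it decodes one JSON value from
// input and returns everything that follows it, including bytes the
// decoder had already read ahead into its own buffer.
func decodeThenRest(input string) (interface{}, string, error) {
	src := strings.NewReader(input)
	dec := json.NewDecoder(src)

	var value interface{}
	if err := dec.Decode(&value); err != nil {
		return nil, "", err
	}

	// Buffered bytes come first, then whatever src has not yet given up.
	remaining, err := io.ReadAll(io.MultiReader(dec.Buffered(), src))
	if err != nil {
		return nil, "", err
	}
	return value, string(remaining), nil
}

func main() {
	value, rest, err := decodeThenRest(`{"rs":"'"}A'B'C`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("value: %v\n", value)
	fmt.Printf("rest:  %q\n", rest)
}
```

In the RS case, the combined reader would be handed to a fresh scanner configured with the new separator.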
I remember this fun example in the Gawk book that uses RS+print to implement sed-like find-and-replace - the RS is updated in every cycle of the implicit loop while reading the input. The idea is credited to Mike Brennan, so probably it's portable to mawk at minimum.
Actually, at the moment, only gawk supports RT, which this program uses. Maybe one day RT will find its way into other awks.
Oh, I missed that! Also, I just read the page again and RS is only set once (in the BEGIN block, which itself usually implies "once"). So I was wrong on multiple fronts :facepalm:
I've got a file where the first few bytes define some of the attributes of the file. The 9th byte is the record separator.
I need to read this file, set RS, and then read the file "again", but now separated by this new record separator.
Input file (here the record separator is '):
This works on GNU awk:
output:
but not on goawk: