Open benhoyt opened 3 years ago
I've thought about this a bit more, and I prefer the GoAWK behavior here, so I'm going to stick with it for now. Including the CR in the field seems against the spirit of FS=" " splitting the fields on whitespace and stripping the whitespace.
I am confused by this GoAWK behavior.
With the exception of goawk, other awk implementations are consistent in their handling of newline characters. (Testing is done on Ubuntu 20.04)
$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000 41 0d 42 0d 43 44 |A.B.CD|
$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000 41 0d 42 0d 43 44 |A.B.CD|
$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000 41 0d 42 0d 43 44 |A.B.CD|
$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000 41 0d 42 0d 43 44 |A.B.CD|
If you prefer the GoAWK behavior, how about setting the default value of RS to \r?\n?
$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
# See POSIX documentation below (nawk: awk version 20121220)
$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 0a 42 43 0a 44 |A.BC.D|
# It is fixed in the macOS 11.6.5 version of nawk (nawk: awk version 20200816)
$ printf "A\r\nB\rC\nD" | /usr/bin/awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000 41 42 0d 43 44 |AB.CD|
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html
RS The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
In my opinion, portability is important. And in any case, we need a way to treat \r as a normal character for compatibility.
Thanks, I'm going to reopen this issue to revisit this.
I don't want to have to care whether input text comes with \n or \r\n at the end of lines. And goawk makes this dream come true. With normal awk I can get code working on a unix-like system, deploy it to Windows (or process a file from Windows) and watch it crash and burn. Having to remember to say BEGIN{RS="\r?\n"} in every script is not a good solution.
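For scripts that must run under several awk implementations, one workaround that avoids a multi-character RS is to strip a trailing CR from each record before touching the fields. This is only a sketch: the function name `last_field_no_cr` is made up for illustration, and it assumes the awk in use accepts the `\r` escape in a regex (POSIX lists it). It also removes a lone CR at the end of a record, which may not be wanted if CR can be real data.

```shell
# Hypothetical helper (name is illustrative, not part of any awk distribution).
# Strips one trailing CR from each record, then prints the last field in brackets.
last_field_no_cr() {
  awk '{ sub(/\r$/, ""); print "[" $NF "]" }'
}

# Mixed CR LF and LF input; both lines should yield a clean last field.
printf 'a b\r\nc d\n' | last_field_no_cr
# [b]
# [d]
```

Modifying $0 with sub() makes awk re-split the fields, so $NF no longer carries the CR regardless of how the implementation read the record.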
Per discussion on issue #33 (from here down), GoAWK handles CR LF (Windows) line endings differently from gawk (I haven't tried awk or mawk). GoAWK doesn't include the CR in the field (because it's part of the line ending), whereas gawk does. I'm not sure if there are differences between gawk's handling on Windows and Linux.
I kinda think the GoAWK approach is more sensible and platform-native, but consistency with other AWKs is good too ... worth thinking about further.
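A quick way to check which behavior a given awk has is to measure the last field of a CR LF-terminated line. This is a rough probe, not a definitive test: the function name `probe_cr_in_field` is hypothetical, it takes a single command name as its argument (defaulting to `awk`), and the result depends on the implementation under test.

```shell
# Hypothetical probe (illustrative name). "$1" is the awk command to test;
# it defaults to whatever "awk" is on the PATH.
probe_cr_in_field() {
  printf 'a bc\r\n' | "${1:-awk}" '{ print length($NF) }'
}

# Prints 2 if the CR is not part of the last field, 3 if it is.
probe_cr_in_field
```

For example, `probe_cr_in_field gawk` and `probe_cr_in_field goawk` could be run side by side, assuming both are installed.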
Arnold Robbins said this: