benhoyt / goawk

A POSIX-compliant AWK interpreter written in Go, with CSV support
https://benhoyt.com/writings/goawk/
MIT License

Consider making handling of CR LF newlines more consistent with Gawk #51

Open benhoyt opened 3 years ago

benhoyt commented 3 years ago

Per discussion on issue #33 (from here down), GoAWK handles CR LF (Windows) line endings differently from gawk (I haven't tried awk or mawk). GoAWK doesn't include the CR in the field (because it's part of the line ending), whereas Gawk does. I'm not sure if there are differences between Gawk's handling on Windows and Linux.

I kinda think the GoAWK approach is more sensible and platform-native, but consistency with other AWKs is good too ... worth thinking about further.

Arnold Robbins said this:

Gawk is consistent. RS has the default value of \n and that is what terminates records. As far as gawk is concerned, the \r is no different from any other character, which is why it appears as part of the last field in the record.

That said, on Windows, I believe the default is to work in text mode, in which case gawk never sees the \r\n line ending, it only sees \n. One can use BINMODE to force gawk to see those characters, in which case you would need to set RS = "\r?\n" in order to get correct processing.

Take the Windows advice with a grain of salt. I have not used a Windows system directly in over two years, and when I did I used Cygwin, so some experimentation may be in order.

If one is processing a Windows file on Linux, then one should use a utility like dos2unix on the file, or tr, before sending the data to GoAwk, which does not (yet! hint, hint) allow RS to be a regular expression. Using GoAwk on Windows, well, you'll have to figure out what the Go runtime is handing off to your code.
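The preprocessing route suggested above can be sketched as follows. This is only an illustration: `strip_cr` is a hypothetical helper name, not part of any AWK implementation, and `tr -d '\r'` removes every CR in the stream, including a lone CR in the middle of a line, so it is blunter than matching only CR LF pairs.

```shell
# Hypothetical helper: delete every carriage return from stdin
# so that AWK only ever sees LF-terminated records.
strip_cr() {
    tr -d '\r'
}

# After stripping, the last field no longer carries a trailing CR,
# so its length is the same in any AWK implementation.
printf 'a b\r\nc d\r\n' | strip_cr | awk '{ print length($NF) }'
```

If lone CRs are meaningful data in the input, a tool that only rewrites CR LF pairs (such as dos2unix, mentioned above) is the safer choice.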

benhoyt commented 2 years ago

I've thought about this a bit more, and I prefer the GoAWK behavior here, so I'm going to stick with it for now. Including the CR in the field seems against the spirit of FS=" " splitting the fields on whitespace and stripping the whitespace.
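A quick way to see which behavior a given AWK has is to measure the last field of a CR LF terminated line. This is a minimal sketch; `last_field_len` is a hypothetical helper, and the result depends on the implementation being tested.

```shell
# Hypothetical helper: report the length of the last field of "a b" + CR LF.
# An AWK that keeps the CR as ordinary data in the field reports 2 ("b" plus CR);
# an AWK that drops the CR (with the line ending or as whitespace) reports 1.
# The optional first argument names the AWK binary to test (defaults to awk).
last_field_len() {
    printf 'a b\r\n' | "${1:-awk}" '{ print length($NF) }'
}

last_field_len
```

Passing a binary name, for example `last_field_len gawk`, checks that implementation instead of whichever `awk` is first on the PATH.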

ko1nksm commented 2 years ago

I am confused by this spec of goawk.

With the exception of goawk, other awk implementations are consistent in their handling of newline characters. (Testing is done on Ubuntu 20.04)

$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\n"} {printf $0}' | hexdump -C
00000000  41 0d 42 0d 43 44                                 |A.B.CD|

If you prefer the GoAWK behavior, how about setting the default value of RS to \r?\n?

$ printf "A\r\nB\rC\nD" | goawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | mawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | gawk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

$ printf "A\r\nB\rC\nD" | busybox awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

# See POSIX documentation below (nawk: awk version 20121220)
$ printf "A\r\nB\rC\nD" | original-awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 0a 42 43 0a 44                                 |A.BC.D|

# It is fixed in the macOS 11.6.5 version of nawk (nawk: awk version 20200816)
$ printf "A\r\nB\rC\nD" | /usr/bin/awk 'BEGIN{RS="\r?\n"} {printf $0}' | hexdump -C
00000000  41 42 0d 43 44                                    |AB.CD|

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html

RS The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

In my opinion, portability is important. And in any case, we need a way to treat \r as a normal character for compatibility.
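Since the POSIX text quoted above leaves a multi-character RS unspecified, one portable alternative is to keep the default RS and strip a trailing CR inside the script. The following is a sketch under that assumption, not a statement of how any implementation should behave; it relies only on `sub()` and the `\r` escape in a regular expression.

```shell
# Remove a trailing CR from each record before the main rule runs.
# Modifying $0 with sub() causes the fields to be re-split, so $NF is clean.
# In an AWK that already drops the CR, the sub() call is simply a no-op.
printf 'a b\r\nc d\n' |
    awk '{ sub(/\r$/, "") } { print NF ":" $NF }'
```

This handles both CR LF and plain LF input in the same script, at the cost of one extra substitution per record.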

benhoyt commented 2 years ago

Thanks, I'm going to reopen this issue to revisit this.

mikegleen commented 5 months ago

I don't want to have to care whether input text comes with \n or \r\n at the end of lines. And goawk makes this dream come true. With normal awk I can get code working on a unix-like system, deploy it to Windows (or process a file from Windows) and watch it crash and burn. Having to remember to say BEGIN{RS="\r?\n"} in every script is not a good solution.