Incorrect 'dump_out.nt' testcase string_multiline_12?

LewisGaul commented 3 years ago

The string_multiline_12 dumping testcase is failing on zig-nestedtext, and I think it's because of a bug in the testcase.

The input JSON contains the following, which looks as expected:

"the LF character \n\tASCII Linefeed (LF)\nthe CR character \r\tASCII Carriage Return (CR)\n"

But the expected output NestedText does not contain the CR - it is replaced with a LF:

$cat -A tests/official_tests/test_cases/string_multiline_12/load_in.nt
>$
> the BS character \^IBackslash (\)$
> the SQ character '^ISingle quote (')$
> the DQ character "^IDouble quote (")$
> the AB character ^G^IASCII Bell (BEL)$
> the BS character ^H^IASCII Backspace (BS)$
> the FF character ^L^IASCII Formfeed (FF)$
> the LF character $
> ^IASCII Linefeed (LF)$
> the CR character $
> ^IASCII Carriage Return (CR)$
> the HT character ^I^IASCII Horizontal Tab (TAB)$
> the VT character ^K^IASCII Vertical Tab (VT)$
> the ES character ^[^IASCII escape character as octal value$
> the ES character ^[^IASCII escape character as hex value$
>

I expected this to be:

$cat -A tests/official_tests/test_cases/string_multiline_12/load_in.nt
...
> the LF character $
> ^IASCII Linefeed (LF)$
> the CR character ^M> ^IASCII Carriage Return (CR)$
...

Admittedly this looks strange, but it can be explained as follows:

NestedText should be treating CR as a line ending (hence the >_ before the ^I, as with LF above)
WSL (where I ran this) doesn't treat CR as a line ending so displays it 'incorrectly'

Unless I'm missing something, I don't think the current behaviour can be justified as desirable because it means CR characters cannot be dumped (they're read in as 'generic line ending' and then dumped out as LF).

The way I handle this in zig-nestedtext is to keep the line endings of multiline string lines so that they can be preserved.

andrewchambers commented 3 years ago

I am also confused by this test case.

erikw commented 2 years ago

This is what the reference implementation does:

https://github.com/KenKundert/nestedtext/blob/master/nestedtext.py#L62

So I decided to do this as well for

https://github.com/erikw/nestedtext-ruby

to pass all the official tests.

LewisGaul commented 2 years ago

I still think there's a bug in this testcase, according to the language spec: https://nestedtext.org/en/stable/file_format.html

Quoting (emphasis mine):

Line breaks: A NestedText document is partitioned into lines where the lines are split by CR LF, CR, or LF where CR and LF are the ASCII carriage return and line feed characters. A single document may employ any or all of these ways of splitting lines.

String items: If the first non-space character on a line is a greater-than symbol followed immediately by a space (>␣) or a line break, the line is a string item. After comments and blank lines have been removed, adjacent string items with the same indentation level are combined in order into a multiline string. The string value is the multiline string with the tags removed. Any leading white space that follows the tag is retained, as is any trailing white space and all newlines except the last.

KenKundert commented 2 years ago

Sorry, for some reason GitHub is not notifying me of issues being raised in this repository.

Currently the Python version converts both \r\n and \r to \n when dumping. What is the value in retaining the \r? It will be interpreted as \n when the file is read back in.

LewisGaul commented 2 years ago

What is the value in retaining the \r? It will be interpreted as \n when the file is read back in.

Isn't one of the intentions with NestedText that you can represent any string? Converting \r to \n breaks that, breaking round-tripping of that input too.

FWIW I think the approach JSON takes is to escape special characters, so a literal "\r" (\\r) is needed to represent \r. I thought NestedText wouldn't want to go down this route though given the existing handling of whitespace.

KenKundert commented 2 years ago

The goal for NestedText was to represent any string without quoting or escaping. However, that is not possible. So I tend to say that a NestedText string can contain any printing character. But even that is not precisely true as a string may not contain line ending characters (\n and \r). Well, it can contain them, but it cannot distinguish them. That is because of the desire to read files produced on different operating systems, which causes us to interpret \r\n, \r, and \n as a line separators. So the goals of OS independence and allowing line ending characters in strings are in conflict.

As for round-tripping. It is supported for everything except line-ending characters. If we tried to maintain round-tripping on line-ending characters, we would end up with OS dependency issues.

KenKundert commented 2 years ago

One possible resolution to this issue would be to abandon \r as a valid line ending. It is only used by old Macs. I am far from the Apple sphere of influence, so I don't know if that would be a big loss. My belief is that nobody would notice us dropping support for \r as a line ending, and by doing so we would be a bit closer to our ideal of supporting all printing characters in strings. Furthermore, I think in the future most formats will settle on just supporting \r\n and \n as line endings.

LewisGaul commented 2 years ago

The goal for NestedText was to represent any string without quoting or escaping. However, that is not possible.

I'm confused. This seems entirely possible - as far as I'm aware, my Zig implementation does this, and in doing so fails this testcase.

All you have to do is keep the line endings on multi-line strings, that way a \r is represented as follows: >\r> (the carriage return will be displayed differently depending on editor/platform, so it may just look like two lines containing '>', but the key point is that the line ending of all but the last is kept). A similar thing is done with Zig's multiline string literals I believe, except they use \\ instead of >.

So the goals of OS independence and allowing line ending characters in strings are in conflict.

I'm not seeing this.

KenKundert commented 2 years ago

The testcase was constructed assuming that CR was a line ending. If you are failing the test then you are not treating it as a line ending.

There are a couple of situations where what you are doing differs from the current spec. First, if CR is a line ending, it should terminate end-of-line strings and comments:

key: this is a string that contains a CR (^M), it should raise an error if CR is a line ending.

It should raise an error because the carriage return (^M) terminates the line, leaving an extra line that does not have a key. We probably need to add this case to the official test suite.

The second way it differs is that when treated as a line break, the underlying implementation is free to convert the CR to the natural line ending for the environment its running in. So under Unix it would be converted to a LF and on Windows it would be converted to CR LF. Your approach would keep it as a CR.

So my proposal is similar to your implementation, except that it in my proposal CR is no longer treated as a line ending.

fidian commented 3 months ago

I saw something similar when implementing tests for a new JavaScript version of this library. If the intent is to change all newlines to CR internally, that should be documented. If it is to change it to OS-specific newlines, again that should be documented, though the test files would not support this because they assume the user is running under *nix.

I think this issue also bleeds into string_8 because the input JSON has no way to indicate that the dumped form should have DOS-style newlines. (See #11.)

KenKundert commented 3 months ago

A NestedText document is partitioned into lines where the lines are split by CR LF, CR, or LF where CR and LF are the ASCII carriage return and line feed characters. A single document may employ any or all of these ways of splitting lines.

The intent of this statement is that CR LF, CR, and LF are not distinguished from one another when reading a NestedText document. They all simply introduce a new line. Similarly when generating a NestedText document, any of these forms can be used to separate the lines.

The benefit of this approach is that the interpretation of a NestedText document is independent of the OS used to create it and it is very simple to implement.

The drawback of this approach is that NestedText loses the ability to distinguish between NL and CR. That does not seem like a huge loss to me, but I am open to counter arguments.

An alternative would be to simply treat LF and CR LF as introducing newlines, but not a lone CR. In this way NestedText would distinguish CR from NL, but it would lose compatibility with old Macs. As a Linux user, this also does not seem like a huge lost to me, but I don't feel like I am in a position to know how many of those old systems are still in use.

I can fix the issue with string_8.

As for clarifying the issue with CR, do you still feel that I need add something to the documentation to remove the confusion.

LewisGaul commented 3 months ago

The drawback of this approach is that NestedText loses the ability to distinguish between NL and CR. That does not seem like a huge loss to me, but I am open to counter arguments.

See our discussion near the beginning of this issue's comments - not being able to represent certain characters at all seems like a huge problem to me, and the claim that it's not possible to satisfy this doesn't hold as far as I can tell. Again, see previous comments.

KenKundert commented 3 months ago

If we were to distinguish between CR and LF, how would one distinguish a CR that is acting as data and one that acts as a line separator? The only easy way I see to do this is to simply to not allow CR as line separators. I notice that this is the approach taken by Vim, so I would be comfortable with this approach.

LewisGaul commented 3 months ago

If we were to distinguish between CR and LF, how would one distinguish a CR that is acting as data and one that acts as a line separator?

From https://github.com/KenKundert/nestedtext_tests/issues/5#issuecomment-1030662983:

All you have to do is keep the line endings on multi-line strings, that way a \r is represented as follows: >\r> (the carriage return will be displayed differently depending on editor/platform, so it may just look like two lines containing '>', but the key point is that the line ending of all but the last is kept). A similar thing is done with Zig's multiline string literals I believe, except they use \\ instead of >.

KenKundert commented 3 months ago

I don't see how keeping the line endings helps us here.

Consider the following multiline string with Linux line separators and an embedded carriage return:

   > abc \n
   > de \r ef \n
   > ghi

If NestedText treats \r as a line separator, then this example becomes the equivalent to:

   > abc \n
   > de \n
    ef \n
   > ghi

This results in a malformed multiline string.

I don't see any way to resolve this other than to no longer treat \r as a line separator.

fidian commented 3 months ago

I don't often need to embed newlines of any sort in my config files and in other places where I would expect to use NestedText's format. When I first wrote the JavaScript implementation, I captured the newlines associated with each line. When two lines got merged together, I preserved the original newline between those lines. This failed tests, so I had to merge with LF, though the code still captures the newline and could be reverted to put the original one back in.

I regularly see CR+LF and LF in a large amount of places today. CR by itself was only used in Classic Mac OS as far as I am aware, and the last release of that was over 20 years ago. It's unlikely that someone's setting up file exchange between that OS and another with the hopes of using NestedText simply because a parser would need to be written that can operate on that OS and there's very little development in that area. Keeping or dropping support doesn't bother me either way.

I don't see anywhere in the spec that strings are supposed to support every character. The current wording has "String values may contain any printing UTF-8 character."

When I said the documentation was unclear about newlines, I'm talking about this paragraph (slightly trimmed):

... adjacent string items with the same indentation level are combined in order into a multiline string. The string value is the multiline string with the tags removed. Any leading white space that follows the tag is retained, as is any trailing white space and all newlines except the last.

I took the last bit to mean that the newlines in the document are preserved with multiline strings. If the intent is to replace the newlines, the part about "all newlines except the last" should be rewritten. Perhaps something like this?

Any leading white space that follows the tag is retained, as is any trailing white space. The last newline is removed and all other newlines are converted to the default newline for the current operating system.

KenKundert commented 3 months ago

I not seeing a way to reliably support CR characters. If CR is considered a line termination character, in deference to the old Macs, then it will always act to terminate the line. But even if CR is not considered a line termination character, a CR at the end of the line on a Linux machine results in CRLF, which will also act to terminate the line.

I believe the current approach of treating CR, LF, and CRLF as line terminators is the only approach that provides consistent behavior with carriage returns.

KenKundert commented 3 months ago

@fidian I will make your recommended change to the description of multiline strings and keys.

LewisGaul commented 1 month ago

not being able to represent certain characters at all seems like a huge problem to me

I stand by this, so I'm concerned by the resolution here.

KenKundert commented 1 month ago

I don't think this is a resolvable problem unless NestedText just punts on specifying the line ending behavior and leaves it to the implementations.

The way Python handles this problem is that it allows the file open function to take a newline argument. If it is not specified or specified as None, Python treats \r, \r\n, and \n all as newlines. Otherwise newline explicitly specifies the expected newline character sequence, which may be "\r", "\r\n" or "\n".

Currently NestedText uses Python's default behavior, which it refers to as universal newlines. We could take the same approach: nominally NestedText accepts \r, \r\n, and \n all as newlines but this behavior can be overridden by the implementation.

LewisGaul commented 1 month ago

That's an interesting comparison, but in python if you have a file that contains the string "CR=\r LF=\n CRLF=\r\n" then you can read these characters unaffected using open(f, newline="").read(). On the other hand, as mentioned, there is currently no way to represent this string in NestedText. Every single UTF-8 character is supported except for \r and \n that are interpreted depending on platform. I can't imagine this being true of any other data representation language.

>>> import json, yaml, toml, nestedtext
>>> s = 'CR=\r LF=\n CRLF=\r\n'

>>> json.dumps(s)
'"CR=\\r LF=\\n CRLF=\\r\\n"'
>>> assert json.loads(_) == s

>>> yaml.dump(s)
'"CR=\\r LF=\\n CRLF=\\r\\n"\n'
>>> assert yaml.safe_load(_) == s

>>> toml.dumps({"": s})
'"" = "CR=\\r LF=\\n CRLF=\\r\\n"\n'
>>> assert toml.loads(_)[""] == s

>>> nestedtext.dumps(s)
'> CR=\n>  LF=\n>  CRLF=\n>'
>>> assert nestedtext.loads(_, top=str) == s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

My Zig implementation supports any newline character (by preserving the used newlines):

$ cat -v newlines.json
"CR=\r LF=\n CRLF=\r\n"

$ ./zig-out/bin/nt-cli -f newlines.json -F json -O nt > newlines.nt

$ cat -v newlines.nt
> CR=^M>  LF=
>  CRLF=^M
>
$ ./zig-out/bin/nt-cli -f newlines.nt -F nt -O json | cat -v
"CR=\r LF=\n CRLF=\r\n"

Admittely this does make the file a mess though! To be honest if anything is a case for some kind of escaping this might be it... as I also suggested at https://github.com/KenKundert/nestedtext/issues/50#issuecomment-2348820324.

fidian commented 1 month ago

In my experience, the result of mixing the newlines is ... mostly nothing. git will sometimes produce warnings but most of the time nobody knows or cares which newline is being used. It's exceedingly rare where the type of newline is important; when it is, I often see it as hex or escaped because it is non-printing. This could lead us into the realm of hard spaces, different types of tabs, and other non-printing nuances.

When I want to use a format like NestedText, I want to sidestep the newline issue. In a mixed-OS work environment with developers of different skill levels and automated tools touching the code, caring about the specific newlines being used is a headache. Avoiding that headache would be a priority of mine to make the code/configuration more manageable.

KenKundert commented 1 month ago

I think the issue is not whether carriage returns should be supported, but rather how they can be supported. With the multiline string syntax and by restricting leaf value to only be strings, NestedText eliminated the need for quoting and escaping. As such it can support all characters except newline characters, which act as line separators and not data.

@LewisGaul, you are giving examples of how the various end-of-line characters are supported in other languages, but those languages support character escaping, which NestedText does not have. As long as we are unwilling to add escaping to NestedText, those examples from other languages are not helpful.

Given that we would like to support all characters in NT, the current situation where CR, CRLF and NT are all taken to be the same is clearly undesirable. But, on the other hand, as @fidian points out, in most cases people have no need to distinguish between the various forms of newlines.

As I see it, we have a few choices.

We can continue to treat all three forms of newlines as equivalent, in which case we cannot distinguish between them.
We can treat lone CR as the data and only consider CR-LF and LF as newline characters. This works for Linux, Windows, and MacOS files, but abandons relic pre-MacOS Mac files. This seems like a small sacrifice.
We can leave it to the implementation as to what forms of newline to support with NestedText only recommending either 1 or 2, but not demanding it.

2 can be seen as a slight improvement over 1 in that it allows isolated CR to be treated as data, but end-of-line CR would disappear as they would taken to be a CR-LF end-of-line marker.

I like 1 more than 3 because it standardizes all NestedText files. Specifying the newline to the reader potentially allows either CR or LF to be treated as data in special cases, but the resulting NT files would not fit the standard. In other words, a single NestedText files could have 4 different interpretations and the interpretation would depend on what was passed to the newline argument of the reader.

Going with 1 does not preclude supporting CR and NL in NestedText. Once can always implement escaping in the schema.

LewisGaul commented 1 month ago

@LewisGaul, you are giving examples of how the various end-of-line characters are supported in other languages, but those languages support character escaping, which NestedText does not have. As long as we are unwilling to add escaping to NestedText, those examples from other languages are not helpful.

My point was that it's unheard of to use utf-8 but only support representing a subset of characters. This seems like a complete no-go to me, leaving the format as incomplete. The one data type (text) does not have support for its full domain!

I was also showing how it is possible to support this in nestedtext, although I can understand wanting to consider alternatives.

Once can always implement escaping in the schema.

Yes, or in the load implementation. I'm not entirely against making it a part of the implementation rather than the language spec, but if going that route I'd suggest it would be helpful for the spec docs to point out that it's highly recommended for implementations to support this, as otherwise it's impossible to represent certain characters. Having said that, it's not ideal that a nestedtext file passed from one person to another could be treated differently based on whether escapes are interpreted or not...

KenKundert commented 1 month ago

@LewisGaul your proposal effectively makes the newlines both data and end-of-line markers. This seems like a double-edge sword. Some times you might want the newlines to be data, but sometimes you might not. Consider the case where a NestedText files is created on a Windows machine and is read on a Linux machine. The Windows machine will place carriage returns before each newline, and when read on the Linux machine those carriage returns will show up at the end of every line of a multi-line string. This seems like we are trading a small obscure problem for a much bigger problem.

LewisGaul commented 1 month ago

I'm not sure I'd really propose what I have in zig-nestedtext, it's very unreadable. I just implemented it that way assuming that was the intended behaviour, since this "small obscure problem" is a complete dealbreaker from my perspective. More than happy to discuss unicode escapes instead.

KenKundert commented 1 month ago

Adding escaping would undermine one of the primary uses of NestedText: structured code.

My feeling is that if people really need carriage returns, they can implement the \r escapes in the schema.

KenKundert / nestedtext_tests

Incorrect 'dump_out.nt' testcase string_multiline_12? #5