Missing testcases, suggested improvements

LewisGaul commented 3 years ago

Hey :) I've got all but the following testcases passing in zig-nestedtext at this point:

    "dict_02", // ? (dumping keys with newlines)
    "dict_03", // Key quoting - to be removed from spec
    "dict_05", // Root-level leading whitespace (bug...)
    "dict_17", // Key quoting - to be removed from spec
    "dict_21", // Unrepresentable
    "dict_22", // Unrepresentable
    "dict_23", // Key quoting - to be removed from spec
    "empty_1", // Bad testcase - empty file maps to null??
    "holistic_1", // Key quoting - to be removed from spec
    "list_5", // Root-level leading whitespace (bug...)
    "string_multiline_07", // Root-level leading whitespace (bug...)

I have some questions on minor details of the spec, as well as a few edge cases that I don't think are currently covered by the testsuite:

The spec says "Comments are lines that have # as the first non-space character on the line.", but is there any reason to disallow tabs before the # given the indentation is insignificant?
Indentation: "An increase in the number of spaces in the indentation signifies the start of a nested object. Indentation must return to a prior level when the nested object ends." - should this be more specific in saying that the indentation must return to exactly the depth of a previous indentation? Here is a testcase that I think should be added, expecting an error (I don't think I handle this yet):
```
a:
foo:
  > 1
bar: 2
```
Line endings: "lines are split by CR LF, CR, or LF... A single document may employ any or all of these ways of splitting lines." - this is very clear, and makes sense given the line ending on multiline strings is incorporated into the value. I know there's at least one testcase testing a CR line ending (this caught a bug in my implementation!), but I'm unsure if there's testing of all three possibilities (e.g. \r\n should be treated as a single line ending - relevant for reporting line number in errors).
Dumping newline characters (dict_02) - the spec says nothing about how to dump data structures to NestedText, only how to parse a NestedText file, which makes sense. I'm not sure it should be an error to dump an object with a newline in the key, e.g. see Python's json module:
```
>>> json.dumps({"\n":1})
'{"\\n": 1}'
```
I'm not entirely clear on the need for disallowing root-level whitespace, in fact I can't see a mention of it in the spec, but the tests seem to enforce it (dict_05, list_5, string_multiline_07).
How should other unicode whitespace characters be treated? (other than space, tab, CR and LF)

LewisGaul commented 3 years ago

Indentation: "An increase in the number of spaces in the indentation signifies the start of a nested object. Indentation must return to a prior level when the nested object ends." - should this be more specific in saying that the indentation must return to exactly the depth of a previous indentation? Here is a testcase that I think should be added, expecting an error (I don't think I handle this yet):

Apologies, I just noticed that dict_10, dict_11 and string_multiline_05 testcases cover this. The suggestion about clarifying this slightly in the spec still stands though :)

Dumping newline characters (dict_02) - the spec says nothing about how to dump data structures to NestedText, only how to parse a NestedText file, which makes sense. I'm not sure it should be an error to dump an object with a newline in the key, e.g. see Python's json module:
>>> json.dumps({"\n":1})
'{"\\n": 1}'

I've realised that this round-trips in JSON (json.loads('{"\\n": 1}') == {"\n":1}), but in NestedText it will be interpreted literally:

>>> nestedtext.loads('\\n: 1')
{'\\n': '1'}

This relates to the discussion in https://github.com/KenKundert/nestedtext/issues/23 about being able to represent arbitrary data - with this in mind I can see the argument for supporting multiline keys. This is reflected very concretely when comparing the Zig JSON dump API to what's currently needed for the NestedText dump API - the former cannot return an error but the latter can. (It would be a better user experience if it was made impossible to represent invalid NestedText within the program, but this would involve checking contents of object keys, which would presumably have a performance hit.)

KenKundert commented 3 years ago

The spec says "Comments are lines that have # as the first non-space character on the line.", but is there any reason to disallow tabs before the # given the indentation is insignificant?

No, I have changed the line to refer to the "first non-white-space character on the line".

Indentation: "An increase in the number of spaces in the indentation signifies the start of a nested object. Indentation must return to a prior level when the nested object ends." - should this be more specific in saying that the indentation must return to exactly the depth of a previous indentation?

Seems like the same thing to me, and the former is more succinct.

Line endings: "lines are split by CR LF, CR, or LF... A single document may employ any or all of these ways of splitting lines." - his is very clear, and makes sense given the line ending on multiline strings is incorporated into the value. I know there's at least one testcase testing a CR line ending (this caught a bug in my implementation!), but I'm unsure if there's testing of all three possibilities (e.g. \r\n should be treated as a single line ending - relevant for reporting line number in errors).

I'll look and try to add something.

Dumping newline characters (dict_02) - the spec says nothing about how to dump data structures to NestedText, only how to parse a NestedText file, which makes sense. I'm not sure it should be an error to dump an object with a newline in the key.

Now, with multi-line keys, this is no longer an issue.

I'm not entirely clear on the need for disallowing root-level whitespace, in fact I can't see a mention of it in the spec, but the tests seem to enforce it (dict_05, list_5, string_multiline_07).

I have added a line to the language reference section indicating that the root level object must begin in column 1. We can always relax this if a strong need appears.

How should other unicode whitespace characters be treated? (other than space, tab, CR and LF)

Unicode white space and tabs are not allowed in indentation except on blank lines and comment lines. They also cannot be part of a tag. All white space is allowed anywhere else, though white space is ignored if it follows an embedded key or if it leads or tails strings in inline dictionaries and lists. Otherwise all white space, including Unicode white space, is retained unmodified.

KenKundert commented 3 years ago

string_8 tests dos line endings (CR LF) string_9 tests mac line endings (CR) all the other tests use unix line endings (LF)

LewisGaul commented 3 years ago

Thanks, all sounds good. I'll reassess where I'm at with the testsuite after implementing the new changes to the language spec :)

RE:

Unicode white space and tabs are not allowed in indentation except on blank lines and comment lines. They also cannot be part of a tag. All white space is allowed anywhere else, though white space is ignored if it follows an embedded key or if it leads or tails strings in inline dictionaries and lists. Otherwise all white space, including Unicode white space, is retained unmodified.

This sounds like the sensible choice, basically grouping 'special unicode whitespace' in with tabs. I'm not an expert on unicode, but I get the impression (e.g. from https://stackoverflow.com/a/14245311/5181656) that defining which characters consitute unicode whitespace may not be trivial (and may change over time as characters are added). I'm guessing when we're talking about 'unicode whitespace' we're referring to those in the 'Zs' category?

KenKundert commented 3 years ago

To be honest, I have not been focused on defining the white space character set. I just use what ever matches \s in a Python regex. The only time it really matters is in deciding what to ignore, and generally those should be spaces and tabs. I could tighten up the language by converting from \s to [␣→] so that I only ignore spaces and tabs. That would eliminate some uncertainty.

LewisGaul commented 3 years ago

Ok, I'm not too worried about it, just thought it might be good to have a well-defined policy on unicode whitespace in the spec. I think it probably makes sense to treat it the same way as tabs - I think the main place this would help people is by explicitly erroring on weird indentation containing whitespace that makes things look aligned differently to how they actually are.

Thinking about it, I think to handle unicode whitespace I'd have to parse files as unicode rather than just as bytes like I currently do... But I should probably be doing this anyway since the expected encoding is utf-8!

KenKundert / nestedtext_tests

Missing testcases, suggested improvements #3