Proposed changes to NestedText that are not backward compatible

KenKundert commented 3 years ago

We are considering deprecating quoted keys. This will be a change that is not backward compatible. You can see the discussion that triggered this decision here. To summarize, the feeling is that:

Quoted keys add considerable complexity to both the implementation and support (ex: they added considerable complexity to implementation of Vim syntax highlighting) that is not consistent with the design philosophy of NestedText.
The approach taken is unique and unfamiliar to everyone that encounters it.
Distinguishing the key from the value can be difficult in some cases.
The approach taken is not in keeping with the other concepts of NestedText.
Even with quoting, there are some strings that cannot be used as keys.
Quoted keys provide too little value given the above issues.

Eliminating quoted keys further limits the strings that can be used as keys. We are considering adding multi-line keys to replace quoted keys. It is felt that multi-line keys are more in keeping with the style of NestedText than quoted keys were, and they allow NestedText to accept any string as a key.

Multi-line keys are patterned after multi-line strings, except the string tag >␣ is replaced by the dict tag :␣ and a trailing indented value is required. For example:

: this is the first line of a multi-line key
: this is the second line
    > this is the value

This would be interpreted as:

{
    "this is the first line of a multi-line key\nthis is the second line": "this is the value"
}
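The mapping above can be reproduced with a short Python sketch. The function name `parse_flat_multiline_keys` is hypothetical, and the code covers only the narrow case shown here (top-level multi-line keys, each followed by an indented multi-line string). It is an illustration of the proposal, not the NestedText implementation.

```python
def parse_flat_multiline_keys(text):
    """Parse top-level ': ' key lines, each group followed by indented '> ' lines.

    Hypothetical helper for illustration only; nesting, lists and
    ordinary 'key: value' items are deliberately not supported.
    """
    result = {}
    key_lines, value_lines = [], []

    def flush():
        # Store the pending key/value pair; a key without a value is an error.
        if not key_lines:
            return
        if not value_lines:
            raise ValueError("multi-line key requires an indented value")
        result["\n".join(key_lines)] = "\n".join(value_lines)
        key_lines.clear()
        value_lines.clear()

    for raw in text.splitlines():
        stripped = raw.lstrip(" ")
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no data
        indented = len(stripped) < len(raw)
        tag, sep, rest = stripped[0], stripped[1:2], stripped[2:]
        if sep not in ("", " "):
            raise ValueError(f"unsupported line: {raw!r}")
        if tag == ":" and not indented:
            if value_lines:
                flush()  # a new key starts once the previous value has ended
            key_lines.append(rest)
        elif tag == ">" and indented and key_lines:
            value_lines.append(rest)
        else:
            raise ValueError(f"unsupported line: {raw!r}")
    flush()
    return result


example = (
    ": this is the first line of a multi-line key\n"
    ": this is the second line\n"
    "    > this is the value\n"
)
parsed = parse_flat_multiline_keys(example)
```

Because each key line carries its own tag, the key lines are simply joined with newlines, exactly as multi-line strings are.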

Multi-line keys are not expected to be commonly used, but they are being considered because they fit naturally in the language and they make NestedText completely general, meaning that with multi-line keys NestedText can handle any combination of lists, dictionaries, and strings, where the leaf values are all strings. We could not say that previously.

Comments?

LewisGaul commented 3 years ago

Thanks for raising this issue for discussion - I'd be interested to hear other people's thoughts on this.

Summarising my position that I think I've already made clear in comments of #21 and #22:

I'm happy to see quoted keys going away
I'm not personally convinced by the need for supporting whitespace/colons in keys
I think your proposed multiline key syntax has some nice properties (e.g. each line type still identifiable without the context of surrounding lines)
However it does seem strange that the key is appearing after a colon, making it appear like a value at first glance
I just looked up multiline keys in YAML and found that it is supported but uses a leading ? (according to a couple of stackoverflow answers at least, e.g. here). Maybe using ? instead of : to start lines intended as multiline keys would make more sense for consistency?

joshgoebel commented 3 years ago

Questions:

Is it a syntax error if a key is a single line? I would presume not since this can also be used for pure escaping, and one might wish to escape a single line?
If not then the following would be completely valid as well, yes?

: president
    : name 
        > Katheryn McDaniel
    : phone
        cell: 1-210-555-5297
        home: 1-210-555-8470
    : kids
        - Joanie
        - Terrance

KenKundert commented 3 years ago

Josh, a single line multi-line key is fine, so your example is perfectly valid. In fact, a multi-line key can be empty (this is the only way to get an empty key):

:
    > this value has an empty key

which becomes {"": "this value has an empty key"}.

asb commented 3 years ago

To share my 2 cents:

I think I have a similar point of view to @LewisGaul. Allowing a wider range of keys to be used sounds attractive, but I'm not sure the extra "oddness" of multiline keys is worth it. They're easily machine parseable, but I think potentially confusing to the human reader. Using a different prefix to : might be slightly better, but I'd still lean towards not adding it at all.

kalekundert commented 3 years ago

I'm late to this thread, but I thought it would be helpful to list some pros and cons (as I see them) of each proposed syntax, including one that @KenKundert and I have discussed but not mentioned yet:

: syntax:

Example:
```
: lorem
: ipsum
> dolor sit amet
```
Pros:
- The : character is clearly and consistently associated with dictionaries (like - with lists and > with strings).
Cons:
- The keys look like values at first glance.

< syntax (this is the one that hasn't been mentioned yet):

Example:
```
< lorem
< ipsum
> dolor sit amet
```
Pros:
- Symmetry with the multiline string syntax. You can also think of > as pointing to the right for values, and < pointing to the left for keys.
Cons:
- The keys look like strings at first glance. I actually think it's a bit hard to tell that there's a dictionary there at all.

? syntax:

Example:
```
? lorem
? ipsum
> dolor sit amet
```
Pros:
- Keys are not confused with values/strings.
- Precedent in YAML.
- A mnemonic could be that keys are for looking up values, which is kinda like asking a question, so it kinda makes sense to prefix keys with ?.
Cons:
- ? does not immediately suggest "dictionary" like : does (the above mnemonic notwithstanding).
- Feels a little wrong to add ? as a syntax character given that it will be so rarely used compared to all the other syntax characters.

No multiline keys:

Pros:
- No potential for confusion.
Cons:
- Some keys can't be represented.

I personally like the idea of adding a multiline key syntax, although I definitely appreciate the arguments against doing so. The ? syntax is growing on me more and more, too. I do think that we should at least make empty keys illegal, to keep open the possibility of adding the : syntax without breaking backwards compatibility again.

LewisGaul commented 3 years ago

Thanks for the summary, I'd agree with the pros/cons. I'm still -1 for multiline keys in general, looking at the options above.

I struggle to visually parse any of these as mappings, since I think YAML/JSON always require a colon separating the key and the value, which is missing from all of the above.

I think perhaps I'd be slightly less against it if the separating colon was added, e.g. (picking my preferred syntax from the options above):

? lorem
? ipsum
  : dolor sit amet

However, this doesn't play nice with values that aren't a simple single-line string:

? key with : unrestricted characters
  :
    > multi
    > line

I think maybe I still prefer this to a lack of requiring the colon though.

In fact, if the colon is still required, I guess in theory the keys could just be represented as regular multi-line strings (with >), but you'd be losing the "line type is context independent" invariant.

KenKundert commented 3 years ago

Okay, things seem like they have settled down. I have implemented some of the suggestions and checked them in. More work to be done, but just on completing these changes. No additional features are being considered.

The new version:

removes quoted keys
adds multi-line keys
adds inline lists and dictionaries

Details are in the documentation.

Example:

: key 1: the first key
    [[11, 12, 13], [21, 22, 23], [31, 32, 33]]

: key 2: the second key
    {alpha: α, beta: β, delta: δ, omega: ω, pi: π, tau: τ}

becomes

{
    "key 1: the first key": [
        ["11", "12", "13"],
        ["21", "22", "23"],
        ["31", "32", "33"]
    ],
    "key 2: the second key": {
        "alpha": "α",
        "beta": "β",
        "delta": "δ",
        "omega": "ω",
        "pi": "π",
        "tau": "τ"
    }
}

My expectation is that these new features will not be heavily used, but will be very helpful on occasion and help to complete the language. With these changes, NestedText becomes capable of handling any hierarchical combination of lists, dictionaries and strings.

Please try it out and give me your impressions.

-Ken

LewisGaul commented 3 years ago

Thanks for this work! Although it does add some complexity to the language (albeit removing the complexity of quoted keys) I think this is a positive set of changes, as you say giving more completeness to the language.

I'll have a go at implementing these changes in zig-nestedtext at some point (maybe next weekend) and provide any feedback I might have.

One minor point - shouldn't the new version be 1.4.0 rather than 1.3.2, given the backwards incompatibility and size of the changes?

KenKundert commented 3 years ago

My version numbers are interpreted as follows: <major>.<minor>.<patch>. Stable releases are those where <patch> is 0. Those get pushed on to pypi. Anything with a nonzero patch is considered tentative and subject to change. They are for developers and only found on GitHub. The major component of the version number advances with changes that are not backward compatible in some significant way. Since this version loses quoted keys and so represents a change that is not compatible with version 1.3, it will eventually become 2.0.0 rather than 1.4.0.
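The numbering convention described above can be written down as a small Python sketch. The names `Version`, `parse_version`, `is_stable` and `breaks_compatibility` are hypothetical and only restate the rules from this comment; they are not part of the NestedText package.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    # Expects exactly three dot-separated integers: <major>.<minor>.<patch>.
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_stable(version: Version) -> bool:
    # Stable releases are those where <patch> is 0.
    return version.patch == 0


def breaks_compatibility(old: Version, new: Version) -> bool:
    # The major component advances with backward-incompatible changes.
    return new.major > old.major
```

Under these rules 1.3.2 is a tentative development version, and dropping quoted keys moves the next stable release to 2.0.0.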

torresjrjr commented 3 years ago

I think the inline lists and dictionaries features add huge complexity to an already near-perfect format. I'm personally strongly against those features. It breaks expectation, which simply kills the format for me. If I wanted those features, I'd use yaml.

No-quoted-keys are a good idea. They vastly reduce the complexity.

As for the multiline keys, I don't find them a bad idea. However, they solve what I consider to be non-problems.

Currently, unquoted keys cannot contain colons, so multilined-keys are invented to allow colons and much more. But we already constrain keys with whitespace-trimming rules. I think it would be much more sensible to then also disallow ': ' and '\n' character sequences.

With NestedText, there is already a precedent that validation and schemas are left to the developer. I think setting the expectation for developers to remove ': ' and '\n' character sequences from their keys is better than allowing them. It's probably not good to encourage those character sequences in the first place. It could be considered bad practice. If they are necessary, there is the meta-solution of replacing them with literals like '\n\r\t\x0a\u0123'.

Note that ':' would still be allowed afaik, meaning you could use URLs as keys, like in json-ld.
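A minimal sketch of the meta-solution mentioned above, assuming the application (not NestedText) agrees on Python-style backslash escapes. The helper names `escape_key` and `unescape_key` are hypothetical. Only the ': ' sequence and newlines are targeted, so a bare ':' as found in URLs is left alone.

```python
def escape_key(key: str) -> str:
    # unicode_escape turns newlines, tabs and backslashes into literal
    # backslash escapes. Note that it also escapes all non-ASCII characters.
    escaped = key.encode("unicode_escape").decode("ascii")
    # A colon followed by a space would end the key, so write the colon as \x3a.
    return escaped.replace(": ", "\\x3a ")


def unescape_key(escaped: str) -> str:
    # Reverses escape_key; \x3a decodes back to ':'.
    return escaped.encode("ascii").decode("unicode_escape")


key = "key 1: the first key\nsecond line"
escaped = escape_key(key)
restored = unescape_key(escaped)
```

The cost of this approach is that every program reading the file has to know about the convention, which is the trade-off being weighed against multi-line keys in this thread.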

kalekundert commented 3 years ago

Would you mind elaborating on the complexity that inline lists and dicts would add? From my perspective, they add a moderate amount of complexity to the parser, but very little complexity for the end user. The syntax is intuitive (i.e. common to many programming languages) and maintains the property that the type of each line can be identified from its leading characters.
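The property that each line's type follows from its leading characters can be illustrated with a rough Python sketch. The function `classify_line` and its category names are hypothetical; it assumes the tags used in this thread (`-`, `>`, `:`, `[`, `{`) plus `#` for comments, and it does not check indentation or validate the rest of the line.

```python
def classify_line(line: str) -> str:
    """Return a rough line type based only on the line's own content."""
    content = line.lstrip(" ").rstrip()
    if not content:
        return "blank"
    if content.startswith("#"):
        return "comment"
    # A tag counts only when followed by a space or the end of the line.
    for tag, name in (("-", "list item"), (">", "string item"), (":", "key item")):
        if content == tag or content.startswith(tag + " "):
            return name
    if content.startswith("["):
        return "inline list"
    if content.startswith("{"):
        return "inline dict"
    # Anything else must contain the dict tag to be a 'key: value' item.
    if ": " in content or content.endswith(":"):
        return "dict item"
    return "unrecognized"


kinds = [
    classify_line(text)
    for text in (
        ": key 1: the first key",
        "    [[11, 12, 13], [21, 22, 23], [31, 32, 33]]",
        "    {alpha: α, beta: β}",
        "cell: 1-210-555-5297",
    )
]
```

No lookahead or knowledge of neighboring lines is needed, which is the invariant the inline syntax preserves.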

Regarding YAML, I think its problem is not that it supports inline lists/dicts, but that its rules for quoting and type-casting are complicated and frequently break expectations (to use your phrase). But I don't see how the proposed changes to NT would break expectations in a comparable way.

I do think that there are advantages and disadvantages to the inline syntax, I just think the former outweigh the latter (see #24 for more discussion):

Advantages:

Discourages the use of ad-hoc inline list/dict formats, like splitting a string value on commas (e.g. value.split(',')). These are bad for a bunch of reasons, including (i) the developer needs to anticipate the desire for inline syntax, (ii) the syntax between different fields may not be consistent, (iii) nesting is not trivial to support, etc.
Allows authors more control over the "density" of information in the file. Normal NT files tend to be long and narrow, which reduces the amount of information that can fit on the screen and makes it harder to grok the whole file. Inline data structures could certainly be abused to go too far in the other direction, but on balance I think they give authors a useful tool to improve readability.

Disadvantages:

Adds some complexity to the spec. I think the added complexity is minor, for the reasons given above, but I'd be happy to hear more about this.
Makes it more difficult to load a file and dump it unchanged. This is already made non-trivial by comments and blank lines, though, so the increase in difficulty is relatively small.
Conflicts visually with common templating languages. I saw this argument here, but I think it's a very minor consideration. You can always change the templating delimiters if it's too much of a problem.

Ambiguous:

Allows empty data structures to be represented. I'm on the fence about this one. On one hand, the inability to represent such structures is a conspicuous gap in the current spec (i.e. it's one of the most common questions we get). On the other hand, I'm sympathetic to the view that a schema should be used to interpret empty strings as empty list/dicts as appropriate (since that puts the least burden on the author to specify type). But I'm also sympathetic to the view that the author should be on the hook for specifying list/dict types, since those are the only types that NT actually understands.

LewisGaul commented 3 years ago

I've tried out the new multiline object key syntax, and have come up with some edge cases that make me question the re-use of :.

As a user, what would you read the following as?

A :
 : B:
: C :
D :
: E

I think this could be interpreted in quite a range of ways, but as I understand it this is actually {"A": {"B:": ""}, "C :": "", "D": "", "E": ""}

I'd still be in favour of using ? instead of : for multiline keys. Perhaps this could also be improved slightly by requiring the object value be given explicitly when using the multiline key syntax? An empty string can be represented using >:

A :
 : B:
  >
: C :
 >
D :
 >
: E

Also, there seems to be a bug(?) in the Python implementation that disallows starting the file with a multiline key, giving a "content must start with key" error.

Slight nitpick: the latest language reference has a section headed 'Inline Objects' that's talking about both inline objects and lists. I understand objects are being referred to as 'dictionaries' but this is quite a pythonism. E.g. my understanding is that JSON (JS Object Notation) refers to key-value pairs as 'objects'.

KenKundert commented 3 years ago

I was going to respond to this on Github, but I cannot find it. Did you delete your post?

There was not a lot of thought that went into picking '>' as the string tag. We considered '|' briefly and decided to go with '>'. There was not a good reason for choosing one versus the other. It seemed like '>' had more use as a quoted string character. As you say, it was heavily used as such in email messages. That led to syntax highlighting support in editors. So it seemed like the better choice. But fundamentally the choice was largely arbitrary. I think revisiting that choice would not be a good idea at this point.

-Ken

On Sat, May 01, 2021 at 02:04:55AM -0700, tototest99 wrote:

Hello and thank you very much for your work on NestedText as a human friendlier format. I’m very sorry to discover this project this late and to ask the kind of question I’m about to, on something which was probably settled a long time ago, but is there an archive of the thought process behind the choice of > as multiline string marker? Especially in comparison to |. My reasoning was that > is very much used in emails or forums where it marks a response, with the symbol being oriented laterally, while something like | especially when chained:
address:
    | 2586 Marigold Lane
    | Topeka, Kansas 20682
seems IMHO a more natural barrier forming symbol to mark a simple block, from a human cognition / reading point of view (and especially in a text editor with less vertical spacing between the |s). This is a very nitpicky point and I will probably adopt NestedText for a project or two, but you know, just in case things were not engraved in stone, I sputtered the idea. Thanks again for this project. PS: not needing to quote keys is a good idea.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/KenKundert/nestedtext/issues/23#issuecomment-830584998

tototest99 commented 3 years ago

Yes, as it was too unimportant.

nevertheless, thank you for your response!

KenKundert commented 3 years ago

Lewis, Thanks for the bug report. I have fixed the issue. I have also removed references to inline objects.

I think your point about using colon to identify two distinct situations being confusing in some cases is a good one that we did not consider. It is worth considering using a completely different character for multiline keys.

KenKundert commented 3 years ago

Lewis, I went back and read your message more closely and found I had missed an important point. Your example is not legal NestedText. Specifically:

A :
 : B:
: C :
D :
: E

is not allowed because all keys must be paired with values. In fact, this is the suggestion you made. My implementation was accepting multiline keys without value, which probably caused the confusion, but it is now fixed.

LewisGaul commented 3 years ago

Thanks for clarifying that, I was indeed misled by the Python implementation - I've fixed my Zig implementation now (thanks for the tests update).

I've given some thoughts to a few alternative syntax options and included some notes below. I think requiring the value after a multiline key makes it a lot clearer though, so don't have particularly strong feelings on the options below - just thought it might be useful listing out some possible options :)

On the question of the character to indicate key lines, maybe a fairer variant of the above to consider would be as follows (where all colons are syntax).

A :
 : B
  >
: C
 >
D :
: E
: F
 >

JSON: {"A": {"B": ""}, "C": "", "D": "", "E\nF": ""}

With question marks for object keys, for comparison:

A :
 ? B
  >
? C
 >
D :
? E
? F
 >

Also comparing an alternative I suggested briefly a while back, reusing regular multiline strings (where there is always exactly one colon separating the key and the value):

A :
 > B
  :
> C
 :
D :
> E
> F
 :

I just tried converting the above example to yaml (see below), and it does involve a question mark, but looks pretty odd, so either way I think NestedText will be doing better here!

A:
  B: ''
C: ''
D: ''
? 'E

  F'
: ''

Finally, comparing the three alternatives above with a more normal example (taken from the holistic_1 testcase)...

Status quo:

: - key1:
    > #value1:
:  #key2
    > multi
    > line

JSON: {"- key1:": "#value1:", " #key2": "multi\nline"}

Question marks for object keys:

? - key1:
    > #value1:
?  #key2
    > multi
    > line

Reusing multiline string syntax (showing the downside of requiring an extra level of indentation with just a colon sitting alone at an indent level!):

> - key1:
    : #value1:
>  #key2
    :
        > multi
        > line

KenKundert commented 3 years ago

Thanks for illustrating the alternatives. Currently we are expecting to choose between the leading colon or question mark to introduce key items. We are not considering your third alternative of reusing multiline strings for keys. I think it is likely that we will stay with the leading colon for multiline keys.

LewisGaul commented 3 years ago

Quick question on inline lists/objects: is [foo,] treated as ["foo"] or ["foo", ""]? I ask because the spec says

[,] is a list that contains a single empty string

but I would have read this as two items (separated by the comma). Allowing a trailing comma doesn't seem like a good fit when the inline structures must be on a single line. Is there really a need to allow specifying empty strings in inline structures given this confusion?

LewisGaul commented 3 years ago

I've now fully implemented inline lists and objects in zig-nestedtext, as well as multiline object keys (using a leading colon), which you can try out by downloading the nt-cli binary for converting to/from JSON at https://github.com/LewisGaul/zig-nestedtext/releases/tag/v0.2.0a.

The one deviation from the spec I currently have is that I disallow empty keys/values in inline objects/lists, and I'd like to see some discussion on this. My reasoning for this:

Doesn't make sense to allow and ignore a trailing comma when inline containers must be entirely on a single line.
In combination with the above, [,] should be a list of two empty strings, and of course [] an empty list, so there'd be no way to represent a list of one empty string, which would be odd.
The cleanest solution seems to be just to disallow empty keys/values. This doesn't restrict the values that can be represented since inline syntax is just a convenience (plus allowing representing empty containers).
I really wouldn't expect empty keys/values to be very common, especially in a container with other values (and in absence of other values the benefit of inline syntax is reduced anyway).

KenKundert commented 3 years ago

Empty inline strings must be followed by a comma to be recognized. For example, [] is an empty list and [,] is a list that contains a single empty string.

As you recognize, this is what allows us to distinguish between empty lists and those with a single empty string. Supporting empty lists and empty dictionaries is considered desirable because it provides a completeness to the language. With this it is now possible to represent any hierarchical combination of lists, dictionaries, and strings. Completeness increases ease of use because it eliminates exceptions that must be handled by user code that calls the dump functions.

It is unusual to use terminal commas on single line lists, but they are only required to identify empty terminal values. Other than that, while they may be unusual, they have become common on multiline lists and there is really no reason to outlaw them.

In my view, this behavior is a net positive.

LewisGaul commented 3 years ago

Supporting empty lists and empty dictionaries is considered desirable because it provides a completeness to the language.

Completely agree.

It is unusual to use terminal commas on single line lists, but they are only required to identify empty terminal values. Other than that, while they may be unusual, they have become common on multiline lists and there is really no reason to outlaw them.

This is where I disagree, from the perspective of ease of reading and understanding NestedText. A trailing comma seems completely unintuitive here to me, making it look like there's an empty string at the end after the comma (which there isn't). You say they've become common on multiline lists - I would agree, but strictly for multiline, not for single line (and in fact trailing commas are entirely disallowed in JSON).

Some examples:

[1, 2, , 4] -> ["1", "2", "", "4"]
[1, 2, 3, ] -> ["1", "2", "3"]
[1, 2, 3, ,] -> ["1", "2", "3", ""]
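The list examples above follow from one rule: split on commas, then ignore an empty item after the final comma. A minimal Python sketch of that rule for flat lists is shown below. The function name `parse_flat_inline_list` is hypothetical, and nested lists or dictionaries are not handled; this is not the reference implementation.

```python
def parse_flat_inline_list(text: str) -> list:
    """Parse a single-line list of strings such as '[a, b]' (no nesting)."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected an inline list such as [a, b]")
    inner = text[1:-1]
    if not inner.strip():
        return []  # [] is an empty list
    items = [item.strip() for item in inner.split(",")]
    if items[-1] == "":
        # Nothing follows the final comma, so it terminates the previous
        # item rather than introducing a new empty string.
        items.pop()
    return items


single_empty = parse_flat_inline_list("[,]")
trailing = parse_flat_inline_list("[foo,]")
```

Under this rule an empty string is recognized only when a comma follows it, which is how [] and [,] are told apart.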

In a way it's not as bad for objects since the colon is required as well as the comma, although it's then unclear whether a trailing comma should be needed if the last item contains an empty value... {a: 1, :, c:, :4, :} -> {"a": "1", "": "", "c": "", "": "4", "": ""}

That :4 actually reminds me of the multiline object key syntax, which makes it look a bit like a lone key to me...

I'm really not seeing the argument for allowing this though, seeing as it wouldn't be restricting the language at all to disallow. Inline object/list syntax is already one of the most complicated bits of the NestedText syntax, and allowing empty values just adds to potential mental overhead in trying to read NT files.

I also just noticed that it doesn't seem to be possible to use inline syntax at the root level in your Python implementation, is that intentional?

KenKundert commented 3 years ago

Once you allow terminal commas on multiline lists, it would be inconsistent to not allow them on single line lists. For example, Python allows lists, tuples, argument lists, etc to have terminal commas in both cases. Specifically, [1,2,3,] is a valid alternative to [1,2,3].

I don't think terminal commas impose a significant mental load on the user. One always needs to become familiar with the ways that a language works. This detail is one that most users will never encounter, and when they do, it is not hard to grasp; nor is it outside the norm. The only thing about NestedText that is different from other languages that allow terminal commas is that values can be completely empty. So it is a little different from Python, but not in a way that is artificial or confusing. I was able to explain it in one sentence in the documentation, and you easily recognized the problem it was designed to address.

When I said that it is increasingly common for languages to support terminal commas, I was extrapolating from the fact that Python has supported them for a very long time and the fact that a lot of people complain that JSON does not. So I assumed that they were common in other languages, but I don't actually know that to be true. Actually, Python has a very similar situation and rule for tuples (tuples are immutable lists). An empty tuple is represented with (), a tuple with one element is represented with (a,), a tuple with two elements is represented with (a,b) or (a,b,). So, terminal commas are accepted for all non-empty tuples, and required for one element tuples. Single element tuples are treated specially because (a) is a valid expression that evaluates to a; the comma is needed to distinguish the expression from the tuple.
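The Python behavior referred to above can be checked directly. This is plain Python with no NestedText involved; the variable names are only for illustration.

```python
empty = ()                            # empty tuple
single = ("a",)                       # one element: the comma is required
pair = ("a", "b")
pair_with_terminal_comma = ("a", "b",)
parenthesized = ("a")                 # no comma: just the string "a"

terminal_list = [1, 2, 3,]            # a terminal comma is also accepted in lists

same_pair = pair == pair_with_terminal_comma
same_list = terminal_list == [1, 2, 3]
```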

The Python implementation allows inline lists and dictionaries at the top level. Be sure to specify top=any to the argument list to load or loads, otherwise the top-level is restricted to dictionaries.

LewisGaul commented 3 years ago

Once you allow terminal commas on multiline lists, it would be inconsistent to not allow them on single line lists.

Sure, but NestedText doesn't allow multiline lists.

The only thing about NestedText that is different from other languages that allow terminal commas is that values can be completely empty.

Thinking about it, I'm against empty values being allowed in the inline list/object syntax independent of this discussion about trailing commas (although if you remove empty values, you remove any need for allowing trailing commas). As you point out, there may be precedent for trailing commas (albeit as an extension of multiline structures), but I'm not aware of precedent for empty values in this kind of syntax - even YAML disallows this! Things like {:[,,],} just look a bit odd to me.

I was able to explain it in one sentence in the documentation, and you easily recognized the problem it was designed to address.

It's not hard to explain, it just feels like a gotcha that needs explaining, rather than being obvious at first sight.

The last thing I have to say about this is that it would be much easier to later add support for empty values if a need arises than to deprecate support for empty values if it turns out to be a disliked feature. I would very much like to hear other people's input on this, and have slight concern about the possible addition of unnecessary complexity/ambiguity when reading this data format.

KenKundert commented 3 years ago

It's been almost three weeks since the last comment, so I think it is time to make the call. I have been living with the proposed changes as implemented in the current version on github and am comfortable with that version. Compared to version 1.3 it:

removes quoted keys
adds multiline keys
adds single line lists and dictionaries

It also uses the convention in single line lists that an empty string must be followed by a comma to be recognized.

Unless I hear any final comments, I will be re-releasing the current version as the next stable release, version 2.0, in the next few days.

KenKundert commented 3 years ago

Version 2.0 has been released.
