KenKundert / nestedtext

Human readable and writable data interchange format
https://nestedtext.org
MIT License

Proposed changes to NestedText that are not backward compatible #23

Closed KenKundert closed 3 years ago

KenKundert commented 3 years ago

We are considering deprecating quoted keys. This will be a change that is not backward compatible. You can see the discussion that triggered this decision here. To summarize, the feeling is that:

  1. Quoted keys add considerable complexity to both the implementation and support (ex: they added considerable complexity to implementation of Vim syntax highlighting) that is not consistent with the design philosophy of NestedText.
  2. The approach taken is unique and unfamiliar to everyone who encounters it.
  3. Distinguishing the key from the value can be difficult in some cases.
  4. The approach taken is not in keeping with the other concepts of NestedText.
  5. Even with quoting, there are some strings that cannot be used as keys.
  6. Quoted keys provide too little value given the above issues.

Eliminating quoted keys further limits the strings that can be used as keys. We are considering adding multi-line keys to replace quoted keys. It is felt that multi-line keys are more in keeping with the style of NestedText than quoted keys were, and they allow NestedText to accept any string as a key.

Multi-line keys are patterned after multi-line strings, except the string tag >␣ is replaced by the dict tag :␣ and a trailing indented value is required. For example:

: this is the first line of a multi-line key
: this is the second line
    > this is the value

This would be interpreted as:

{
    "this is the first line of a multi-line key\nthis is the second line": "this is the value"
}

Multi-line keys are not expected to be commonly used, but they are being considered because they fit naturally in the language and they make NestedText completely general, meaning that with multi-line keys NestedText can handle any combination of lists, dictionaries, and strings, where the leaf values are all strings. We could not say that previously.
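As a rough illustration of the proposal above, the following Python sketch parses a single top-level item made of a multi-line key and a multi-line string value. It is not the NestedText implementation: the function name parse_multiline_key_item is hypothetical, and comments, nesting, and other value types are deliberately ignored.

```python
def parse_multiline_key_item(text):
    """Parse one multi-line key followed by an indented multi-line string.

    Illustrative sketch only; it does not handle full NestedText.
    """
    key_lines, value_lines = [], []
    for line in text.splitlines():
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        if not stripped:
            continue  # skip blank lines
        # The tag is the first token; the rest of the line is the content.
        tag, _, rest = stripped.partition(" ")
        if indent == 0 and tag == ":":
            if value_lines:
                raise ValueError("key line found after the value started")
            key_lines.append(rest)
        elif indent > 0 and tag == ">" and key_lines:
            value_lines.append(rest)
        else:
            raise ValueError(f"unexpected line: {line!r}")
    if not key_lines or not value_lines:
        raise ValueError("a multi-line key needs a trailing indented value")
    # Lines of the key (and of the value) are joined with newlines.
    return {"\n".join(key_lines): "\n".join(value_lines)}


example = (
    ": this is the first line of a multi-line key\n"
    ": this is the second line\n"
    "    > this is the value\n"
)
print(parse_multiline_key_item(example))
```

The sketch raises an error when no indented value follows the key lines, which mirrors the "trailing indented value is required" part of the proposal.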

Comments?

LewisGaul commented 3 years ago

Thanks for raising this issue for discussion - I'd be interested to hear other people's thoughts on this.

Summarising my position that I think I've already made clear in comments of #21 and #22:

joshgoebel commented 3 years ago

Questions:

: president
    : name 
        > Katheryn McDaniel
    : phone
        cell: 1-210-555-5297
        home: 1-210-555-8470
    : kids
        - Joanie
        - Terrance

KenKundert commented 3 years ago

Josh, a single line multi-line key is fine, so your example is perfectly valid. In fact, a multi-line key can be empty (this is the only way to get an empty key):

:
    > this value has an empty key

which becomes {"": "this value has an empty key"}.

asb commented 3 years ago

To share my 2 cents:

I think I have a similar point of view to @LewisGaul. Allowing a wider range of keys to be used sounds attractive, but I'm not sure the extra "oddness" of multiline keys is worth it. They're easily machine parseable, but I think potentially confusing to the human reader. Using a different prefix to : might be slightly better, but I'd still lean towards not adding it at all.

kalekundert commented 3 years ago

I'm late to this thread, but I thought it would be helpful to list some pros and cons (as I see them) of each proposed syntax, including one that @KenKundert and I have discussed but not mentioned yet:

I personally like the idea of adding a multiline key syntax, although I definitely appreciate the arguments against doing so. The ? syntax is growing on me more and more, too. I do think that we should at least make empty keys illegal, to keep open the possibility of adding the : syntax without breaking backwards compatibility again.

LewisGaul commented 3 years ago

Thanks for the summary, I'd agree with the pros/cons. I'm still -1 for multiline keys in general, looking at the options above.

I struggle to visually parse any of these as mappings, since I think YAML/JSON always require a colon separating the key and the value, which is missing from all of the above.

I think perhaps I'd be slightly less against it if the separating colon was added, e.g. (picking my preferred syntax from the options above):

? lorem
? ipsum
  : dolor sit amet

However, this doesn't play nice with values that aren't a simple single-line string:

? key with : unrestricted characters
  :
    > multi
    > line

I think maybe I still prefer this to a lack of requiring the colon though.

In fact, if the colon is still required, I guess in theory the keys could just be represented as regular multi-line strings (with >), but you'd be losing the "line type is context independent" invariant.

KenKundert commented 3 years ago

Okay, things seem like they have settled down. I have implemented some of the suggestions and checked them in. More work to be done, but just on completing these changes. No additional features are being considered.

The new version:

Details are in the documentation.

Example:

: key 1: the first key
    [[11, 12, 13], [21, 22, 23], [31, 32, 33]]

: key 2: the second key
    {alpha: α, beta: β, delta: δ, omega: ω, pi: π, tau: τ}

becomes

{
    "key 1: the first key": [
        ["11", "12", "13"],
        ["21", "22", "23"],
        ["31", "32", "33"]
    ],
    "key 2: the second key": {
        "alpha": "α",
        "beta": "β",
        "delta": "δ",
        "omega": "ω",
        "pi": "π",
        "tau": "τ"
    }
}

My expectation is that these new features will not be heavily used, but will be very helpful on occasion and help to complete the language. With these changes, NestedText becomes capable of handling any hierarchical combination of lists, dictionaries and strings.

Please try it out and give me your impressions.

-Ken

LewisGaul commented 3 years ago

Thanks for this work! Although it does add some complexity to the language (albeit removing the complexity of quoted keys) I think this is a positive set of changes, as you say giving more completeness to the language.

I'll have a go at implementing these changes in zig-nestedtext at some point (maybe next weekend) and provide any feedback I might have.

One minor point - shouldn't the new version be 1.4.0 rather than 1.3.2, given the backwards incompatibility and size of the changes?

KenKundert commented 3 years ago

My version numbers are interpreted as follows: <major>.<minor>.<patch>. Stable releases are those where <patch> is 0. Those get pushed to PyPI. Anything with a nonzero patch is considered tentative and subject to change. They are for developers and only found on GitHub. The major component of the version number advances with changes that are not backward compatible in some significant way. Since this version loses quoted keys and so represents a change that is not compatible with version 1.3, it will eventually become 2.0.0 rather than 1.4.0.
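The numbering scheme described above can be sketched in a few lines of Python. The helper names (parse_version, is_stable, next_stable) are hypothetical, and the rule that a compatible stable release advances the minor component is an assumption inferred from the "rather than 1.4.0" remark, not something stated outright.

```python
def parse_version(version):
    """Split '<major>.<minor>.<patch>' into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_stable(version):
    """Stable releases are those whose patch component is 0."""
    return parse_version(version)[2] == 0


def next_stable(version, breaking):
    """Return the stable release that would follow `version`.

    Assumption: a backward-incompatible change advances the major
    component, anything else advances the minor component.
    """
    major, minor, _ = parse_version(version)
    if breaking:
        return f"{major + 1}.0.0"
    return f"{major}.{minor + 1}.0"


print(is_stable("1.3.2"))                   # development snapshot
print(next_stable("1.3.2", breaking=True))  # quoted keys removed
```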

torresjrjr commented 3 years ago

I think the inline lists and dictionaries features add huge complexity to an already near-perfect format. I'm personally strongly against those features. They break expectations, which simply kills the format for me. If I wanted those features, I'd use yaml.

No-quoted-keys are a good idea. They vastly reduce the complexity.

As for the multiline keys, I don't find them a bad idea. However, they solve what I consider are non-problems.

Currently, unquoted keys cannot contain colons, so multilined-keys are invented to allow colons and much more. But we already constrain keys with whitespace-trimming rules. I think it would be much more sensible to then also disallow ': ' and '\n' character sequences.

With NestedText, there is already a precedent that validation and schemas are left to the developer. I think setting the expectation for developers to remove ': ' and '\n' character sequences from their keys is better than allowing them. It's probably not good to encourage those character sequences in the first place. It could be considered bad practice. If they are necessary, there is the meta-solution of replacing them with literals like '\n\r\t\x0a\u0123'.
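One way to read the "meta-solution" above is that the application escapes problematic sequences before writing keys and reverses the escaping after reading them. The sketch below is only an assumption about how that might look: the function names and the particular escape sequences (a literal backslash-n for a newline, a literal backslash-x3a for the colon in ': ') are hypothetical and not part of NestedText.

```python
def escape_key(key):
    """Make a key free of newlines and ': ' sequences, reversibly."""
    # Backslashes are doubled first so that unescaping is unambiguous.
    return (
        key.replace("\\", "\\\\")
        .replace("\n", "\\n")
        .replace(": ", "\\x3a ")
    )


def unescape_key(text):
    """Reverse escape_key by scanning for backslash sequences."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            if text[i + 1] == "\\":
                out.append("\\")
                i += 2
                continue
            if text[i + 1] == "n":
                out.append("\n")
                i += 2
                continue
            if text.startswith("x3a", i + 1):
                out.append(":")
                i += 4
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


escaped = escape_key("key 1: the first key\nsecond line")
print(escaped)
print(unescape_key(escaped) == "key 1: the first key\nsecond line")
```

This only addresses the two sequences named in the comment; other key restrictions (such as leading whitespace) would still need separate handling.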

Note that ':' would still be allowed afaik, meaning you could use URLs as keys, like in json-ld.

kalekundert commented 3 years ago

Would you mind elaborating on the complexity that inline lists and dicts would add? From my perspective, they add a moderate amount of complexity to the parser, but very little complexity for the end user. The syntax is intuitive (i.e. common to many programming languages) and maintains the property that the type of each line can be identified from its leading characters.

Regarding YAML, I think its problem is not that it supports inline lists/dicts, but that its rules for quoting and type-casting are complicated and frequently break expectations (to use your phrase). But I don't see how the proposed changes to NT would break expectations in a comparable way.

I do think that there are advantages and disadvantages to the inline syntax, I just think the former outweigh the latter (see #24 for more discussion):

Advantages:

Disadvantages:

Ambiguous:

LewisGaul commented 3 years ago

I've tried out the new multiline object key syntax, and have come up with some edge cases that make me question the re-use of :.

As a user, what would you read the following as?

A :
 : B:
: C :
D :
: E

I think this could be interpreted in quite a range of ways, but as I understand it this is actually {"A": {"B:": ""}, "C :": "", "D": "", "E": ""}

I'd still be in favour of using ? instead of : for multiline keys. Perhaps this could also be improved slightly by requiring the object value be given explicitly when using the multiline key syntax? An empty string can be represented using >:

A :
 : B:
  >
: C :
 >
D :
 >
: E

Also, there seems to be a bug(?) in the Python implementation that disallows starting the file with a multiline key, giving a "content must start with key" error.

Slight nitpick: the latest language reference has a section headed 'Inline Objects' that's talking about both inline objects and lists. I understand objects are being referred to as 'dictionaries' but this is quite a pythonism. E.g. my understanding is that JSON (JS Object Notation) refers to key-value pairs as 'objects'.

KenKundert commented 3 years ago

I was going to respond to this on GitHub, but I cannot find it. Did you delete your post?

There was not a lot of thought that went into picking '>' as the string tag. We considered '|' briefly and decided to go with '>'. There was not a good reason for choosing one versus the other. It seemed like '>' had more use as a quoted string character. As you say, it was heavily used as such in email messages. That led to syntax highlighting support in editors. So it seemed like the better choice. But fundamentally the choice was largely arbitrary. I think revisiting that choice would not be a good idea at this point.

-Ken

On Sat, May 01, 2021 at 02:04:55AM -0700, tototest99 wrote:

Hello and thank you very much for your work on NestedText as a human friendlier format. I’m very sorry to discover this project this late and to ask the kind of question I’m about to, on something which was probably settled long ago, but is there an archive of the thought process behind the choice of > as multiline string marker? Especially in comparison to |. My reasoning was that > is very much used in emails or forums where it marks a response, with the symbol being oriented laterally, while something like | especially when chained:

address:
    | 2586 Marigold Lane
    | Topeka, Kansas 20682

seems IMHO a more natural barrier forming symbol to mark a simple block, from a human cognition / reading point of view (and especially in a text editor with less vertical spacing between the |s). This is a very nitpicky point and I will probably adopt NestedText for a project or two, but you know, just in case things were not engraved in stone, I sputtered the idea. Thanks again for this project. PS: not needing to quote keys is a good idea.


tototest99 commented 3 years ago

Yes, as it was too unimportant.

nevertheless, thank you for your response!

KenKundert commented 3 years ago

Lewis, Thanks for the bug report. I have fixed the issue. I have also removed references to inline objects.

I think your point about using colon to identify two distinct situations being confusing in some cases is a good one that we did not consider. It is worth considering using a completely different character for multiline keys.

KenKundert commented 3 years ago

Lewis, I went back and read your message more closely and found I had missed an important point. Your example is not legal NestedText. Specifically:

A :
 : B:
: C :
D :
: E

is not allowed because all keys must be paired with values. In fact, this is the suggestion you made. My implementation was accepting multiline keys without value, which probably caused the confusion, but it is now fixed.

LewisGaul commented 3 years ago

Thanks for clarifying that, I was indeed misled by the Python implementation - I've fixed my Zig implementation now (thanks for the tests update).

I've given some thoughts to a few alternative syntax options and included some notes below. I think requiring the value after a multiline key makes it a lot clearer though, so don't have particularly strong feelings on the options below - just thought it might be useful listing out some possible options :)


On the question of the character to indicate key lines, maybe a fairer variant of the above to consider would be as follows (where all colons are syntax).

A :
 : B
  >
: C
 >
D :
: E
: F
 >

JSON: {"A": {"B": ""}, "C": "", "D": "", "E\nF": ""}

With question marks for object keys, for comparison:

A :
 ? B
  >
? C
 >
D :
? E
? F
 >

Also comparing an alternative I suggested briefly a while back, reusing regular multiline strings (where there is always exactly one colon separating the key and the value):

A :
 > B
  :
> C
 :
D :
> E
> F
 :

I just tried converting the above example to yaml (see below), and it does involve a question mark, but looks pretty odd, so either way I think NestedText will be doing better here!

A:
  B: ''
C: ''
D: ''
? 'E

  F'
: ''

Finally, comparing the three alternatives above with a more normal example (taken from the holistic_1 testcase)...

Status quo:

: - key1:
    > #value1:
:  #key2
    > multi
    > line

JSON: {"- key1:": "#value1:", " #key2": "multi\nline"}

Question marks for object keys:

? - key1:
    > #value1:
?  #key2
    > multi
    > line

Reusing multiline string syntax (showing the downside of requiring an extra level of indentation with just a colon sitting alone at an indent level!):

> - key1:
    : #value1:
>  #key2
    :
        > multi
        > line

KenKundert commented 3 years ago

Thanks for illustrating the alternatives. Currently we are expecting to choose between the leading colon or question mark to introduce key items. We are not considering your third alternative of reusing multiline strings for keys. I think it is likely that we will stay with the leading colon for multiline keys.

LewisGaul commented 3 years ago

Quick question on inline lists/objects: is [foo,] treated as ["foo"] or ["foo", ""]? I ask because the spec says

[,] is a list that contains a single empty string

but I would have read this as two items (separated by the comma). Allowing a trailing comma doesn't seem like a good fit when the inline structures must be on a single line. Is there really a need to allow specifying empty strings in inline structures given this confusion?

LewisGaul commented 3 years ago

I've now fully implemented inline lists and objects in zig-nestedtext, as well as multiline object keys (using a leading colon), which you can try out by downloading the nt-cli binary for converting to/from JSON at https://github.com/LewisGaul/zig-nestedtext/releases/tag/v0.2.0a.

The one deviation from the spec I currently have is that I disallow empty keys/values in inline objects/lists, and I'd like to see some discussion on this. My reasoning for this:

KenKundert commented 3 years ago

Empty inline strings must be followed by a comma to be recognized. For example, [] is an empty list and [,] is a list that contains a single empty string.

As you recognize, this is what allows us to distinguish between empty lists and those with a single empty string. Supporting empty lists and empty dictionaries is considered desirable because it provides a completeness to the language. With this it is now possible to represent any hierarchical combination of lists, dictionaries, and strings. Completeness increases ease of use because it eliminates exceptions that must be handled by user code that calls the dump functions.

It is unusual to use terminal commas on single line lists, but they are only required to identify empty terminal values. Other than that, while they may be unusual, they have become common on multiline lists and there is really no reason to outlaw them.

In my view, this behavior is a net positive.
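The rule stated above (an empty inline string counts only when a comma follows it) can be sketched for the simplest case of a flat inline list. This is an illustrative assumption about the behaviour, not the NestedText parser: the function name is hypothetical and nested lists or dictionaries are not handled.

```python
def parse_flat_inline_list(text):
    """Parse a one-line list such as '[a, b, c]' into a list of strings.

    Sketch only: no nesting and no inline dictionaries.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected an inline list delimited by [ and ]")
    items = [item.strip() for item in text[1:-1].split(",")]
    # The piece after the final comma is dropped when it is empty, so an
    # empty string is recognized only if a comma follows it.
    if items[-1] == "":
        items.pop()
    return items


for source in ["[]", "[,]", "[foo,]", "[foo, bar]"]:
    print(source, "->", parse_flat_inline_list(source))
```

Under this reading, a trailing comma never adds an item by itself; it only terminates whatever value, possibly empty, precedes it.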

LewisGaul commented 3 years ago

Supporting empty lists and empty dictionaries is considered desirable because it provides a completeness to the language.

Completely agree.

It is unusual to use terminal commas on single line lists, but they are only required to identify empty terminal values. Other than that, while they may be unusual, they have become common on multiline lists and there is really no reason to outlaw them.

This is where I disagree, from the perspective of ease of reading and understanding NestedText. A trailing comma seems completely unintuitive here to me, making it look like there's an empty string at the end after the comma (which there isn't). You say they've become common on multiline lists - I would agree, but strictly for multiline, not for single line (and in fact trailing commas are entirely disallowed in JSON).

Some examples:

[1, 2, , 4] -> ["1", "2", "", "4"]
[1, 2, 3, ] -> ["1", "2", "3"]
[1, 2, 3, ,] -> ["1", "2", "3", ""]

In a way it's not as bad for objects since the colon is required as well as the comma, although it's then unclear whether a trailing comma should be needed if the last item contains an empty value... {a: 1, :, c:, :4, :} -> {"a": "1", "": "", "c": "", "": "4", "": ""}

That :4 actually reminds me of the multiline object key syntax, which makes it look a bit like a lone key to me...

I'm really not seeing the argument for allowing this though, seeing as it wouldn't be restricting the language at all to disallow. Inline object/list syntax is already one of the most complicated bits of the NestedText syntax, and allowing empty values just adds to potential mental overhead in trying to read NT files.

I also just noticed that it doesn't seem to be possible to use inline syntax at the root level in your Python implementation, is that intentional?

KenKundert commented 3 years ago

Once you allow terminal commas on multiline lists, it would be inconsistent to not allow them on single line lists. For example, Python allows lists, tuples, argument lists, etc to have terminal commas in both cases. Specifically, [1,2,3,] is a valid alternative to [1,2,3].

I don't think terminal commas impose a significant mental load on the user. One always needs to become familiar with the ways that a language works. This detail is one that most users will never encounter, and when they do, it is not hard to grasp; nor is it outside the norm. The only thing about NestedText that is different from other languages that allow terminal commas is that values can be completely empty. So it is a little different from Python, but not in a way that is artificial or confusing. I was able to explain it in one sentence in the documentation, and you easily recognized the problem it was designed to address.

When I said that it is increasingly common for languages to support terminal commas, I was extrapolating from the fact that Python has supported them for a very long time and the fact that a lot of people complain that JSON does not. So I assumed that they were common in other languages, but I don't actually know that to be true. Actually, Python has a very similar situation and rule for tuples (tuples are immutable lists). An empty tuple is represented with (), a tuple with one element is represented with (a,), a tuple with two elements is represented with (a,b) or (a,b,). So, terminal commas are accepted for all non-empty tuples, and required for one-element tuples. Single element tuples are treated specially because (a) is a valid expression that evaluates to a; the comma is needed to distinguish the expression from the tuple.
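The Python behaviour referred to here can be checked with the standard library alone. One detail worth noting is that Python does not accept a lone comma inside parentheses, so the empty tuple can only be written as (). The helper names below are hypothetical.

```python
import ast


def parses(source):
    """Return True if `source` is a syntactically valid Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


def value_of(source):
    """Evaluate a literal expression without running arbitrary code."""
    return ast.literal_eval(source)


print(value_of("()"))       # the empty tuple
print(value_of("(1,)"))     # one-element tuple; the comma is required
print(value_of("(1)"))      # a parenthesized integer, not a tuple
print(value_of("(1, 2,)"))  # trailing comma accepted
print(value_of("[1,2,3,]") == value_of("[1,2,3]"))
print(parses("(,)"))        # a lone comma is a syntax error
```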

The Python implementation allows inline lists and dictionaries at the top level. Be sure to specify top=any to the argument list to load or loads, otherwise the top-level is restricted to dictionaries.

LewisGaul commented 3 years ago

Once you allow terminal commas on multiline lists, it would be inconsistent to not allow them on single line lists.

Sure, but NestedText doesn't allow multiline lists.

The only thing about NestedText that is different from other languages that allow terminal commas is that values can be completely empty.

Thinking about it, I'm against empty values being allowed in the inline list/object syntax independent of this discussion about trailing commas (although if you remove empty values you also remove any need for allowing trailing commas). As you point out, there may be precedent for trailing commas (albeit as an extension of multiline structures), but I'm not aware of precedent for empty values in this kind of syntax - even YAML disallows this! Things like {:[,,],} just look a bit odd to me.

I was able to explain it in one sentence in the documentation, and you easily recognized the problem it was designed to address.

It's not hard to explain, it just feels like a gotcha that needs explaining, rather than being obvious at first sight.

The last thing I have to say about this is that it would be much easier to later add support for empty values if a need arises than to deprecate support for empty values if it turns out to be a disliked feature. I would very much like to hear other people's input on this, and have slight concern about the possible addition of unnecessary complexity/ambiguity when reading this data format.

KenKundert commented 3 years ago

It's been almost three weeks since the last comment, so I think it is time to make the call. I have been living with the proposed changes as implemented in the current version on GitHub and am comfortable with that version. Compared to version 1.3 it:

  1. removes quoted keys
  2. adds multiline keys
  3. adds single line lists and dictionaries

It also uses the convention in single line lists that an empty string must be followed by a comma to be recognized.

Unless I hear any final comments, I will be re-releasing the current version as the next stable release, version 2.0, in the next few days.

KenKundert commented 3 years ago

Version 2.0 has been released.