KenKundert / nestedtext

Human readable and writable data interchange format
https://nestedtext.org
MIT License

Zig implementation progress/thoughts #21

Closed LewisGaul closed 3 years ago

LewisGaul commented 3 years ago

I'll comment out the pointer to zig-nestedtext for now. Let me know when it is ready.

Originally posted by @KenKundert in https://github.com/KenKundert/nestedtext/issues/20#issuecomment-787214182

@KenKundert zig-nestedtext is looking pretty good now, I think it's handling everything in the spec except for quotes around object keys (and the error reporting is extremely minimal). My next job will be to hook it up to your suite of testcases to iron out these remaining bits.

The main aspect of the language spec I'm questioning at this point is the fact that keys may be quoted - is there really a need to allow keys to start with - or >, or contain : or whitespace? This feels like unnecessary complexity for a language spec that strives to be simple, and indeed leads to more than two thirds of the text in the file format spec for dictionary lines:

The key must be quoted if it:

  • starts with a list-item or string-item tag,
  • contains a dict-item tag,
  • starts with a quote character, or
  • has leading or trailing spaces or tabs.

A key is quoted by delimiting it with matching single or double quote characters, which are discarded. Unlike traditional programming languages, a quoted key delimited with single quote characters may contain additional single quote characters. Similarly, a quoted key delimited with double quote characters may contain additional double quote characters. Also, backslash is not used as an escape character; backslash has no special meaning anywhere in NestedText.

A quoted key starts with the leading quote character and ends when the matching quote character is found along with a trailing colon (there may be white space between the closing quote and the colon). A key is invalid if it contains two or more instances of a quote character separated from :␣ by zero or more space characters where the quote character in one is a single quote and the quote character in another is the double quote. In this case the key cannot be quoted with either character so that the separator from the key and value can be identified unambiguously.

Here's an example of it working (modulo spaces in object keys and object field ordering):

$ ./zig-cache/bin/nt-cli -f samples/employees.nt | jq
{
  "treasurer": [
    {
      "email": "fumiko.purvis@hotmail.com",
      "name": "Fumiko Purvis",
      "address": "3636 Buffalo Ave\nTopeka, Kansas 20692\n",
      "phone": "1-268-555-0280",
      "additional-roles": [
        "accounting task force"
      ]
    },
    {
      "email": "merrill.eldridge@yahoo.com",
      "name": "Merrill Eldridge",
      "phone": "1-268-555-3602"
    }
  ],
  "vice-president": {
    "email": "margaret.hodge@ku.edu",
    "name": "Margaret Hodge",
    "address": "2586 Marigold Lane\nTopeka, Kansas 20682\n",
    "phone": "1-470-555-0398",
    "additional-roles": [
      "new membership task force",
      "accounting task force"
    ]
  },
  "president": {
    "email": "KateMcD@aol.com",
    "name": "Katheryn McDaniel",
    "address": "138 Almond Street\nTopeka, Kansas 20697\n",
    "phone": {
      "cell": "1-210-555-5297",
      "home": "1-210-555-8470"
    },
    "additional-roles": [
      "board member"
    ]
  }
}
KenKundert commented 3 years ago

There are two conflicting desires with NestedText. First, as you point out, is a desire to make the spec very simple and easy to understand and learn. Second is the desire to be general, specifically to accept and support all possible strings as keys and values. There is a tension between these two goals, and NestedText is not perfect on either. NestedText will accept any text string as a value, but there are some limitations on keys. And, as you point out, we could simplify the specification if more restrictions were placed on the keys.

I chose the current approach because it seemed like a good balance. It could handle almost any text string as a key. While it may be a little complicated to unambiguously specify exactly when a key must be quoted, I think most people will not need to internalize the specifics. Instead, I think most people will follow this basic rule:

If a key contains characters that could create ambiguities, quote it.

Concerning whether the need for quoted keys ever comes up, I have not had the need to use them yet in my use of NestedText, but I have come very close. I routinely have spaces and special characters like commas in my keys. But I do have an application for which colons appear in the keys. So far I have avoided the need for quoting because the colon need not have a trailing space, but the application allows it, so if NestedText did not allow quoted keys, then restrictions in NestedText would end up constraining the application.

Most users won't need quoted keys, and so will likely never bother to learn about them. So, I don't feel like quoted keys add significant complexity for the average user, but if needed, they do allow for many keys that would not be allowed otherwise.

LewisGaul commented 3 years ago

I hear what you're saying, and perhaps we have different use-cases in mind, because personally I see no real issue with keys having restrictions (being like variable names in programming languages such as Python, although slightly more lenient). Do you have examples of when you'd really want spaces/colons in keys?

While it may be a little complicated to unambiguously specify exactly when a key must be quoted, I think most people will not need to internalize the specifics.

Yes, agreed, but the people who will need to internalise the specifics are those implementing a parser (e.g. me!). I was primarily intending to highlight how much it complicates the spec, introduces edge-cases where there are otherwise very few, and adds complexity to any implementations (giving more room for bugs/deviations from the spec).

I'd also like to highlight how hard it is to visually parse the current syntax:

:\t::     
" bar ": ": 2
>baz:3 : 4 : 5

{":\\t:": " ", " bar ": "\": 2", ">baz:3": "4 : 5"}

In particular, even having read the spec carefully I expected the second one to give me a key of bar ":, but then realised this is part of the long description of special handling I pasted above - this is in fact only possible if you quote with single quotes, but even that becomes impossible if there's a single quote inside the key. If a user were to hit these weird edge cases for some reason I think they might end up being very confused by the non-standard quoting rules.

I would propose disallowing whitespace and colons in keys, and also disallowing keys that start with '>' or '-' even when they're not followed by a space. Just because it can be parsed unambiguously (with enough effort) doesn't mean it's a good idea - it's important for it to be easy to visually parse (code is usually read more times than it's written).

One other point - I'm not aware of a way to specify a multiline string that doesn't end in a newline, have you thought about that? It's the only case of it being impossible to specify a value that I'm aware of.

KenKundert commented 3 years ago

First, on multiline strings

> Hey now!

becomes "Hey now!", whereas

> Hey now!
>

becomes "Hey now!\n"
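
To make the trailing-newline behaviour concrete, here is a minimal Python sketch. It is not taken from any NestedText implementation: `join_string_items` is a hypothetical helper, and it assumes indentation has already been stripped from each line. The lines of a multiline string are joined with newlines, so a final newline only appears when the last string item is an empty `>`.

```python
def join_string_items(lines):
    """Join string items ('> text' or a bare '>') into a single string.

    Hypothetical helper for illustration; assumes indentation was removed.
    """
    parts = []
    for line in lines:
        if line == ">":
            parts.append("")          # a bare '>' contributes an empty line
        elif line.startswith("> "):
            parts.append(line[2:])    # drop the '> ' tag, keep the rest verbatim
        else:
            raise ValueError(f"not a string item: {line!r}")
    # Newlines go *between* lines, so the last line gets no terminator.
    return "\n".join(parts)


print(repr(join_string_items(["> Hey now!"])))
print(repr(join_string_items(["> Hey now!", ">"])))
```

The first call yields the string without a trailing newline; the second, with the extra empty string item, yields it with one.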

KenKundert commented 3 years ago

Spaces are quite common in keys. For example, here is a small fragment of a NestedText file that contains information about a loan:

    interest rate: 4.5%
    origination fee: 2% 

If spaces were eliminated from keys, then interest rate would need to be converted to interest_rate or interest-rate. Neither is natural for non-programmers.

Concerning other more unusual cases, I have an application that is very important to me that uses NestedText to describe programming objects using dictionaries. You can think of these objects as variables or function arguments, but they are a bit more than that. Anyway, in the simplest cases, I need to give things like the name, the type, the dimension, the units, a description, an initial value, etc. One can use something like this:

    vdd:
        type: electrical
        initial value: 2.5V
        description: supply rail voltage
        units: V

But it can be a bit heavy to type all that, so I allow it to be condensed as follows:

    electrical vdd = 2.5V: supply rail voltage (V)

I would then pick apart the key and value to find the information I need.

The problem for me comes when I specify vectors, because I need to give the first and last index values separated by a colon:

    logic signed [5:0] gain = 0: amplifier gain (dB)

So far I have not needed to quote the key because there is no space after the colon, but I am very close.

As for commas, it is common for keys to contain names, but sometimes I also want to allow the user to provide aliases, so I would allow something like this:

    current signal, isig, i: analog current signal

These are examples of things I like to do with NestedText. I treat NT as a lexer that performs the first step of a big parsing task; it breaks the input down into small easily recognized pieces. Then the challenge for me becomes: how to structure the data so that it is easiest to enter and to read. And often I end up trying to convey multiple pieces of information in both the key and the value. This increases the density of information, and can make the information much less tedious to enter and read.

LewisGaul commented 3 years ago

I overlooked the multiline strings handling in my implementation - makes sense, I'll fix that up. EDIT: I had implemented it correctly after all, not sure what confused me here! I guess line endings from all other lines are maintained, so it feels a little inconsistent not to for the last line, but it does just make sense (otherwise the only way to do it is for it to be on the last line in the file!).

Thanks for spelling out examples of keys with special characters.

I guess if it were me I'd be fine with initial_value or initial-value. I also find it much easier to visually parse when there are no spaces in keys, and would have thought this would apply at least as strongly to non-programmers. Having to learn that keys can't have spaces seems like a reasonable thing to expect non-programmers to learn.

In the other 'condensing' cases it's not clear to me why you need to use objects at all, why not use a list instead? The split between key and value seems somewhat arbitrary based on the original object definition you showed.

- logic signed [5:0] gain = 0: amplifier gain (dB)
- current signal, isig, i: analog current signal

The application could then do the split on the first : if required. This moves complexity out of the language spec and into the application handling, which seems to be the preference in many other aspects (such as giving no special meaning to integer/float/bool/null fields).
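
A rough sketch of what that application-side split might look like in Python. This assumes the list items have already been parsed into plain strings and that "the first :" means the first colon followed by a space; `split_item` is a hypothetical name, not part of any NestedText library.

```python
def split_item(item):
    """Split a list-item string into (key, value) at the first ': '.

    Hypothetical application-level helper, shown for illustration only.
    """
    key, sep, value = item.partition(": ")
    if not sep:
        raise ValueError(f"no ': ' separator in {item!r}")
    return key, value


items = [
    "logic signed [5:0] gain = 0: amplifier gain (dB)",
    "current signal, isig, i: analog current signal",
]

for item in items:
    key, value = split_item(item)
    print(f"{key!r} -> {value!r}")
```

Because the colon in `[5:0]` is not followed by a space, it stays inside the key and the split lands after `gain = 0`.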

KenKundert commented 3 years ago

There is more to my application than I presented here. Other non relevant factors preclude the use of lists. But that is not the point. The point is that there are valid applications of NestedText that benefit from not limiting keys to be identifiers. In my case, it was just such an application that drove the creation of NestedText.

kalekundert commented 3 years ago

For what it's worth, I agree with @LewisGaul for the most part. When I was trying to implement syntax highlighting, something like 90% of the effort and debugging went into handling dictionary keys right. It's too complicated relative to the rest of the spec, especially since (i) it doesn't even allow arbitrary keys and (ii) it's basically illegible when combined with values that also contain quotes and colons.

I like allowing spaces and punctuation in keys; applications can always decide to not use such characters if that makes sense for them, and I'm constantly frustrated by the fact that TOML only allows - and _ in unquoted keys (, and / are two characters in particular that I like to use). But I think colons in keys need to be either forbidden or escaped (e.g. preferably by ::, since \: or similar would also require adding \\ to escape \). For what it's worth, Eric Raymond agrees that escaping is more unix-y than quoting.

Maybe a good compromise would be to either split on the first occurrence of :␣ (that is, colon followed by space), or to require that :␣ be escaped as ::␣. Looking at the "gain" example, I don't think it's hard to visually parse that the key ends after the second colon, because a colon followed by a space is visually distinct from a colon sandwiched between two characters. But examples like " bar ": ": 2 are very hard to parse because you have to mentally keep track of where each quote is, where each colon is, and which quotes actually count. In contrast, escape characters are easy to parse because you just need to look at each character to know what it does.

I also like the suggestion to forbid keys starting with - or > even if they could be unambiguously parsed, to make sure the files remain easy for humans to comprehend. It would be inconsistent to forbid keys starting with those characters while allowing keys containing colons, but maybe that's justified by colons being more useful.

KenKundert commented 3 years ago

I am not overly compelled by the argument that " bar ": ": 2 is hard to interpret because the alternatives are also hard to interpret.

I am a bit more compelled by the argument that keys are currently not completely general, but I have never come close to coming up with a case that was not accepted unless I carefully designed it.

I am also compelled by the fact that the approach used in quoting is largely orthogonal to other concepts in the language, and is non standard to boot. Programmers will be surprised by it.

I don't like the idea of eliminating keys that start with - or > because I can envision applications where this would be desirable. For example, with something like AppCLI, I can imagine specifying command line options as follows:

Options:
    -c <cfgname>, --config <cfgname>:  Specifies the configuration to use.
    -d, --dry-run:                     Run Borg in dry run mode.
    -h, --help:                        Output basic usage information.
    -m, --mute:                        Suppress all output.
    -n, --narrate:                     Send emborg and Borg narration to stdout.
    -q, --quiet:                       Suppress optional output.
    -r, --relocated:                   Acknowledge that repository was relocated.
    -v, --verbose:                     Make Borg more verbose.
    --no-log:                          Do not create log file.

One possible approach to avoid quoting and escaping is to accept multiline strings as keys. Something like:

    >   ␣bar ":␣
        : 2

which would be converted to {' bar ": ': '2'} (␣ in this example represents a space). This allows for arbitrary keys and eliminates quoting, and it builds off concepts already in NestedText.

LewisGaul commented 3 years ago

I am not overly compelled by the argument that " bar ": ": 2 is hard to interpret because the alternatives are also hard to interpret.

The alternative I'd like to see is for it to be a syntax error - that's not hard to interpret :)

Your example of declaring a CLI schema is interesting, and something I've thought about a bit before. I guess my reaction to this is that I don't really see the need for it to be NestedText - you're presumably already imposing other restrictions/conventions on the format for it to be interpreted by the application, at which point I think for me this crosses the line into making sense as being treated as a custom format.

I'm not sure I like the multiline strings as keys idea, this looks like added/moved complexity where I'm looking for simplicity.

KenKundert commented 3 years ago

Lewis, Kale and I talked about this for quite a while today. The conclusion we have come to is that we should remove quoting altogether and leave multiline keys in our back pocket to pull out if there is a strong reason to do so. But we would not go as far as you recommended, in that spaces and special characters are still allowed in the keys. Thus, [5:0] would be allowed in a key, but not [5:␣0]; -c <cfgname> would be allowed, but not -␣c <cfgname>; and >>> x = 3 would be allowed, but not >␣>> x = 3. Leading white space is not allowed and trailing white space is ignored. In addition, " bar ": ": 2 will not be a syntax error; it is instead interpreted as {'" bar "': '": 2'}.

In other words, the rules become very simple: if a line starts with -␣ or -↲ it is a list item; if it starts with >␣ or >↲ it is a string item; if it contains :␣ or :↲ it is a dictionary item, and everything before the first :␣ is the key and everything after it is the value.
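
As a rough illustration of these rules, here is a minimal Python sketch. It is not taken from any NestedText implementation: `classify_line` is a hypothetical name, and the sketch assumes one line at a time with the newline already removed and indentation simply stripped rather than tracked.

```python
def classify_line(line):
    """Classify one line under the simplified rules (illustrative sketch).

    Returns ("list", value), ("string", value) or ("dict", key, value).
    """
    text = line.lstrip(" ")  # a real parser would track indentation depth
    if text == "-" or text.startswith("- "):
        return ("list", text[2:])
    if text == ">" or text.startswith("> "):
        return ("string", text[2:])
    # Everything before the first ': ' is the key; trailing white space
    # on the key is ignored so that values can be lined up.
    key, sep, value = text.partition(": ")
    if sep:
        return ("dict", key.rstrip(), value)
    if text.endswith(":"):   # colon directly followed by end of line
        return ("dict", text[:-1].rstrip(), "")
    raise ValueError(f"unrecognised line: {line!r}")


for sample in [
    "logic signed [5:0] gain = 0: amplifier gain (dB)",
    "-c <cfgname>: Specifies the configuration to use.",
    "- c <cfgname>: this is a list item",
    '" bar ": ": 2',
]:
    print(classify_line(sample))
```

Note that `-c <cfgname>` is classified as a dictionary key because the dash is not followed by a space, whereas `- c <cfgname>` becomes a list item, matching the examples given above.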

What do you think?

LewisGaul commented 3 years ago

That certainly sounds like an improvement from my perspective, thanks for giving this thought.

In summary, I interpret this to mean the following keys would be impossible to express (without being a syntax error):

This sounds like a reasonable compromise to me - I'll aim for this with the Zig implementation.

One thing I've been glossing over so far is what 'whitespace' means, and where we're strictly talking about space characters as opposed to tabs or other Unicode whitespace characters. Newline characters fall into the category of whitespace but understandably have their own rules, and there are more edge cases from the three different line-end sequences. I'll probably raise some questions on whitespace in a separate issue at some point, if you'd be interested to discuss the finer details.

If you'd like to add the Zig implementation to the list of implementations in the README then feel free - it's at the stage of being usable even if not yet perfect :)

KenKundert commented 3 years ago

Lewis, be aware that we could change our mind. I will be playing with the multi line key idea. If it looks to be simple and clean, we might just include it so as to be more general.

asb commented 3 years ago

Interesting discussion. I agree that the relative complexity of the current key-quoting rules seems out of keeping with the rest of the NestedText spec.

One thing that wasn't immediately obvious to me is why to ignore trailing whitespace on keys?

If there's a desire to increase the set of keys that can be expressed, it may be worth considering \ to escape a single character. The trade-off is of course that \ itself becomes more difficult to express, which may be irritating if Windows-style paths are used as keys.

LewisGaul commented 3 years ago

See https://github.com/KenKundert/nestedtext/issues/16#issuecomment-718904027:

All white space after a key is ignored to allow values to be lined up without affecting the keys.