Impossible to represent empty lists/objects

LewisGaul commented 3 years ago

Currently it is impossible to represent empty lists/objects in NestedText.

Relatedly, the language reference says the following about an empty document: "An empty document corresponds to an empty value of unknown type.".

The reason for this is that list/object items can only be represented by the presence of lines in a certain format, and the 'absence' of lines (e.g. blank lines) is not something that can be interpreted as a particular type.

The difference with YAML here is that inline collections (a.k.a. 'flow style') are not supported. The justification for this is it allows the simple statement that all values are interpreted as strings (just like how 'null' is always treated as a string).

Given the proposal to add multi-line keys (https://github.com/KenKundert/nestedtext/issues/23) based on a desire to make NestedText 'completely general', I'm wondering if it's been considered to add a way to express empty containers? This could be done by allowing 'flow-style'-like syntax but only permitting its use for empty containers, and requiring they be placed on their own line:

foo:
  []
bar:
  {}
baz:

This has the following nice properties:

- Already valid YAML syntax
- Backwards compatible change (the meaning of all previously valid syntax remains unchanged)
- Maintains the property of every line type being identifiable without context of other lines
- Could potentially provide a way to disambiguate the meaning of an empty file (make it an empty string, given the new way to represent empty collections)

Problems:

What to do with a file containing only [] or {}?
- I was going to propose that this should provide a way for a file to represent an empty collection, but actually this would be backwards incompatible as it's currently interpreted as a string.
- Note, however, that this wouldn't prevent a file from corresponding to those strings, since > [] or > {} could be used.

Related discussion about removal of flow-style in strictyaml: https://hitchdev.com/strictyaml/why/flow-style-removed/ (see Counterarguments).

KenKundert commented 3 years ago

I believe I said, or at least intended to say, that it is desirable for NestedText to be completely general in the sense that it supports any hierarchical combination of dictionaries, lists, and strings where all of the leaf values are strings. Empty lists and dictionaries do not fit this description. We did consider supporting empty lists and dictionaries, but we could not find a way to fit them into the NestedText philosophy of taking text literally. If we take YAML's lead and support flow-style, then it becomes ambiguous whether [] is a string or empty list. Same with {} of course.

We addressed this issue at the top level because an empty file is ambiguous. We could have made it an error, but we decided not to because in the Python implementation you can specify the top-level type you are expecting when you read a file. That resolves the ambiguity.

LewisGaul commented 3 years ago

> We did consider supporting empty lists and dictionaries, but we could not find a way to fit them into the NestedText philosophy of taking text literally. If we take YAML's lead and support flow-style, then it becomes ambiguous whether [] is a string or empty list. Same with {} of course.

Doesn't my suggestion provide one way of satisfying this? String values must either be on the same line as a list marker (-) or object key (foo:), or be preceded by the [multiline] string marker >. If you enforce [] and {} being on their own line then the string handling is unaffected - this is then just two new line types.

A list of (not-ignored) line types would then be:

- one-line string
-
  - list
-
  key: value
-
  > multi-line string
-
  # Empty list:
  []
-
  # Empty object:
  {}
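
A minimal Python sketch of the idea that each line type, including the two proposed ones, can be identified from the line alone. The function name `classify_line` and the type labels are hypothetical, and the rules are simplified for illustration rather than taken from any NestedText implementation:

```python
def classify_line(line: str) -> str:
    """Name a line's type using only the line's own text (simplified rules)."""
    text = line.strip()
    if not text:
        return "blank"
    if text.startswith("#"):
        return "comment"
    if text == "[]":
        return "empty list"   # proposed new line type
    if text == "{}":
        return "empty dict"   # proposed new line type
    if text == "-" or text.startswith("- "):
        return "list item"    # any text after "- " is a plain string
    if text == ">" or text.startswith("> "):
        return "string item"  # any text after "> " is a plain string
    if text.endswith(":") or ": " in text:
        return "dict item"
    return "unrecognized"
```

Under these assumptions `- []` and `> []` are still classified as ordinary string-carrying lines, so only a `[]` or `{}` standing alone on its line takes on the new meaning.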

Just a thought, for completeness - this seems like something that could definitely be useful in some cases. I don't feel overly strongly about it though.

KenKundert commented 3 years ago

Yes, I guess it does. And as you say it might be useful, but all it is doing is implying type, which is counter to the NestedText ethos. And it introduces two new tokens, [] and {}, and those tokens appear to invite people to put values within them.

I am not sure this enhancement is a good idea.

LewisGaul commented 3 years ago

> all it is doing is implying type

I somewhat agree... I suppose your suggestion is for the application to have the responsibility of interpreting a string as an empty list/object (which would perhaps be indicated by an empty string)? It could be argued that lists/objects are structure more than types, and after all, NestedText already supports them (otherwise the text wouldn't be able to nest!).

> And it introduces two new tokens, [] and {}, and those tokens appear to invite people to put values within them.

I agree with this concern.

I don't have an actual use-case, so as I said I don't feel overly strongly about it either way. Just thought worth raising for discussion as I hadn't seen this possibility mentioned anywhere. I don't mind if you just want to close the issue.

One minor related point on the spec for an empty document... You said that the type of an empty document (string/list/object) is intentionally ambiguous so that "in the Python implementation you can specify the top-level type you are expecting when you read a file". However, aren't you also saying that the application (aka 'Python implementation') can interpret any string values as types, e.g. a particular leaf could be defined as a list type in the application schema, and in this case it could make sense for an empty string to be interpreted as an empty list? That doesn't mean that the empty string leaf has an ambiguous type in NestedText - it's a string, same as always. Why are the rules different at the root level? Can we tighten up the spec slightly by saying an empty file is treated as an empty string? This doesn't prohibit applications from interpreting that empty string as an empty container, and is entirely consistent with the rest of the spec.

KenKundert commented 3 years ago

The problem with just defining an empty file to be a string is that if the application is expecting and assumes a dictionary or list for the top level and if it opens an empty file and gets a string, it will likely crash. The behavior for the Python implementation is that if you specify the type of the top-level object and the file is empty, you get an empty object of the expected type. It was felt that this approach was less demanding for the application writers: just specify what you want and the package does the proper coercion or error reporting.
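
A rough sketch of the coercion described above, written as a standalone helper rather than the package's actual code. The name `resolve_top` is hypothetical, and `None` is assumed to stand for the result of parsing an empty document:

```python
def resolve_top(value, top=None):
    """Coerce a parsed document to the top-level type the caller expects.

    value: the parsed document, or None if the document was empty (assumption).
    top:   dict, list, str, or None when the caller states no expectation.
    """
    if top is None:
        return value
    if value is None:
        # Empty document: hand back an empty object of the requested type.
        return top()
    if not isinstance(value, top):
        raise TypeError(
            f"expected {top.__name__} at top level, got {type(value).__name__}"
        )
    return value
```

With this shape, `resolve_top(None, dict)` yields `{}` and `resolve_top(None, list)` yields `[]`, while a non-empty document of the wrong type is reported as an error instead of reaching the application.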

kalekundert commented 3 years ago

I haven't read everything in this thread carefully, but I'll briefly add that I frequently use empty strings to represent empty dicts/lists. (My primary application is defining parameters for unit tests, and for any parameter that's a dict/list, it's inevitable that an empty dict/list will be a good test case.) In these cases, I just instruct the schema to interpret the empty string as an empty dict/list. This is pretty easy and I think fits well with the overall philosophy of nestedtext. I like the syntax you came up with, but I think the disadvantages of adding it would outweigh the advantages.
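
A minimal sketch of that schema-side approach, using plain Python instead of any particular schema library. The helper names `to_list`, `to_dict`, and `apply_schema`, and the parameter names in the example, are all hypothetical:

```python
def to_list(value):
    """Interpret an empty string as an empty list; pass real lists through."""
    if value == "":
        return []
    if isinstance(value, list):
        return value
    raise ValueError(f"expected a list, got {value!r}")


def to_dict(value):
    """Interpret an empty string as an empty dict; pass real dicts through."""
    if value == "":
        return {}
    if isinstance(value, dict):
        return value
    raise ValueError(f"expected a dict, got {value!r}")


def apply_schema(data, schema):
    """Apply a per-key converter where one is given; leave other values as-is."""
    return {
        key: schema[key](val) if key in schema else val
        for key, val in data.items()
    }


# What a parser would return for keys written with no value after the colon.
parsed = {"name": "case 1", "inputs": "", "options": ""}
params = apply_schema(parsed, {"inputs": to_list, "options": to_dict})
```

Here `params` ends up with `inputs` as `[]` and `options` as `{}`, while `name` stays a string.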

Perhaps it would be more parsimonious to also apply the same thinking to the top-level data structure. Our thinking as I recall it was that type ambiguities at the top level would be more likely to cause problems than at lower levels, and that it's generally good if function return types don't depend on user input, so it made sense to provide a means for the programmer to specify the top level type.

LewisGaul commented 3 years ago

Yeah, I can see the reasoning, and I think I'm convinced that adding my proposed syntax isn't worth it.

I would like to remove the ambiguity over the empty file though - I'm not particularly convinced by the argument against that.

I've just got zig-nestedtext running against the testsuite and discovered the empty1 testcase which expects reading in an empty NestedText file to dump out as null in JSON. If a NestedText parser is supposed to treat an empty file as any of 'empty string', 'empty list', or 'empty object', then how is that supposed to get back to a 'null' in JSON? Surely there isn't an expectation that empty strings get translated to null?

In truth, I originally had zig-nestedtext parsing an empty file as a null object, but this seems fundamentally wrong as it effectively corresponds to an empty file being absence of a NestedText value, when really it should be interpreted as an empty value. I've now changed it and had to arbitrarily pick that it should be a string.

nestedtext.Value type Selecting string type for empty file

> it's generally good if function return types don't depend on user input

I think this statement is perhaps based on familiarity with Python - it doesn't apply to languages that have union types such as Zig, where I'm having to arbitrarily pick one of the types (or could let the user specify, but there seems to be little value in doing so in the absence of applying a schema).

kalekundert commented 3 years ago

Regarding empty files: I agree that the most logically consistent thing to do would be to parse it as an empty string. The problem is that (perhaps only in dynamically-typed languages like python) this also creates a bug magnet. I think it's fair to assume that the top-level data structure in most applications will be a dictionary. The developers for such applications are expected to provide a schema detailing how to interpret that dictionary, but it's easy to forget that the schema needs to explicitly include an instruction to interpret an empty string as an empty dictionary. Without this, the application will crash given user-input that was probably meant to be valid.

To be a little more concrete about the problem, consider pydantic. I believe that pydantic is the most popular schema library for python at the moment. However, it requires that the top-level data structure is a dictionary and has no way to specify that an empty string at the top level should be treated as an empty dictionary. Therefore, this would require an application developer using nestedtext in conjunction with pydantic to manually check if the output from nt.load() is an empty string, and to manually convert it to an empty dictionary if so, before passing it on to pydantic. We regarded this as simply too much of a gotcha/too much boilerplate for an overwhelmingly common use case, so we eventually decided to tweak the spec and the python API to allow the application developer to specify the top-level type.

Perhaps a better way to word the spec would be to say that the type of an empty document is implementation-dependent, rather than saying it is unknown. After all, we chose the behavior we did to accommodate a specific problem in python. But I can see that it could make sense to treat an empty file as an empty string in a statically-typed language that could force the programmer to handle the top-level string case.

LewisGaul commented 3 years ago

Thanks for explaining the thought process, I can see where you're coming from. FWIW I was already interpreting the spec as saying the type of an empty file could be implementation dependent, so that rewording makes no difference to me - I'm just not a big fan of there being undefined behaviour in the spec.

Maybe the spec could be reworded just slightly to say something like "by default an empty file represents an empty string, but an implementation may choose to allow the user to specify the root document type as list/dict, in which case an empty file would be interpreted as an empty list/dict"? Admittedly, having the Python implementation then default to the dict type (which I understand you're saying is desirable) might then be considered a bit dubious. Something in this direction would be good from my perspective though.

Can we do something about the empty1 case in the testsuite I mentioned? I can't see any way it makes sense for an empty NestedText file to end up translating to a JSON null. I think either the type of an empty file should be well-defined and map to one of "", [], or {}, or this testcase needs to be removed since there's no well-defined conversion.

kalekundert commented 3 years ago

Regarding the flow-style syntax: This is actually growing on me. I think @KenKundert is right that allowing [] and {} would invite people to put values in them. I initially regarded this as a deal-breaker, but on further thought I think there are actually some good arguments in favor of adding a single-line list/dict syntax:

- The lack of flow-style was one of the primary complaints we saw when we first introduced the spec. I think there are two compelling forms of this complaint:
  1. Nestedtext is meant to be easy for humans to read/write, but it's unnatural to read/write small collections of values that are spread out over multiple lines.
  2. Nestedtext files often end up "tall and narrow" relative to a typical 80-char wide window, and as a result don't fit as much information on the screen as formats such as YAML or TOML. This makes it harder to grok all the information in such files.

  Our response to these complaints was that you can always parse list/dict values from string values (and in fact I often do exactly this). This response is kind-of a cop-out, though, since clearly the purpose of nestedtext is to encode the structure of the data.
- This would create a parallel to strings, which already have single- and multi-line forms.
- It would become possible to specify empty lists/dicts. I don't know if this is really much of a benefit, for the reasons discussed above, but it is something. It would at least make some schemas a bit simpler.

I should also be clear about the exact syntax that I have in mind:

List values and dict key/value pairs would be separated by commas, and would not be allowed to contain any of the following characters: ,[]{}. The prohibition on [] and {} would apply to both dicts and lists, to leave open the possibility of supporting nested flow-style data structures. I'm not sure if the item separator should be , or ,␣. A bare , is what I would initially expect, but splitting on ,␣ and allowing values that contain a , not followed by ␣ would be more in line with how nestedtext parses dictionary keys and list items. My instinct would be to simply split on , though.

List:
```
[a, b, c]
```
Dictionary:
```
{k1: v1, k2: v2}
```
Only single-line flow-style data structures would be supported. There's no need to support multi-line flow-style, since nestedtext already has (nicer) syntax for that. This also greatly simplifies parsing and maintains the property that each line type can be identified just by looking at its first character.

Not ok:
```
[
a,
b,
c,
]
```
Nested flow-style data structures would not be allowed. This is a restriction that could probably be lifted in the future, but for now would make the implementation simpler and would not affect many use-cases.

Not ok:
```
[[1, 2, 3], [4, 5, 6]]
```
I'm not sure how to handle trailing commas, e.g. [a, b, c,]. Most formats (e.g. python, TOML) ignore them, but the point of this is to make it easy to add/remove lines from multi-line data structures, which wouldn't apply here. These same formats also require quotes or brackets or something to identify a value, and so a trailing comma is clearly distinct from a comma with a value after it. But in nestedtext a trailing comma could be reasonably interpreted as a comma followed by an empty string. My instinct would be to go with that interpretation, but I'm not sure.
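
The rules above can be sketched in Python roughly as follows. This is only an illustration of the proposal, not an existing parser: the name `parse_flow` is hypothetical, items are split on a bare `,` and stripped of surrounding whitespace, dict items are split on their first `:`, and a trailing comma in a list is read as a final empty string (all of these being choices the comment leaves open):

```python
def parse_flow(line: str):
    """Parse a single-line flow-style list or dict whose values are all strings."""
    text = line.strip()
    if len(text) < 2 or text[0] + text[-1] not in ("[]", "{}"):
        raise ValueError("not a single-line flow-style value")
    is_list = text[0] == "["
    body = text[1:-1]
    # Brackets and braces are prohibited inside values, which rules out nesting.
    if any(ch in "[]{}" for ch in body):
        raise ValueError("nested flow-style values are not allowed")
    if body.strip() == "":
        return [] if is_list else {}  # the empty containers
    items = [item.strip() for item in body.split(",")]
    if is_list:
        return items
    result = {}
    for item in items:
        key, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in dict item {item!r}")
        result[key.strip()] = value.strip()
    return result
```

Because a multi-line form starts with a line holding only `[`, it is rejected by the first check, which matches the single-line restriction.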

There are also arguments against this syntax, although I haven't thought of any yet that I find very compelling:

- It does provide two ways to do things, which goes against the "there should be one-- and preferably only one --obvious way to do it" philosophy that I generally subscribe to. However, I think it would be pretty obvious to use the flow-style for short data structures and the multiline style for longer ones. You could also argue that the lack of flow style is non-obvious, because most people expect these kinds of structured data formats to have it.
- The strictyaml thread linked above argues that flow-style hampers readability, but I don't really buy that. Ultimately the author of a file is responsible for keeping it readable, and I think the flow-style syntax would provide a tool that could be used to help with that (even if it could also be misused). The requirement for flow-style data structures to fit on a single line also significantly limits the scope for abuse.
- The strictyaml thread also mentions that braces can complicate templating. I don't disagree with this, but I don't think it's a major concern. It shouldn't often be necessary to template nestedtext files, and in any case you can always tell a templating engine to use different brace characters.

kalekundert commented 3 years ago

Regarding the empty test case: Yeah, I'm not even sure how that test passes at the moment. I'll have to look more closely at it, but the suggestion to remove it makes sense.

LewisGaul commented 3 years ago

I think I'm +1 on supporting simple flow-style as you propose. I considered this a potential extension to my original proposal - perhaps I should have mentioned it!

The two most compelling points for me are:

- Being able to represent empty containers
- Concisely writing short/nested containers, e.g. this argument on strictyaml about writing matrices: https://github.com/crdoconnor/strictyaml/issues/20

Haven't thought too hard about the delimiter and the parsing of whitespace/empty values etc., there's definitely some thought that would be needed there, especially in the absence of quotes. I'd be half tempted to disallow whitespace in keys/values in flow-style (and then have ,_ as the delimiter).

KenKundert commented 3 years ago

Concerning the issue of empty files. The way I think of this is that NestedText supports four data types: string, list, dict, and unknown. The unknown data type only occurs at the top-level and only when an empty file is encountered. In each case, how an implementation represents each of these four types is its choice. In the Python implementation I chose str, list, dict, and None. In the NT test cases, Kale chose JSON's string, list, dictionary, and null. Another viable approach would be to use the language's native objects for strings, lists, and dictionaries, and then raise an exception for unknown.
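
To make the mapping concrete, here is a small sketch using only Python's standard `json` module. It assumes, as described above, that `None` represents the unknown type of an empty document; the helper name `expected_json` is hypothetical:

```python
import json


def expected_json(value):
    """Serialize a parsed document; None stands for the 'unknown' type."""
    return json.dumps(value)


candidates = {
    "unknown": None,      # empty document, no top-level type specified
    "empty string": "",
    "empty list": [],
    "empty dict": {},
}
rendered = {name: expected_json(value) for name, value in candidates.items()}
```

Under that assumption only the unknown case serializes to `null`; the three empty values of known type serialize to `""`, `[]`, and `{}` respectively.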

KenKundert commented 3 years ago

This issue has been addressed in v2.0.

KenKundert / nestedtext

Impossible to represent empty lists/objects #24