KenKundert / nestedtext

Human readable and writable data interchange format
https://nestedtext.org
MIT License

Impossible to represent empty lists/objects #24

Closed LewisGaul closed 3 years ago

LewisGaul commented 3 years ago

Currently it is impossible to represent empty lists/objects in NestedText.

Relatedly, the language reference says the following about an empty document: "An empty document corresponds to an empty value of unknown type."

The reason for this is that list/object items can only be represented by the presence of lines in a certain format, and the 'absence' of lines (e.g. blank lines) is not something that can be interpreted as a particular type.

The difference with YAML here is that inline collections (a.k.a. 'flow style') are not supported. The justification for this is that it allows the simple statement that all values are interpreted as strings (just like how 'null' is always treated as a string).

Given the proposal to add multi-line keys (https://github.com/KenKundert/nestedtext/issues/23) based on a desire to make NestedText 'completely general', I'm wondering if it's been considered to add a way to express empty containers? This could be done by allowing 'flow-style'-like syntax but only permitting its use for empty containers, and requiring they be placed on their own line:

foo:
  []
bar:
  {}
baz:

This has the following nice properties:

Problems:

Related discussion about removal of flow-style in strictyaml: https://hitchdev.com/strictyaml/why/flow-style-removed/ (see Counterarguments).

KenKundert commented 3 years ago

I believe I said, or at least intended to say, that it is desirable for NestedText to be completely general in the sense that it supports any hierarchical combination of dictionaries, lists, and strings where all of the leaf values are strings. Empty lists and dictionaries do not fit this description. We did consider supporting empty lists and dictionaries, but we could not find a way to fit them into the NestedText philosophy of taking text literally. If we take YAML's lead and support flow-style, then it becomes ambiguous whether [] is a string or empty list. Same with {} of course.

We addressed this issue at the top level because an empty file is ambiguous. We could have made it an error, but we decided not to because in the Python implementation you can specify the top-level type you are expecting when you read a file. That resolves the ambiguity.

LewisGaul commented 3 years ago

> We did consider supporting empty lists and dictionaries, but we could not find a way to fit them into the NestedText philosophy of taking text literally. If we take YAML's lead and support flow-style, then it becomes ambiguous whether [] is a string or empty list. Same with {} of course.

Doesn't my suggestion provide one way of satisfying this? String values must either be on the same line as a list marker (-) or object key (foo:), or be preceded by the [multiline] string marker >. If you enforce [] and {} being on their own line then the string handling is unaffected - this is then just two new line types.

A list of (not-ignored) line types would then be:

- one-line string
-
  - list
-
  key: value
-
  > multi-line string
-
  # Empty list:
  []
-
  # Empty object:
  {}
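
To make the "two new line types" idea concrete, here is a rough sketch of a line classifier that recognises [] and {} only when they sit alone on a line. It is illustrative only: `classify_line` and the kind names are hypothetical, it ignores indentation and multi-line keys, and it is not how any NestedText implementation actually tokenises input.

```python
def classify_line(line: str) -> str:
    """Classify one line of a NestedText-like document (simplified sketch).

    The "empty_list" and "empty_dict" kinds are the proposal from this
    thread; they are not part of NestedText.
    """
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    # Proposed new line types: only recognised when alone on the line.
    if stripped == "[]":
        return "empty_list"
    if stripped == "{}":
        return "empty_dict"
    if stripped == "-" or stripped.startswith("- "):
        return "list_item"
    if stripped == ">" or stripped.startswith("> "):
        return "string_item"
    if stripped.endswith(":") or ": " in stripped:
        return "dict_item"
    return "unrecognized"
```

With this rule, `- []` is still a list item whose value is the literal string `[]`, and `key: {}` is still a dict item whose value is the string `{}`; only a bare `[]` or `{}` line gets the new meaning.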

Just a thought, for completeness - this seems like something that could definitely be useful in some cases. I don't feel overly strongly about it though.

KenKundert commented 3 years ago

Yes, I guess it does. And as you say it might be useful, but all it is doing is implying type, which is counter to the NestedText ethos. And it introduces two new tokens, [] and {}, and those tokens appear to invite people to put values within them.

I am not sure this enhancement is a good idea.

LewisGaul commented 3 years ago

> all it is doing is implying type

I somewhat agree... I suppose your suggestion is for the application to have the responsibility of interpreting a string as an empty list/object (which would perhaps be indicated by an empty string)? It could be argued that lists/objects are structure more than types, and after all, NestedText already supports them (otherwise the text wouldn't be able to nest!).

> And it introduces two new tokens, [] and {}, and those tokens appear to invite people to put values within them.

I agree with this concern.

I don't have an actual use-case, so as I said I don't feel overly strongly about it either way. Just thought worth raising for discussion as I hadn't seen this possibility mentioned anywhere. I don't mind if you just want to close the issue.

One minor related point on the spec for an empty document... You said that the type of an empty document (string/list/object) is intentionally ambiguous so that "in the Python implementation you can specify the top-level type you are expecting when you read a file". However, aren't you also saying that the application (aka 'Python implementation') can interpret any string values as types, e.g. a particular leaf could be defined as a list type in the application schema, and in this case it could make sense for an empty string to be interpreted as an empty list? That doesn't mean that the empty string leaf has an ambiguous type in NestedText - it's a string, same as always. Why are the rules different at the root level? Can we tighten up the spec slightly by saying an empty file is treated as an empty string? This doesn't prohibit applications from interpreting that empty string as an empty container, and is entirely consistent with the rest of the spec.

KenKundert commented 3 years ago

The problem with just defining an empty file to be a string is that if the application is expecting and assumes a dictionary or list for the top level and if it opens an empty file and gets a string, it will likely crash. The behavior for the Python implementation is that if you specify the type of the top-level object and the file is empty, you get an empty object of the expected type. It was felt that this approach was less demanding for the application writers: just specify what you want and the package does the proper coercion or error reporting.
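
The coercion described above can be sketched roughly as follows. This is an illustration of the behaviour as described in this comment, not the package's actual code: `coerce_empty_top`, the `top` argument and its string values are assumptions made for the example, and `None` stands in for "the document was empty".

```python
def coerce_empty_top(value, top="any"):
    """Return the parsed document, coerced to the expected top-level type.

    value: the parsed document, or None if the document was empty.
    top:   "dict", "list", "str", or "any" (hypothetical names).
    """
    constructors = {"dict": dict, "list": list, "str": str}
    if top == "any":
        # No expectation given: pass the result through unchanged.
        return value
    expected = constructors[top]
    if value is None:
        # Empty document: hand back an empty object of the requested type.
        return expected()
    if not isinstance(value, expected):
        raise TypeError(
            f"expected top-level {top}, found {type(value).__name__}"
        )
    return value


print(coerce_empty_top(None, "dict"))  # {}
print(coerce_empty_top(None, "list"))  # []
```

The point of the design is that the application states its expectation once, and either gets a value of that type or a clear error, rather than having to special-case the empty file itself.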

kalekundert commented 3 years ago

I haven't read everything in this thread carefully, but I'll briefly add that I frequently use empty strings to represent empty dicts/lists. (My primary application is defining parameters for unit tests, and for any parameter that's a dict/list, it's inevitable that an empty dict/list will be a good test case.) In these cases, I just instruct the schema to interpret the empty string as an empty dict/list. This is pretty easy and I think fits well with the overall philosophy of nestedtext. I like the syntax you came up with, but I think the disadvantages of adding it would outweigh the advantages.
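
As a rough illustration of the "empty string means empty dict/list" convention described above, here is a minimal hand-rolled schema step. The helper names (`as_list`, `as_dict`, `apply_schema`) and the sample data are hypothetical; a real application would more likely express this through whatever schema library it already uses.

```python
def as_list(value):
    """Interpret a leaf as a list: '' means an empty list."""
    if value == "":
        return []
    if isinstance(value, list):
        return value
    raise ValueError(f"expected a list, found {value!r}")


def as_dict(value):
    """Interpret a leaf as a dict: '' means an empty dict."""
    if value == "":
        return {}
    if isinstance(value, dict):
        return value
    raise ValueError(f"expected a dict, found {value!r}")


def apply_schema(data, schema):
    """Apply per-key converters; keys without a converter stay as loaded."""
    return {key: schema.get(key, lambda v: v)(value) for key, value in data.items()}


# What a loader would plausibly return for keys written with no value.
loaded = {"name": "case 1", "args": "", "options": ""}
schema = {"args": as_list, "options": as_dict}
print(apply_schema(loaded, schema))
# {'name': 'case 1', 'args': [], 'options': {}}
```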

Perhaps it would be more parsimonious to also apply the same thinking to the top-level data structure. Our thinking as I recall it was that type ambiguities at the top level would be more likely to cause problems than at lower levels, and that it's generally good if function return types don't depend on user input, so it made sense to provide a means for the programmer to specify the top level type.

LewisGaul commented 3 years ago

Yeah, I can see the reasoning, and I think I'm convinced that adding my proposed syntax isn't worth it.

I would like to remove the ambiguity over the empty file though - I'm not particularly convinced by the argument against that.

I've just got zig-nestedtext running against the testsuite and discovered the empty1 testcase which expects reading in an empty NestedText file to dump out as null in JSON. If a NestedText parser is supposed to treat an empty file as any of 'empty string', 'empty list', or 'empty object', then how is that supposed to get back to a 'null' in JSON? Surely there isn't an expectation that empty strings get translated to null?

In truth, I originally had zig-nestedtext parsing an empty file as a null object, but this seems fundamentally wrong as it effectively corresponds to an empty file being absence of a NestedText value, when really it should be interpreted as an empty value. I've now changed it and had to arbitrarily pick that it should be a string.

Relevant code in zig-nestedtext: the nestedtext.Value type, and the selection of the string type for an empty file.

> it's generally good if function return types don't depend on user input

I think this statement is perhaps based on familiarity with Python - it doesn't apply to languages that have union types such as Zig, where I'm having to arbitrarily pick one of the types (or could let the user specify, but there seems to be little value in doing so in the absence of applying a schema).

kalekundert commented 3 years ago

Regarding empty files: I agree that the most logically consistent thing to do would be to parse it as an empty string. The problem is that, perhaps only in dynamically-typed languages like Python, this also creates a bug magnet. I think it's fair to assume that the top-level data structure in most applications will be a dictionary. The developers for such applications are expected to provide a schema detailing how to interpret that dictionary, but it's easy to forget that that schema needs to explicitly include an instruction to interpret an empty string as an empty dictionary. Without this, the application will crash given user input that was probably meant to be valid.

To be a little more concrete about the problem, consider pydantic. I believe that pydantic is the most popular schema library for Python at the moment. However, it requires that the top-level data structure is a dictionary and has no way to specify that an empty string at the top level should be treated as an empty dictionary. Therefore, this would require an application developer using nestedtext in conjunction with pydantic to manually check if the output from nt.load() is an empty string, and to manually convert it to an empty dictionary if so, before passing it on to pydantic. We regarded this as simply too much of a gotcha/too much boilerplate for an overwhelmingly common use case, so we eventually decided to tweak the spec and the Python API to allow the application developer to specify the top-level type.
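
For illustration, the manual check described above would look something like this if an empty file were always parsed as an empty string. Everything here is a stand-in: `validate_settings` imitates a schema library that insists on a top-level dict (it is not pydantic's API), and `load_settings` receives the already-loaded value rather than calling a real loader.

```python
def validate_settings(data):
    """Stand-in for a schema library that only accepts a top-level dict."""
    if not isinstance(data, dict):
        raise TypeError("top-level value must be a dict")
    return {key: str(value) for key, value in data.items()}


def load_settings(loaded):
    """Validate a loaded document, treating '' (empty file) as an empty dict."""
    # The easy-to-forget boilerplate: without this guard, an empty file
    # (loaded as "") would make validate_settings raise TypeError.
    if loaded == "":
        loaded = {}
    return validate_settings(loaded)


print(load_settings(""))             # {}
print(load_settings({"debug": "1"})) # {'debug': '1'}
```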

Perhaps a better way to word the spec would be to say that the type of an empty document is implementation-dependent, rather than saying it is unknown. After all, we chose the behavior we did to accommodate a specific problem in Python. But I can see that it could make sense to treat an empty file as an empty string in a statically-typed language that could force the programmer to handle the top-level string case.

LewisGaul commented 3 years ago

Thanks for explaining the thought process, I can see where you're coming from. FWIW I was already interpreting the spec as saying the type of an empty file could be implementation dependent, so that rewording makes no difference to me - I'm just not a big fan of there being undefined behaviour in the spec.

Maybe the spec could be reworded just slightly to say something like "by default an empty file represents an empty string, but an implementation may choose to allow the user to specify the root document type as list/dict, in which case an empty file would be interpreted as an empty list/dict"? Admittedly, having the Python implementation then default to the dict type (which I understand you're saying is desirable) might then be considered a bit dubious. Something in this direction would be good from my perspective though.

Can we do something about the empty1 case in the testsuite I mentioned? I can't see any way it makes sense for an empty NestedText file to end up translating to a JSON null. I think either the type of an empty file should be well-defined and map to one of "", [], or {}, or this testcase needs to be removed since there's no well-defined conversion.

kalekundert commented 3 years ago

Regarding the flow-style syntax: This is actually growing on me. I think @KenKundert is right that allowing [] and {} would invite people to put values in them. I initially regarded this as a deal-breaker, but on further thought I think there are actually some good arguments in favor of adding a single-line list/dict syntax:

I should also be clear about the exact syntax that I have in mind:

There are also arguments against this syntax, although I haven't thought of any yet that I find very compelling:

kalekundert commented 3 years ago

Regarding the empty test case: Yeah, I'm not even sure how that test passes at the moment. I'll have to look more closely at it, but the suggestion to remove it makes sense.

LewisGaul commented 3 years ago

I think I'm +1 on supporting simple flow-style as you propose. I considered this a potential extension to my original proposal - perhaps I should have mentioned it!

The two most compelling points for me are:

Haven't thought too hard about the delimiter and parsing of whitespace/empty values etc., there's definitely some thought that would be needed there, especially in the absence of quotes. I'd be half tempted to disallow whitespace in keys/values in flow-style (and then have ,_ as the delimiter).

KenKundert commented 3 years ago

Concerning the issue of empty files. The way I think of this is that NestedText supports four data types: string, list, dict, and unknown. The unknown data type only occurs at the top level and only when an empty file is encountered. How an implementation represents each of these four types is its choice. In the Python implementation I chose str, list, dict, and None. In the NT test cases, Kale chose JSON's string, list, dictionary, and null. Another viable approach would be to use the language's native objects for strings, lists, and dictionaries, and then raise an exception for unknown.
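
A small sketch of the two representations mentioned above, assuming `None` is used for the unknown type as in the Python implementation. The function names are hypothetical. The first shows why an empty document comes out as null when dumped to JSON; the second shows the alternative of raising an exception for unknown.

```python
import json


def to_json(value):
    """Dump a parsed document to JSON; None (unknown) becomes null."""
    return json.dumps(value)


def require_known(value):
    """Alternative approach: refuse the unknown type instead of representing it."""
    if value is None:
        raise ValueError("empty document: top-level type is unknown")
    return value


print(to_json(None))  # null
print(to_json(""))    # ""
print(to_json([]))    # []
print(to_json({}))    # {}
```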

KenKundert commented 3 years ago

This issue has been addressed in v2.0.