YAML's main issue, semantic whitespace, still a problem

brunoborges commented 3 years ago

Subject says it all.

KenKundert commented 3 years ago

Can you give me an example?

KenKundert commented 3 years ago

Oh, you mean hierarchy enforced by indentation? Yes. It is a baked-in design choice. So if that is your issue, you can close this issue now as it is not changing.

ndvo commented 3 years ago

Perhaps @brunoborges is referring to interpreting blank values after the colon as the field value, as described in the documentation and copied below.

    treasurer:
        name: Fumiko Purvis
        address:␣␣
            > 3636 Buffalo Ave
            > Topika, Kansas 20692

The fact that a second space is meaningful is counterintuitive and will certainly lead to problems. Empty values could be represented by:

nothing after the colon and
the end of file or new item in the next line.

KenKundert commented 3 years ago

I struggled with this issue for a while. Eventually I decided that anything beyond ':␣' is the value. That is the simplest and most flexible behavior. If that is not the case, I don't think you can specify a string value that contains only whitespace.
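A minimal sketch of the rule described above, not the actual NestedText implementation. `split_dict_item` is a hypothetical helper that assumes a simple single-line dictionary item with the indentation already removed; it ignores list items, multiline keys, and comments.

```python
def split_dict_item(line):
    """Split a simple 'key: value' line according to the rule above.

    Hypothetical helper for illustration only.
    """
    key, sep, value = line.partition(": ")
    if sep:
        # Everything after the first ': ' is the value, taken literally.
        return key, value
    if line.endswith(":"):
        # Nothing after the colon: the value is empty (or nested below).
        return line[:-1], ""
    raise ValueError(f"not a dictionary item: {line!r}")


print(split_dict_item("name: Fumiko Purvis"))  # ('name', 'Fumiko Purvis')
print(split_dict_item("address:"))             # ('address', '')
print(split_dict_item("address:  "))           # ('address', ' ')
```

The last call shows the case under discussion: a second space after the colon is already part of the value.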

ndvo commented 3 years ago

I don't think I can argue with that, but it would still be possible to have a string of blank characters using the syntax for multiline using a single line. It is certainly not simpler though.

KenKundert commented 3 years ago

Yes, you are correct. Either way there are trade-offs. My feeling is that most people will think that the way dictionaries, lists, and rest-of-line strings are specified is very natural, but the way multi-line strings are specified is artificial or arbitrary, or at least unfamiliar. So I am trying to avoid forcing people to use them except in cases where they are obviously required.

prescod commented 3 years ago

Having values that consist entirely of whitespace is quite unusual, so I don't see a problem with that special case using a slightly less intuitive syntax. It remains to be seen whether leading spaces will cause confusion in practice but I will note that "extra-whitespace problems" are some of the least intuitive problems to debug.

"Error: Cannot find Foobar named Xyzzy".

pierre-rouleau commented 3 years ago

I also think that having to use space characters at the end of a line to identify a string of whitespace will likely cause issues when people have their editor automatically strip trailing whitespace.

Would it not be acceptable to request using quotes when a string must contain leading or trailing whitespace?

KenKundert commented 3 years ago

Except in keys, quotes are not treated special. If you add quotes, they become part of the value. However, the application that incorporates a NestedText reader is free to strip off excess white space if it feels that is appropriate. In other words, NestedText is trying not to have any expectations of what a value should be other than it being a string of characters that continues to the newline.
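As a sketch of what "the application is free to strip off excess white space" could look like, the hypothetical helper below post-processes data that has already been loaded. It assumes the reader returns only nested dicts, lists, and strings; the sample data is made up for illustration.

```python
def strip_values(data):
    """Recursively strip leading/trailing white space from every string value.

    `data` is assumed to be a nested structure of dicts, lists, and strings.
    """
    if isinstance(data, dict):
        return {key: strip_values(value) for key, value in data.items()}
    if isinstance(data, list):
        return [strip_values(item) for item in data]
    return data.strip()


loaded = {"treasurer": {"name": "Fumiko Purvis  ", "tags": [" a ", "b "]}}
print(strip_values(loaded))
# {'treasurer': {'name': 'Fumiko Purvis', 'tags': ['a', 'b']}}
```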

ndvo commented 3 years ago

As it is now, NestedText uses a very simple solution: no exceptions to the rule that after : comes the value. There is great value to this. I can see how this could lead to problems with unintended extra space, but I'd like to add that:

editors can very easily highlight the extra spaces, minimizing this issue
this is basically the only syntax issue with NestedText, and therefore the first thing one would look at when facing an unexpected result.

A better solution would need to be similarly simple. Perhaps there isn't one.

pierre-rouleau commented 3 years ago

Would it not be possible to have another separator prefix to identify a quoted value? The key:value pair could, for example also be represented by a key="value"

pierre-rouleau commented 3 years ago

I am not suggesting replacing the key:value pair syntax, but adding another possibility, key="value". So in the 99.9% of cases where whitespace is not significant, the simpler key:value syntax would be used. When someone wants to identify explicit whitespace then they could use key="value" and perhaps key= "value"

The overall syntax would be explicit, and you would retain the ability/benefit of your original key:value syntax, at the expense of a little bit more code to handle the special case.

Depending on editors to handle syntactic concerns would kind of defeat and contradict the benefit of using your proposal, wouldn't it?

prescod commented 3 years ago

@pierre-rouleau Would your proposed syntax handle nested quotes?

pierre-rouleau commented 3 years ago

@prescod It all depends how the new, additional, key="value" syntax would be treated.

Originally I was thinking that it would be only used to allow the explicit creation of values that contain leading and/or trailing whitespace(s).
Since the content is already line-based, then I see 2 possibilities:
- 1: It would also be possible to state that the value starts after the first quote and ends with the last quote on the line, allowing anything in between, unchanged. That would mean allowing embedded double quotes, without the need for an escaping mechanism.
- 2: It is also possible to say that the new syntax would end on the line, but anything between the double quotes would support an escaping mechanism. Then to embed a double quote you would need to escape it. For example with a leading backslash. That would mean that:
  - A backslash would require escaping (i.e text = " Space enclosed string with one backslash: \\ " )
  - It would also become possible to specify binary characters (ASCII hard tab, BEL character, some other binary value escaped with some syntactic mechanism similar to what's available in various programming languages).
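Possibility 1 can be sketched as follows. This is only an illustration of the proposed rule; `parse_quoted_value` is a hypothetical name, and its argument is assumed to be the text that follows the proposed `=` separator.

```python
def parse_quoted_value(rest):
    """Possibility 1: the value runs from the first '"' to the last '"'.

    Embedded quotes are kept unchanged and no escaping is performed.
    """
    first = rest.find('"')
    last = rest.rfind('"')
    if first == -1 or first == last:
        raise ValueError(f"expected a quoted value: {rest!r}")
    return rest[first + 1:last]


print(repr(parse_quoted_value('"  padded  "')))         # '  padded  '
print(repr(parse_quoted_value(' "say "hi" to them"')))  # 'say "hi" to them'
```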

Of course this would not look as clean as the original key:value syntax, which would still be available for the majority of the use cases. The new key="value" would just provide a mechanism to handle the more difficult cases, with an explicit syntax. That new key = "value" syntax would be a special case. The fact that it starts with the '=' character would just mean the interpretation of the remainder of the line differs.

pierre-rouleau commented 3 years ago

@prescod BTW, allowing escaping inside the key = "value" line would mean that you could include things like a non-breaking space (the U+00A0 NO-BREAK SPACE Unicode character) as much as anything else, all depending on the escaping mechanism that would be supported.

KenKundert commented 3 years ago

We don't see NestedText as the whole solution to the 'data storage that allows human interaction' problem. Rather, it is the piece that provides the ability to store structured data without interpreting that data in any way (other than to extract its structure). This is why we don't throw away characters (stripping whitespace), nor do we treat characters special (escaping). In our mind, that is the job of the application that receives the data. Specifically, I don't see anything proposed here that could not be implemented in the end application if it is desired.

This is kind of a new concept, so we would like to give things time to play out before we start changing things.

ethanfurman commented 3 years ago

I see this as a serious issue. Human readable and significant leading/trailing whitespace in a string do not go together.

I see two possible solutions:

less preferred: if the value already assigned is composed entirely of white space, and a nested value is encountered, replace the white space value with the nested value (this solution still has the serious problem that empty strings and whitespace strings look the same)
more preferred: strip off one pair of surrounding quotes
  - test: "" ==> an empty string
  - test: " " ==> a single space
  - test: "hello" ==> the string hello (without quotes)
  - test: ""hello"" ==> the string "hello" (with quotes)
  - test: """hello""" ==> the string ""hello"" (with double quotes)
  - test: "hel"lo" ==> the string hel"lo (with a single embedded quote)

Ah, I see @pierre-rouleau already suggested this.
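The "more preferred" rule above is small enough to sketch directly. `strip_quote_pair` is a hypothetical name, and the function is assumed to receive the value exactly as it is read today.

```python
def strip_quote_pair(value):
    """Remove one pair of surrounding double quotes, if both are present."""
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    return value


for raw in ['""', '" "', '"hello"', '""hello""', '"""hello"""', '"hel"lo"']:
    print(raw, "==>", repr(strip_quote_pair(raw)))
```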

Please make this change. I really like the NestedText idea, but will never use/support this format while it contains this flaw.

(Thank you @KenKundert for considering this.)

KenKundert commented 3 years ago

Human readable and significant leading/trailing whitespace in a string do not go together.

I find that statement too vague. Precisely what problem are you trying to solve?

KenKundert commented 3 years ago

All of your examples seem to be for end-of-line strings. How does your proposal affect multiline strings?

KenKundert commented 3 years ago

How would you propose to handle the following cases:

key:␣␣␣value␣␣␣
key:␣␣␣"value"␣␣␣

Presumably you would strip the leading and trailing white space in both cases and end up with {'key': 'value'}.

key:␣␣␣"value␣␣␣
key:␣␣␣value"␣␣␣

In these cases, would you get {'key': '"value'} and {'key': 'value"'}?

ethanfurman commented 3 years ago

Human readable and significant leading/trailing whitespace in a string do not go together.

I find that statement too vague. Precisely what problem are you trying to solve?

The second example from the gotchas in the documentation: trailing spaces meant the value was a single space, causing an error when the intended value was read on the next line(s). The point being that a human looking at those lines cannot see the trailing spaces (and IDE support is not always available).

All of your examples seem to be for end-of-line strings. How does your proposal affect multiline strings?

The same: if there should be actual white space in a multiline string then it should be quoted:

address:
    > line 1
    >
    > final line

would result in {'address': 'line 1\n\nfinal line'} while

address:
    > line 1
    > "   "
    > final line

would result in {'address': 'line 1\n \nfinal line'}.

How would you propose to handle the following cases:

key:␣␣␣value␣␣␣

key:␣␣␣"value"␣␣␣

Presumably you would strip the leading and trailing white space in both cases and end up with {'key': 'value'}.

Correct.

key:␣␣␣"value␣␣␣

key:␣␣␣value"␣␣␣

In these cases, would you get {'key': '"value'} and {'key': 'value"'}?

Yes.

Thinking through this some more, I would suggest that lack of a leading/trailing double-quote means:

leading/trailing white space is ignored
all characters are literal (no escapes)

If a leading and a trailing double-quote is found, then:

enable string-processing
- retain any quoted leading/trailing white space
- allow escapes, such as \t for tab or \x20 for a space, etc.

I think this would give us the best of both worlds -- simplicity of reading/writing for the majority of cases while allowing finer control when necessary.
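A minimal sketch of the refined rule, for illustration only. `interpret_value` is a hypothetical name; it receives the raw text after the tag. The thread names only `\t` and `\x20` as escapes, so the rest of the escape table here is an assumption.

```python
import re

# Escapes recognised by this sketch (assumed, apart from \t and \xHH).
_SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "\\": "\\", '"': '"'}
_ESCAPE_RE = re.compile(r'\\(x[0-9a-fA-F]{2}|.)')


def _unescape(match):
    body = match.group(1)
    if body.startswith("x") and len(body) == 3:
        return chr(int(body[1:], 16))  # \xHH: character with that code
    if body in _SIMPLE_ESCAPES:
        return _SIMPLE_ESCAPES[body]
    raise ValueError(f"unknown escape: \\{body}")


def interpret_value(raw):
    """Apply the proposed rule to the raw text that follows the tag."""
    text = raw.strip()  # unquoted leading/trailing white space is ignored
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        # Quoted: keep white space inside the quotes and process escapes.
        return _ESCAPE_RE.sub(_unescape, text[1:-1])
    return text  # unquoted: every character is literal


print(repr(interpret_value('   value   ')))     # 'value'
print(repr(interpret_value('   "value"   ')))   # 'value'
print(repr(interpret_value('   "value   ')))    # '"value'
print(repr(interpret_value('   value"   ')))    # 'value"'
print(repr(interpret_value(r'"a\tb\x20"')))     # 'a\tb '
```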

KenKundert commented 3 years ago

So with the proposal, the single space after a -, : or > becomes at least one space, and every indentation must be quoted? So for example, in the current version we allow:

>    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
> quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
> consequat.
> 1. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum
>    dolore eu fugiat nulla pariatur.
> 2. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia
>    deserunt mollit anim id est laborum.

But that now becomes:

> "   Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod" 
> tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,   
> quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo        
> consequat.                                                                     
> 1. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum       
> "   dolore eu fugiat nulla pariatur."                                          
> 2. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia   
> "   deserunt mollit anim id est laborum."

I am not crazy about that. Not only is it a pain to quote the indented lines, but it messes up the vertical alignment as you have an extra character at the front of some lines that will be deleted.

ethanfurman commented 3 years ago

That's a good point. So the first space is trimmed, any subsequent unquoted leading space is retained -- after all, it can be seen.

kalekundert commented 3 years ago

One comment I'd like to make is that, in my own use cases, it's not uncommon to have values that are quoted (and that would have to be double quoted with the proposed syntax). To be specific, one of the things I used NestedText for is to store inputs for unit tests, and sometimes those inputs take the form of Python literals meant to be eval-ed inside the test case. String literals, of course, are quoted. It's also not too hard to imagine values representing some sort of dialog or quotation starting and ending with quotes. Since one of the central goals of this format is to avoid YAML-esque rules about when quotes have to be added or escaped, it makes me uneasy to use such a common syntax for delimiting strings.

An alternative syntax is to use a character (e.g. <) to mark the end of a string. The rule would basically be: strip all rightmost whitespace characters. If the rightmost remaining character is '<', strip it as well. The hope is that this syntax would be less likely to conflict with real-world values. Some examples:

Typical use:

indent:␣␣␣␣␣<
# {'indent': '␣␣␣␣'}

Preserve trailing '<':

indent:␣␣␣␣␣<<
# {'indent': '␣␣␣␣<'}
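The rule can be sketched in a few lines. `strip_end_marker` is a hypothetical name, and its argument is assumed to be the value as currently read (everything after the `: ` tag).

```python
def strip_end_marker(value):
    """Strip trailing white space, then at most one trailing '<'."""
    value = value.rstrip()
    if value.endswith("<"):
        value = value[:-1]
    return value


print(repr(strip_end_marker("    <")))     # '    '
print(repr(strip_end_marker("    <<")))    # '    <'
print(repr(strip_end_marker("plain   ")))  # 'plain'
```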

I don't think it would be a good idea to apply any sort of per-line quoting to multiline strings. Multiline strings in general (e.g. in YAML, TOML, python, etc.) don't strip internal whitespace and don't support syntax to explicitly indicate where internal lines end. Quotes also have the potential to cause headaches for multiline strings containing prose. It's not uncommon for prose to contain a lot of trailing whitespace, because editors can use the presence of this whitespace to determine where paragraphs start and end (see :set formatoptions+=w in vim). If such a paragraph were to be dumped to a NestedText file, every line would have to be quoted, which would add a lot of visual clutter and make the paragraph hard to edit.

I'm on the fence about the overall idea of adding quotes. Significant trailing whitespace is definitely a usability issue, but any solution will force users to keep in mind quoting/escaping rules that come up only rarely. It's not really clear to me which is worse.

ethanfurman commented 3 years ago

Well, perhaps my "less preferred" suggestion is a better fit then: keep the trailing white space as the value, but if a nested value immediately shows up replace the white space value with the nested value.

treasurer:
    name: Fumiko Purvis
    address:␣␣
        > 3636 Buffalo Ave
        > Topika, Kansas 20692

address becomes """3636 Buffalo Ave\nTopika, Kansas 20692""". (I don't know if NestedText multiline strings include a trailing new-line, so I didn't add one in that example.)
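To make the suggestion concrete, here is a sketch that contrasts the existing behaviour with the suggested one. `resolve_value` and its `lenient` flag are hypothetical; the function assumes the inline text and the nested value have already been parsed separately.

```python
def resolve_value(inline, nested=None, lenient=False):
    """Combine the text after the tag with an optional nested value.

    inline  -- text that followed 'key: ' on the same line
    nested  -- value parsed from the indented lines below, if any
    lenient -- False: existing behaviour (error); True: the suggestion above
    """
    if nested is None:
        return inline
    if inline == "":
        return nested
    if lenient and inline.strip() == "":
        # Suggested: a white-space-only inline value yields to the nested one.
        return nested
    raise ValueError("invalid indentation: item already has a value")


address = "3636 Buffalo Ave\nTopika, Kansas 20692"
print(repr(resolve_value(" ", address, lenient=True)))
```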

ethanfurman commented 3 years ago

I think the key question here is: do you expect people to write NestedText files? Because if not, then the problem is unlikely to occur unless there is a bug in the library code that is doing the writing -- of course, there will still be the occasional problem of why what looks like an empty field evaluates as True (an empty string is falsey, but a string with white space is truthy).
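The truthiness difference mentioned above, shown in Python, where `address` stands for the value produced by a line such as `address:␣␣`:

```python
address = " "                 # a single space, invisible in most editors
print(bool(""))               # False
print(bool(address))          # True
print(bool(address.strip()))  # False
```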

KenKundert commented 3 years ago

I don't understand why you think this particular example illustrates a serious issue. It should occur only rarely and the error message shows the problematic line with an EOL marker, so the problem is relatively easy to see from the error message.

This seems like a small problem.
Your proposal addresses this one small problem by injecting a small inconsistency into the language.

The inconsistency is this:

before: a value cannot be specified twice.
after: a value cannot be specified twice unless the first time consists only of white space, in which case the first is ignored.

Neither problem is terrible, but of the two, I would most prefer not to have the inconsistency. In the first case, there is a mistake in your input, you get an error message, you resolve it, and the problem is gone and you never have to think about it again. In the second case, the inconsistency is always there: it has to be documented, it raises questions, it has to be remembered or re-learned.

I find the hidden white space at the end of the line problem more compelling, and Kale's proposal is a good one for end-of-line strings, but as Kale himself points out, it is problematic for multi-line strings. However, I'm not in favor of it for two reasons.

If we adopt it only for end-of-line strings, then the two types of strings are inconsistent. One can think of a multiline string as a collection of string items, each begins with a string tag ('> ') followed by an end-of-line string. That way of thinking exposes an obvious inconsistency: the end-of-line strings that are part of a multiline string do not treat the '<' at the end of line as special whereas the individual end-of-line strings do.
Right now all the values specified in NestedText are taken literally. There is no quoting and no escaping. Every character is taken as is without interpretation or judgement. I think this is an important strength of the language. It contributes to its simplicity. Treating '<' at the end of the line as a special case solves a real problem, but also injects an inconsistency that increases complexity and detracts from the language. It prevents us from saying every value is taken literally. My feeling is that in sacrificing that consistency the cure is worse than the problem.

At this point I am thinking the language should remain as it is. However, I could try to improve the error message. For example, I could replace leading and trailing spaces in the value of the displayed lines with ␣ symbols. In that way this particular error message becomes:

invalid indentation.
   4 «    address: ␣»
   5 «        > 3636 Buffalo Ave»
          ▲

Or perhaps I could distinguish this form of invalid indentation error from others and give a more specific message. Are these sufficient?
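The idea of marking leading and trailing spaces in a displayed value could be sketched like this. `show_edge_spaces` is a hypothetical helper, not the code NestedText uses to build its messages.

```python
def show_edge_spaces(value, marker="␣"):
    """Replace leading and trailing spaces of `value` with a visible marker."""
    body = value.strip(" ")
    if not body:
        return marker * len(value)  # the value consists only of spaces
    lead = len(value) - len(value.lstrip(" "))
    trail = len(value) - len(value.rstrip(" "))
    return marker * lead + body + marker * trail


print(f"   4 «    address: {show_edge_spaces(' ')}»")
print(show_edge_spaces("  3636 Buffalo Ave "))  # ␣␣3636 Buffalo Ave␣
```

Interior spaces are left alone, so only the characters that are hard to see get marked.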

ethanfurman commented 3 years ago

I think a specific error message would solve the problem -- maybe something like:

attempt to replace initial value of " " on line 4.
   4 «    address: ␣»
   5 «        > 3636 Buffalo Ave»

KenKundert commented 3 years ago

Okay, I have refined the error message.

error: test.nt, 6:
    invalid indentation. An indent may only follow a dictionary or list
    item that does not already have a value, which in this case consists
    only of whitespace.
           5 «    address:  »
           6 «        > 3636 Buffalo Ave»
                  ▲

With that I believe this issue is closed. Thanks for your feedback.

ethanfurman commented 3 years ago

Thank you for listening!

AndydeCleyre commented 1 month ago

To this closed issue, I'll add my two cents.

The main problems for me here are that:

I generally have my editors trim trailing white space
I generally can't see trailing white space when displaying files

The best solution I can think of so far:

Add a new multiline string tag, which must always have a matching end-of-line tag. A single multiline string could use either tag for each line.

Let's say | is the start tag and < the end tag. Then the following YAML:

matches:
  - trigger: ":ifm"
    replace: "if __name__ == '__main__':\n    "
  - trigger: ":p3"
    replace: "#!/usr/bin/env python3\n"

could be written as the following FutureNestedText:

matches:
  -
    trigger: :ifm
    replace:
      > if __name__ == '__main__':
      |     <
  -
    trigger: :p3
    replace:
      > #!/usr/bin/env python3
      >
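A sketch of how the two tags could be combined, under the assumptions stated in this comment (`|` as start tag, `<` as required end-of-line tag). `join_multiline` is a hypothetical function that takes lines whose indentation has already been removed.

```python
def join_multiline(lines):
    """Join the lines of a multiline string under the proposed syntax.

    '>' lines are taken literally; '|' lines must end with '<', and
    everything between the tag and that final '<' is the content.
    """
    parts = []
    for line in lines:
        if line == ">" or line.startswith("> "):
            parts.append(line[2:])
        elif line.startswith("| "):
            text = line.rstrip()  # white space after the '<' is ignored
            if not text.endswith("<"):
                raise ValueError(f"missing end-of-line tag: {line!r}")
            parts.append(text[2:-1])
        else:
            raise ValueError(f"not a multiline string line: {line!r}")
    return "\n".join(parts)


print(repr(join_multiline(["> if __name__ == '__main__':", "|     <"])))
print(repr(join_multiline(["> #!/usr/bin/env python3", ">"])))
```

With this rule, an editor that trims trailing white space cannot change the value of a `|` line, because the significant spaces sit before the `<`.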

EDIT: It might be worth noting that this wouldn't do anything about the super edge case of trailing white space on a line of a multiline key. Though theoretically it could gain the same ability if you allow the | & < lines for those . . . or possibly with a : instead of the <:

: key 1
:     the first key
| still the first key          :
    > value 1
: key 2: the second key
    - value 2a
    - value 2b

KenKundert commented 1 month ago

This suggestion seems to add significant complexity to address what is fundamentally an editor issue. Personally I configure my editor to show trailing white spaces but not to automatically delete them on the off chance that I want them. In my experience, the only time I want them is when entering long lines where I am using a trailing space to indicate that the line should be joined with the one below it.

Perhaps you can consider disabling the delete trailing white space feature of your editor when editing NestedText files.

AndydeCleyre commented 1 month ago

Good advice, but I'll note it doesn't address normal cat-ing or paging in the terminal and not seeing the spaces -- nor the problem of other folks opening and saving the file with their editor settings, without realizing the changes they've made to lines they may not have even looked at.
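For the terminal case, one workaround is a small check that makes trailing white space visible before a file is committed or shared. This is only a sketch; `trailing_whitespace_lines` is a hypothetical helper and the sample text is made up. Both spaces and tabs are shown with the same marker.

```python
def trailing_whitespace_lines(text, marker="␣"):
    """Return (line_number, visible_line) for lines ending in spaces or tabs."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip(" \t")
        if stripped != line:
            padding = marker * (len(line) - len(stripped))
            found.append((number, stripped + padding))
    return found


sample = "name: Fumiko Purvis\naddress:  \n    > 3636 Buffalo Ave\n"
print(trailing_whitespace_lines(sample))  # [(2, 'address:␣␣')]
```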

KenKundert commented 1 month ago

Yes, I acknowledge that, but I believe significant trailing white space are unusual and the extra clarity that this change would bring would be worth the increase in complexity in NestedText.

pierre-rouleau commented 1 month ago

I must say I'm a little confused now. Is the solution to keep significant trailing space?

KenKundert commented 1 month ago

Whoops, I dropped a word.

Yes, I acknowledge that, but I believe significant trailing white space is unusual and the extra clarity that this change would bring would not be worth the increase in complexity in NestedText.

pierre-rouleau commented 1 month ago

Thanks for clarifying, but to me it means that NestedText suffers from a hidden flag that's waiting to bite. There's already so many. The implementation might be simpler, but users might suffer. I see it as transferring the responsibility of ensuring reliability from the implementation to the user. I don't need another easy-to-forget detail. The problem is not editors, or VCS, or diff tools configurations. Too bad. I'll probably stay away.
