Tagless multiline strings

MichalMarsalek commented 2 months ago

Two years ago I created a nested format with strings, lists and dictionaries for my personal use that was surprisingly similar to NestedText. While it was a bit less elegant than NestedText and still required some escaping, there was one thing that was imo nicer than in NestedText - it supported unannounced multiline strings. In my understanding from quickly reading the docs, I don't think there would be any ambiguity in allowing

key: a dictionary value
     that spans
     multiple lines
another key: another value

and

- a list value
  that spans
  multiple lines
- another value

as shorter alternatives to

key:
  > a dictionary value
  > that spans
  > multiple lines
another key: another value

and

-
  > a list value
  > that spans
  > multiple lines
- another value

I'd love if this was added into the syntax if I'm not missing any clashes.

PS: I added NestedText to https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats . EDIT: Well, it was reverted. Apparently it needs a Wikipedia page first.

KenKundert commented 2 months ago

Michal, We considered this approach to multiline strings when we were first coming up with NestedText and rejected it for the following reasons:

The indentation for the multiline string value in a dict item would be excessive for long keys. This causes two problems. If one was trying to maintain a reasonable text width, the multiline string would be squashed into the few remaining columns. Also, each value in a dictionary would have an indent level that depended on the length of its key. For example:

       This is a very, very, very, very, very, very, very, very long key: Due to
                                                                          the
                                                                          indent
                                                                          this
                                                                          value
                                                                          starts
                                                                          in
                                                                          column
                                                                          75 and
                                                                          so can
                                                                          can
                                                                          only
                                                                          contain
                                                                          one or
                                                                          two
                                                                          words
                                                                          in
                                                                          each
                                                                          line.
       another key: Another value. This one has much more space than the 
                    previous one.  Here the string is can be much wider because 
                    the indent is much smaller.  But it might be confusing that 
                    two values in the same dictionary have vastly different 
                    indentation levels.

There is no clear demarcation of the end of the string, so one cannot distinguish between blank lines at the end of the value and blank NestedText comment lines that follow the string. We would have to choose between allowing multiline strings that end in blank lines and allowing blank lines as comments in the NestedText file. We could not support both.
I don't believe I can configure Vim to automatically indent the second line of the multiline value in the way needed by your proposal.

The approach we took (using the string tag '>') does not suffer from any of these problems, and also allows us to support blank lines as comments within a multi-line string. For example:

# part one ...
> The first line of the multiline string.

# part two ...
> The second line of the multiline string

As you say, your current proposal does not conflict with the existing syntax, so it would be possible to support both approaches. But we are really trying to keep the language small and simple, and having multiple ways of doing the same thing conflicts with that goal.

Thank you for your suggestion. I understand your desire to find an alternative to the current key-based approach to multiline strings. It seems artificial to me too. But despite spending considerable time on it, I have yet to come up with anything that provides one simple solution to all of the problems I considered.

-Ken

MichalMarsalek commented 2 months ago

Ken, thank you for taking the time to reply.

1. There's no reason to require the continuation lines' indentation to match the start of the first line of the text value. Just as anywhere else in NestedText, the amount of the added indentation is arbitrary, so a user might choose to align it, but usually you would stick to your normal 2-4 spaces.
```
another key: Another value. This one has much more space than the
    previous one.  Here the string is can be much wider because
    the indent is much smaller.  But it might be confusing that
    two values in the same dictionary have vastly different
    indentation levels.
```
2. While I'm aware that > allows you to do that, not allowing comments in the middle of a string value is not something I would consider a dealbreaker. Note that empty lines outside of a multiline string context would not be affected. I don't really know any language that does this (manually joining multiple single line string literals does not count). If a user desires to write comments inside their string values, they can utilize the > approach (which I'm not suggesting to remove - not only for backwards compat but also because my suggestion doesn't cover all strings).
3. See 1.

But we are really trying to keep the language small and simple, and having multiple ways of doing the same thing conflicts with that goal.

Well, there are already multiple ways to do the same things. And that's fine. One way that's really short and elegant and works for most cases and another that's a bit more verbose, but works even for all the edge cases (as is the case for dictionaries). When I was thinking about the necessary changes to the (JavaScript) parser, I came to the conclusion that it would be reasonably simple. Yes, you would no longer be able to determine the type of each line individually using its tag, but that's fine imo. What you need to do is:

1. Create a new line type: "string continuation".
2. When a line follows a "list item" or "dictionary item" line that carries an "end of line string" value, and the current line has larger indentation, flag the current line as "string continuation". You can do that since an indentation increase is otherwise illegal here.
3. When a line follows a "string continuation" line and has the same indentation, flag it as "string continuation" too. (If the current line has higher "apparent indentation", treat the "added" indentation as verbatim spaces and reset the indentation to that of the previous line.)
4. In all other cases, parse the line as you do now (including the detection of comment and empty lines, which happens only at this point).
5. Check the indentation levels and combine the lines just as you do now.
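The steps above can be sketched in Python roughly as follows. This is only an illustration under simplifying assumptions, not the NestedText parser: `classify` and `item_value` are hypothetical names, only list items, dictionary items, `>` string items, comments and blank lines are recognised, and trailing whitespace is ignored.

```python
def item_value(body):
    """Return (kind, rest-of-line value) for a tagged line; value is None if not applicable."""
    if body == "-" or body.startswith("- "):
        return "list item", body[2:]
    if body == ">" or body.startswith("> "):
        return "string item", None
    if ": " in body:
        return "dictionary item", body.split(": ", 1)[1]
    if body.endswith(":"):
        return "dictionary item", ""
    return "error", None


def classify(lines):
    """Assign a line type to every line, adding the proposed 'continuation' type."""
    result = []
    item_indent = None  # indent of the latest item that carries a rest-of-line value
    cont_indent = None  # indent fixed by the first continuation line
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        body = line.strip()
        if item_indent is not None and cont_indent is None and indent > item_indent:
            # Step 2: an indentation increase after a rest-of-line value.
            cont_indent = indent
            result.append(("continuation", line[indent:]))
        elif cont_indent is not None and indent >= cont_indent:
            # Step 3: extra indentation is kept as verbatim spaces.
            result.append(("continuation", line[cont_indent:]))
        elif body == "":
            result.append(("blank", ""))
        elif body.startswith("#"):
            result.append(("comment", body))
        else:
            # Step 4: ordinary tag-based detection; this ends any continuation.
            kind, value = item_value(body)
            result.append((kind, body))
            cont_indent = None
            item_indent = indent if value else None
    return result


for kind, text in classify(["- a list value", "  that spans", "  multiple lines", "- another value"]):
    print(f"{kind:16} {text!r}")
```

Note that blank and comment lines with less indentation are reported as such and do not end the continuation in this sketch.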

The fact that the Pretty printing section even exists in the docs imho really supports my case. The multiline strings serialization is the only thing that is not pretty in NestedText.

MichalMarsalek commented 2 months ago

I just realised that empty/comment lines are still possible inside a tagless multiline string, it just has to be indented less than the string:

key: First line of a multiline string value.
     Second line of a multiline string value.
# Since this line cannot be a part of the string value, it is a comment.
     Third line of a multiline string value.

That being said, I would maybe find this form a bit confusing.

MichalMarsalek commented 2 months ago

I see that there's no .NET library linked. I can write and maintain one, maybe this weekend. I will probably include this proposal as a toggleable option to experiment with it a bit and see how it works. If my proposal does not end up being added to the standard, my library will of course follow the standard by default (never emitting the tagless form and throwing when it is encountered while parsing), but I will be able to use it with this proposal for my personal projects by switching a boolean flag.

LewisGaul commented 2 months ago

I just realised that empty/comment lines are still possible inside a tagless multiline string, it just has to be indented less than the string:
key: First line of a multiline string value.
     Second line of a multiline string value.
# Since this line cannot be a part of the string value, it is a comment.
     Third line of a multiline string value.

Just because you can, doesn’t mean you should :)

From https://nestedtext.org/en/latest/file_format.html:

Each line in a NestedText document is assigned one of the following types: comment, blank, list item, dictionary item, string item, key item or inline. Any line that does not fit one of these types is an error.

A fundamental part of the design is that you can tell what type of line you're looking at without the context of any of the surrounding lines. This holds true whatever you have in your string, e.g. a line: > foo: bar is part of a multiline string, but if you drop the > it could be a dict key-value, and this is only determinable by attributing further significance to the preceding whitespace in the context of previous lines. This fact of the design makes parsing simpler, both for software and for humans.
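To illustrate the context-free property, here is a simplified Python sketch of a classifier that looks at one line only. It is an approximation of the rules in the file format documentation, not the reference implementation, and `line_type` is a hypothetical name.

```python
def line_type(line: str) -> str:
    """Classify a single line using nothing but its own text (simplified)."""
    text = line.strip()
    if text == "":
        return "blank"
    if text.startswith("#"):
        return "comment"
    # A tag is either alone on the line or followed by a space.
    for tag, kind in (("-", "list item"), (">", "string item"), (":", "key item")):
        if text == tag or text.startswith(tag + " "):
            return kind
    if text[0] in "[{":
        return "inline"
    if ": " in text or text.endswith(":"):
        return "dictionary item"
    return "error"


print(line_type("> foo: bar"))  # string item
print(line_type("foo: bar"))    # dictionary item
```

With tagless continuation lines, the second call could no longer be answered without knowing the preceding lines.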

MichalMarsalek commented 2 months ago

Just because you can, doesn’t mean you should :)

Yes, I agree, as I said I find this confusing.

A fundamental part of the design is that you can tell what type of line you're looking at without the context of any of the surrounding lines.

Sure, I am aware of this, I even mention this in my comment.

It's common for languages to have parser modes. Once you enter a multiline string, some things behave slightly differently and that's fine. I don't think my proposal would make it harder for humans to read as it strives to be very intuitive.

Also, I'm not sure it is that fundamental. Sure it holds (perhaps as a consequence of making sure no escaping is needed) and is kinda convenient, but making an exception seems like nothing fundamental would break in the long run. Well, certainly it is less fundamental than "no scalar types except strings supported" or "no escaping needed".

KenKundert commented 2 months ago

Perhaps I am not being clear. So let me take a different approach.

Consider this example.

key 1: line 1 of value 1
       line 2 of value 1

key 2: value 2

Is the blank line that follows line 2 of value 1 included in the value or not?

In other words, what is the value of key 1? Is it ...

> line 1 of value 1
> line 2 of value 1
>

Or is it ...

> line 1 of value 1
> line 2 of value 1

As a second question, you say that the indentation of the subsequent lines need not match the indent level of the first line. So this is valid:

key 1: line 1 of value 1
    line 2 of value 1

Okay, if that is valid, then how does one represent an out-dented paragraph?

Consider this example:

aa: header ...
    point 1
    point 2

Does this represent:

aa:
    > header ...
    > point 1
    > point 2

or does it represent:

aa:
    > header ...
    >     point 1
    >     point 2

MichalMarsalek commented 2 months ago

1. The latter. Lines with decreased indentation are never part of the value. They are either ignored (if they are white space only or # comments) or end the value (otherwise).
2. The former. My proposal is not meant as a replacement of > - I'm aware that it cannot represent all string values.

KenKundert commented 2 months ago

But didn't you say

There's no reason to require the continuation lines' indentation to match the start of the first line of the text value. Just as anywhere else in NestedText, the amount of the added indentation is arbitrary, so a user might choose to align it, but usually you would stick to your normal 2-4 spaces.

Doesn't that contradict your # 1 response?

Isn't that what I did with this example:

key 1: line 1 of value 1
    line 2 of value 1

MichalMarsalek commented 2 months ago

I'm sorry, what exactly are you referring to? What are the two statements that are contradictory?

KenKundert commented 2 months ago

The latter. Lines with decreased indentation are never part of the value. They are either ignored (if they are white space only or # comments) or end the value (otherwise).

This is a really difficult conversation because your proposal is so squishy and poorly defined. That is why I am trying to communicate with concrete examples. Allow me to try again. You gave the following example above:

another key: Another value. This one has much more space than the 
    previous one.  Here the string is can be much wider because 
    the indent is much smaller.  But it might be confusing that 
    two values in the same dictionary have vastly different 
    indentation levels.

What does that actually represent? It could be:

another key: 
    > Another value. This one has much more space than the 
    > previous one.  Here the string is can be much wider because 
    > the indent is much smaller.  But it might be confusing that 
    > two values in the same dictionary have vastly different 
    > indentation levels.

Or it could be:

another key: 
    > Another value. This one has much more space than the 
    >     previous one.  Here the string is can be much wider because 
    >     the indent is much smaller.  But it might be confusing that 
    >     two values in the same dictionary have vastly different 
    >     indentation levels.

Which is it? And why should it be one rather than the other? It seems ambiguous to me.

MichalMarsalek commented 2 months ago

Communicating with concrete examples is fine by me, but if you prefer, I can try to write a formal spec instead, although I believe most questions should be clarified by the pseudocode I provided in my comment.

another key: Another value. This one has much more space than the 
    previous one.  Here the string is can be much wider because 
    the indent is much smaller.  But it might be confusing that 
    two values in the same dictionary have vastly different 
    indentation levels.

just as

another key: Another value. This one has much more space than the 
             previous one.  Here the string is can be much wider because 
             the indent is much smaller.  But it might be confusing that 
             two values in the same dictionary have vastly different 
             indentation levels.

is the equivalent of

another key: 
    > Another value. This one has much more space than the 
    > previous one.  Here the string is can be much wider because 
    > the indent is much smaller.  But it might be confusing that 
    > two values in the same dictionary have vastly different 
    > indentation levels.

Anything else would violate

the amount of the added indentation is arbitrary

Especially your other option for its meaning

another key: 
    > Another value. This one has much more space than the 
    >     previous one.  Here the string is can be much wider because 
    >     the indent is much smaller.  But it might be confusing that 
    >     two values in the same dictionary have vastly different 
    >     indentation levels.

doesn't make any sense to me. The spaces just denote a block (indentation). How could they be part of the resulting value? That would imply

another key: Another value. This one has much more space than the 
previous one.  Here the string is can be much wider because 
the indent is much smaller.  But it might be confusing that 
two values in the same dictionary have vastly different 
indentation levels.

is equivalent to

another key:
  > Another value. This one has much more space than the 
  > previous one.  Here the string is can be much wider because 
  > the indent is much smaller.  But it might be confusing that 
  > two values in the same dictionary have vastly different 
  > indentation levels.

which is just some ambiguous nonsense.

KenKundert commented 2 months ago

Okay, so how do you represent the following?

key 1:
    >
    > value 1
key 2:
    > value 2 line 1
    >     value 2 line 2
key 3:
    >
    > - value 3
key 4:
    >
    > value 4: four

MichalMarsalek commented 2 months ago

None of those are representable using the proposed syntax, so you'd fall back to the general > version.

MichalMarsalek commented 2 months ago

You already have two ways to represent strings:

key: string value

and

key:
  > string value

My proposal is to take the one that cannot represent every string (the first one) and make it able to represent some more kinds of strings by allowing the value to continue on the next line (but of course it won't be every string).

Both in the current version and in a version where my proposal is implemented, the following statements hold:

There are two ways to represent a string (some strings).
The first way doesn't cover all strings, while the second does.

So these two points shouldn't be reasons for rejecting the idea.

This is different than https://github.com/KenKundert/nestedtext/issues/34 which (apart from all 3 suggested syntaxes being ambiguous) suggested a third syntax. My proposal just extends one of the syntaxes by making use of a syntax which is currently illegal (a nested block following a list/dictionary item) and thus has no assigned meaning.

KenKundert commented 2 months ago

Okay, I think I understand. Let me repeat back what I understand so that you can confirm that I have the idea. But, be aware that I have generalized it a bit.

You propose enhancing end-of-line strings to allow them to extend beyond one line. Something like this:

key: line one of value
     line two of value
     line three of value

More specifically, if the line after an end-of-line string is indented, it is considered a continuation of the string. The beginning of the second line sets the indentation level for subsequent lines and the value continues until the indentation is abandoned. So the following is valid:

a long key: line one of value
     line two of value
     line three of value

The value in both of these examples resolves to:

> line one of value
> line two of value
> line three of value

Empty lines in the value are indistinguishable from blank lines that are simply ignored by NestedText, so we have to eliminate one or the other. Conceivably it would be possible to disable the discarding of blank lines by NestedText within or adjacent to continuations. Alternately, we can simply disallow empty lines in values rendered with continuations, in which case, we have to decide by fiat whether such strings get a terminal newline.

Tag recognition is disabled after encountering an indent after an end-of-line string (this requires recognizing an end-of-line string, so the first line of the value must not be an empty line). So, tags contained in continuation strings do not present a problem.

The level of indentation is set by the first non-empty continuation line. Subsequent lines may be further indented, indicating an indentation in the value itself.
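A minimal Python sketch of this reading, assuming the value contains no empty lines and gets no terminal newline (both are open choices, as noted above). `resolve` is a hypothetical helper, not part of any NestedText library.

```python
def resolve(first, continuation):
    """Join a rest-of-line string with its continuation lines.

    The first continuation line fixes the indentation level; any extra
    indentation on later lines is kept as part of the value.
    """
    if not continuation:
        return first
    level = len(continuation[0]) - len(continuation[0].lstrip(" "))
    parts = [first]
    for line in continuation:
        if line[:level].strip():
            raise ValueError("continuation line is indented less than the first one")
        parts.append(line[level:])
    return "\n".join(parts)


aligned = resolve("line one of value", ["     line two of value", "     line three of value"])
shallow = resolve("line one of value", ["  line two of value", "  line three of value"])
print(aligned == shallow)  # True
```

Because the level is taken from the first continuation line, the amount of indentation chosen by the user does not affect the resulting value.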

MichalMarsalek commented 2 months ago

You have the idea, but:

the value continues until the indentation is abandoned.

Yes, but if a line with decreased indentation is encountered that would otherwise be ignored (blank/comment) it is still ignored and doesn't terminate the value.

Conceivably it would be possible to disable the discarding of blank lines by NestedText within or adjacent to continuations. Alternately, we can simply disallow empty lines in values rendered with continuations

Out of these two I specifically propose the former. As long as the lines following the value have matching indentation (equal or larger), you disable every line type detection. This includes tags, but also detection of empty/comment lines.

The level of indentation is set by the first non-empty continuation line.

I'm not 100% sure about this, but I'd say it's rather set by the first continuation line with increased indentation (can be empty).

The set of representable strings is: all strings such that:

The first line is not empty.
The second line does not start with a space.

There's also an option to restrict the set of supported strings to values such that each line is nonempty and starts with a nonspace, although I don't really find this necessary.
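The two conditions, and the restricted variant, can be written as small Python predicates. This is only a sketch of the rules as stated here; the function names are hypothetical, and values are assumed to use "\n" as the line separator.

```python
def representable(value: str) -> bool:
    """General version: first line non-empty, second line not starting with a space."""
    lines = value.split("\n")
    if lines[0] == "":
        return False
    if len(lines) > 1 and lines[1].startswith(" "):
        return False
    return True


def representable_restricted(value: str) -> bool:
    """Restricted version: every line is non-empty and starts with a non-space."""
    return all(line != "" and not line.startswith(" ") for line in value.split("\n"))


# Values taken from the examples earlier in the thread.
print(representable("\nvalue 1"))                             # False
print(representable("value 2 line 1\n    value 2 line 2"))    # False
print(representable("header ...\npoint 1\npoint 2"))          # True
```

Anything rejected by these checks would still need the general > form.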

LewisGaul commented 2 months ago

Ignoring feasibility I just think this is a very bad idea for readability (and simplicity of parsing). The rules for how these strings are parsed are very non-obvious.

MichalMarsalek commented 2 months ago

Ignoring feasibility I just think this is a very bad idea for readability (and simplicity of parsing). The rules for how these strings are parsed are very non-obvious.

The added complexity for the parser is imo not significant. The added complexity for humans is bigger. This can be vastly reduced by using the restricted version:

There's also an option to restrict the set of supported strings to values such that each line is nonempty and starts with a nonspace

In this version everything that's visually connected (no empty lines, strictly equal indentation) is part of the string, everything else (empty lines, different indentation) is not. Would you be ok with this version?

LewisGaul commented 2 months ago

To be honest I'm -1 on the whole proposal, personally I don't see the benefit and I think it has multiple significant downsides, where complexity is a major one for a language that has simplicity as an explicit goal.

MichalMarsalek commented 2 months ago

I think it has multiple significant downsides, where complexity is a major one for a language that has simplicity as an explicit goal.

Can you elaborate on the multiple significant downsides? The only downside (which is far from significant imo - but of course that's subjective) I see mentioned in this thread is the inability to parse each line individually. I can see where the concerns about feasibility and questions about the actual scope are coming from. But regarding complexity, it's really hard for me to see where those concerns are coming from. I'm positive that if you show

- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
- Duis sed condimentum ex. Cras eleifend est ante.
  Suspendisse gravida a enim a imperdiet.
  In vel venenatis purus.
- Vivamus ultrices eleifend maximus.
  Aenean egestas ligula vitae eleifend scelerisque.

to someone (especially a nonprogrammer or a person that does not know NestedText) the meaning will be obvious to them (3 text items, some of them are multiline). It's very familiar syntax that just looks as if someone was taking notes in plaintext.

LewisGaul commented 2 months ago

Can you elaborate on the multiple significant downsides?

Two ways to do the same thing (yes that's technically already true, but this goes from "trivially simple case + general case" to "general case that has a bunch of edge cases that are either ambiguous or simply unsupported + fully general case")
Ambiguity in general around how whitespace should be treated (some examples above), whether comments should work, ...
Additional complexity - dropping "you can tell what type of line you're looking at without the context of any of the surrounding lines" is not something to be taken lightly, it complicates parsing, the ability to describe how the document is interpreted, visual clarity...

Here's another example that becomes far harder to guess what it means with your proposal:

: foo: 123
  > bar: y
: bar: 456
  baz: z

MichalMarsalek commented 2 months ago

Two ways to do the same thing (yes that's technically already true, but this goes from "trivially simple case + general case" to "general case that has a bunch of edge cases that are either ambiguous or simply unsupported + fully general case")

This is just a matter of how you frame it. Me calling it "special case + general case" or you calling it "general case that's not really general + fully general case" does not change what it is.

Ambiguity in general around how whitespace should be treated (some examples above), whether comments should work, ...

There's no ambiguity in the final product. I admit that my original description was not very precise, but I addressed the questions in the meantime.

Here's another example that becomes far harder to guess what it means with your proposal:

This is out of scope. My proposal only covers continuations of rest-of-line list and dictionary values.

LewisGaul commented 2 months ago

I feel like you're missing my point - I'm explaining my reasons for being -1 on the proposal. Even if you don't agree with my perspective on the points I made, they're still my reasons :)

There's no ambiguity in the final product.

I didn't mean "inability to specify how to parse", I meant lack of clarity for end users trying to read nestedtext files. My example snippet was serving to emphasise that point.

MichalMarsalek commented 2 months ago

I am not missing your point - I recognise that these are your personal reasons and that you can think whatever you want. :) I just wanted to make sure that your reasons are based on the properties of the actual proposal and not on some misunderstanding in our communication (I'm not a native speaker + my initial description was not precise). Thank you for the elaboration.

I meant lack of clarity for end users trying to read nestedtext files. My example snippet was serving to emphasise that point.

Ok, but you provided an example that is irrelevant to the proposal, which leads me to think that my proposal is still not clear.

LewisGaul commented 2 months ago

No hard feelings, don't worry :)

I meant lack of clarity for end users trying to read nestedtext files. My example snippet was serving to emphasise that point.

Ok, but you provided an example that is irrelevant to the proposal, which leads me to think that my proposal is still not clear.

This is where I feel like we're not really on the same page. It very much is relevant to the proposal - any time you add complexity (e.g. supporting a new representation) you have to think about existing representations that it takes away clarity from.

To be more explicit about my example and emphasise some more cases where there is no obvious "correct behaviour" (which is a sign of complexity and potential confusion for users)...

You're proposing allowing the following:

# new format
A: B
  C

# old format
A:
  > B
  > C

Would you allow this?

# new format
A: B
  C: D

# old format
A:
  > B
  > C: D

This is already a lot less visually clear, a user could reasonably think it's just dodgy indentation that actually represents:

# old format
A: B
C: D

What about this, should it be allowed? (even if you have the answer, it wouldn't be clear to the user)

# new format
: A: B
  C

# old format
: A: B
  > C

This gets even more unclear if you allow something like:

# new format
: A: B
  C: D

# old format
: A: B
  > C: D

What about this, is it valid? A user might reasonably assume so... but it's not very clear to read. Does the blank line matter? What if there are multiple blank lines?

# new format
: A: B

C: D

# old format
: A: B
  >
C: D

The point is that once you allow a new syntax then it makes surrounding syntax less clear, making it harder for users to guess what's valid and/or how to interpret a file.

Given this is valid:

: A: B
  > C: D

It's a good thing that this is invalid:

A: B
  C: D

The two have very different meaning, and the syntax should make that clear.

MichalMarsalek commented 2 months ago

No hard feelings, don't worry :)

Based on past experience I usually worry about that a lot. Online communication is hard. Fortunately we have emoticons. :)

I understand your point now, thank you for explaining with that set of examples. I am still +1, though. :D

KenKundert commented 2 months ago

I must say that I ended up liking this proposal more than I initially expected to. However, I am not currently inclined to add it to NestedText.

There are two things I like about it. First, the format feels very natural. When I take notes for myself I often use a very similar style. Second, it makes NestedText a bit more compact by allowing one to start a multiline string on the same line as a dict key or list tag.

On the negative side are the non-obvious issues with white space. I think that adds significant complexity for the user. Specifically, leading, trailing, and internal empty lines all can cause issues. In addition, there is a weird constraint on indentation. This constraint prevents out-denting, a common idiom. For example, it is not possible to represent the following value:

> ringed planets:
>     Jupiter
>     Saturn
>     Uranus
>     Neptune

It also creates new ways for people to make mistakes that are not flagged as errors. For example, take any dict or list item that contains nested data and add a space or two to its tag and it changes the nested data into a long string. Consider the following (␣ is used to represent trailing spaces):

key1:␣␣
    key11: value11

This is an error in the current version of NestedText because key1 appears to have two values, an end-of-line string " ", and a nested dictionary. But with the current proposal this becomes valid and the value of key1 is a multiline string rather than a dictionary:

key1:
    >␣␣
    > key11: value11

Even without resorting to trailing spaces, Lewis demonstrated that allowing tags in the continuation lines results in some pretty surprising and hard to understand outcomes. I know that allowing tags in continuation lines was my addition, but if you don't allow them you also get some surprising constraints, such as the following:

umbrellas: Mabel the Cat was adamant that Harry recognize the usefulness of
    umbrellas for all wet weather: as protection against rain, sleet, and snow.

This is an error because "umbrellas" has two values, an end-of-line string and a nested dictionary. Removing the : after weather converts this to a valid continued string.

In summary, this new type of string offers a more natural syntax when compared to the existing multiline string. However, it is somewhat constrained and so cannot completely replace it. And its limitations and constraints are non-obvious and can result in surprising and difficult to understand behavior.

The proposal gives me a YAML vibe. My opinion is that the developers of YAML tried to allow a natural style for representing structured data, but this results in numerous ambiguities that they addressed by adding a seemingly endless number of forms and rules. The result is a language that even experts struggle with. I am trying very hard to avoid this with NestedText.

MichalMarsalek commented 2 months ago

This constraint prevents out-denting, a common idiom. For example, it is not possible to represent the following value:
ringed planets:
    Jupiter
    Saturn
    Uranus
    Neptune

Inability to express this is indeed unfortunate. I don't have a solution for this except suggesting making NestedText require a fixed indentation (specifically 2 spaces so that the continuation lines nicely align with the first line in a list item). That's a breaking change which I understand is extremely unlikely to happen. Or the 2-space indentation requirement could apply only to this new form of multiline strings, but that introduces inconsistency to the language.

It also creates new ways for people to make mistakes that are not flagged as errors.

That's an inherent consequence of extending a syntax - the set of all feasible strings stays the same, but the set of strings valid in a language grows. Then, inevitably, that loses a level of error detection, because some previous errors are now valid.

I know that allowing tags to continuation lines was my addition

No, that was my idea all along. I tried to explain multiple times that once you are in a multiline context, you disable all tag detection and interpret all lines verbatim (after removing indentation), until you return to a lower level of indentation. Not sure where the idea that I don't want to allow tags in the string value came from.
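A minimal sketch of that rule in Python, restricted to a flat list of strings. The function name `parse_flat_list` is hypothetical and this is not how any NestedText implementation works; it only illustrates "tag detection is off while indented, back on when the indentation returns to the item's level".

```python
def parse_flat_list(text: str) -> list[str]:
    """Parse a top-level list whose items may continue on indented lines.

    Sketch only: handles '- value' items at column 0 and nothing else.
    """
    items: list[list[str]] = []
    cont_indent = None  # indentation of the current item's continuation lines
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are simply skipped in this sketch
        depth = len(line) - len(line.lstrip(" "))
        if depth == 0:
            # Back at the list's own level: tags are recognized again.
            if not line.startswith("- "):
                raise ValueError(f"line {lineno}: expected '- value'")
            items.append([line[2:]])
            cont_indent = None
        else:
            # Indented line: taken verbatim, no tag detection at all.
            if not items:
                raise ValueError(f"line {lineno}: continuation without an item")
            if cont_indent is None:
                cont_indent = depth  # the first continuation line sets the indent
            if depth < cont_indent:
                raise ValueError(f"line {lineno}: inconsistent indentation")
            items[-1].append(line[cont_indent:])
    return ["\n".join(lines) for lines in items]
```

With this rule, a continuation line such as `  - not a tag: just text` becomes part of the string value rather than a nested list item or dictionary item.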

On the negative side is the non-obvious issues with white space. I think that adds significant complexity for the user. Specifically, leading, trailing, and internal empty lines all can cause issues.

As a last effort to save this proposal from getting lost and forgotten in the history of GitHub issues: Isn't this mitigated by only implementing the restricted version of my proposal, in which only strings where each continuation line starts with a non-space (and thus is non-empty) are allowed in this representation? The only problem I have with this is that currently single line strings starting with spaces are allowed in the shorthand "end-of-line" representation. This is actually something I dislike about current NestedText as it is inconsistent: in shorthand dictionary/list syntax leading and trailing whitespace is not significant but in shorthand text syntax, whitespace suddenly is significant. Why is that? Honestly this feels like it was allowed just because it is not syntactically ambiguous, rather than some conscious design choice. It would be reasonable to disallow this (another breaking change though), so that the shorthand version is allowed precisely for strings where each line (including the first) has 0 leading spaces.

KenKundert commented 2 months ago

The restricted proposal basically limits this new form of multiline string to a single paragraph of text, and to avoid issues with colons, this new string probably needs to be able to contain tags.

Presumably this is your intent, correct?

I don't find that tremendously compelling. A whole new form of string dedicated to handling only one specific form of text: a single paragraph. It cannot handle code. It cannot handle multiple paragraphs. It cannot even handle indentation. And it is offered as an alternative to a generic form of string that does a good job of handling all strings without restriction. The benefit of this new form does not obviously outweigh the cost of providing yet another type of string.

As for your comment about the inconsistency between end-of-line strings and inline strings, I can assure you that both the rest-of-line strings and the multiline strings in NestedText were designed to be as general as possible to avoid the need for quoting and escaping. It was a primary design goal of the language. In both cases it was made possible by simply accepting all characters that follow the tag. This is not possible for inline strings where the strings are embedded in syntax. As a result, the inline strings are more restricted than the other two forms. The inline forms are for convenience only and are completely optional. If you cannot live with the restrictions, you simply can avoid the inline forms. Given that inline strings are necessarily restricted anyway, and that the inline forms themselves are optional, it was decided to ignore both leading and trailing spaces on the keys and values to allow people to add extra spaces as they saw fit to make their code more readable.

This last comment also demonstrates that your proposed tag-less multiline strings are not simply end-of-line strings that extend over multiple lines. End-of-line strings can contain any character other than a newline, whereas the tag-less multiline strings cannot contain leading spaces or empty lines. Thus, they are a fourth type of string.

MichalMarsalek commented 2 months ago

Presumably this is your intent, correct?

Yes.

As for your comment about the inconsistency between end-of-line strings and inline strings, I can assure you that both the rest-of-line strings and the multiline strings in NestedText were designed to be as general as possible to avoid the need for quoting and escaping. It was a primary design goal of the language. In both cases it was made possible by simply accepting all characters that follow the tag. This is not possible for inline strings where the strings are embedded in syntax. As a result, the inline strings are more restricted than the other two forms. The inline forms are for convenience only and are completely optional. If you cannot live with the restrictions, you simply can avoid the inline forms. Given that inline strings are necessarily restricted anyway, and that the inline forms themselves are optional, it was decided to ignore both leading and trailing spaces on the keys and values to allow people to add extra spaces as they saw fit to make their code more readable.

I think you read my comment backwards. I understand the choice to ignore whitespace in inline strings. What I don't understand is why whitespace is significant in end-of-line strings. As I said, it feels like it was allowed simply because the syntax supports it. It would be perfectly fine to make it insignificant since (paraphrasing what you said): The end-of-line form is for convenience only and is completely optional. If you cannot live with the restrictions, you simply can avoid the end-of-line form (and use the > form). Presumably the inconsistency stems from the fact that end-of-line strings predate inline strings.

This last comment also demonstrates that your proposed tag-less multiline strings are not simply end-of-line strings that extend over multiple lines.

How so?

The value

> abc

is a prefix of the value

> abc
> def

and the representation of the first value as an item of a list

- abc

is a prefix of the representation of the second value as an item of a list

- abc
  def

And there is no other way to represent the "abc" value as a tagless string. Single line strings are a subset of single paragraph strings. There is no fourth type of string here. Single line strings are represented in the proposed "tagless string" representation exactly the same as in the "end-of-line" representation, thus "tagless string" only extends "end-of-line".

KenKundert commented 2 months ago

With inline strings, conventional style requires a space before the value. If one were trying to line up values between multiple lines, the style may result in leading and trailing spaces on both keys and values. These extra spaces are dictated by style and not a desire to represent spaces actually in the key or value. It is for this reason that leading and trailing spaces are ignored with inline strings.

The situation is different with end-of-line strings. If the user adds leading or trailing spaces to a value we must assume they did so intentionally. There is no case where a particular style requires leading or trailing spaces.

This last comment also demonstrates that your proposed tag-less multiline strings are not simply end-of-line strings that extend over multiple lines. How so?

Each of the 4 types of strings has different constraints.

- end-of-line strings can contain any character other than a newline
- multiline strings can contain any character
- inline strings cannot contain leading or trailing spaces, newlines, or []{},
- tag-less multiline strings cannot contain leading spaces on any line except perhaps the first, nor may they contain leading, trailing, or adjacent newlines

Though I guess you can consider tag-less multiline strings as a redefinition of the end-of-line string

tag-less strings can contain any character other than
- leading spaces on any line other than the first
- leading, trailing, or adjacent newlines

From that perspective there would only be three again, as you say.

MichalMarsalek commented 2 months ago

Though I guess you can consider tag-less multiline strings as a redefinition of the end-of-line string

tag-less strings can contain any character other than

leading spaces on any line other than the first

leading, trailing, or adjacent newlines

From that perspective there would only be three again, as you say.

Yes, that's the way I think and speak about it from the very beginning. You even described it like that in your comment 2 days ago.

KenKundert commented 2 months ago

Okay, let me summarize.

You are proposing that we extend end-of-line strings to tag-less strings that have the following constraints:

tag-less strings can contain any character other than
  - leading spaces on any line other than the first
  - leading, trailing, or adjacent newlines

This allows a single simple paragraph to be specified. In this case, simple implies that only the first line may be indented. I believe that this perspective gives the user a simple mental model of what is allowed and so would be considered easy to understand despite what otherwise might be considered non-obvious constraints.
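The constraints above are small enough to state as a predicate. The following is only a sketch of that mental model; the name `fits_tagless_form` is hypothetical and not part of any NestedText library.

```python
def fits_tagless_form(value: str) -> bool:
    """Return True if value could be written as a tag-less string.

    Encodes the two constraints listed above:
      - no leading spaces on any line other than the first
      - no leading, trailing, or adjacent newlines
    """
    lines = value.split("\n")
    # A leading, trailing, or doubled newline shows up as an empty line
    # once the value is split on newlines.
    if len(lines) > 1 and any(line == "" for line in lines):
        return False
    # Only the first line may start with a space.
    return not any(line.startswith(" ") for line in lines[1:])
```

Under these rules `"    indented first line\nsecond line"` qualifies, while `"first\n    indented second"` and `"para one\n\npara two"` do not and would still need the `>` form.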

This proposal has the following potential issues:

The parsing may become a bit more difficult. Today each line has a tag that determines its type, with this proposal the type of some lines may depend on the previous line.
Unlike with the traditional NT multiline string, the first line may not line up with subsequent lines.
We create new ways for people to make mistakes that are not flagged as errors. For example:
- extra space converts nested dictionary to tagless multiline string
```
key1:␣␣
    key11: value11
```
- Lewis's examples and other examples like them, such as
```
key: abc
    - def
```

I do not believe any of these issues should be considered fatal to the proposal. Probably the most significant criticism is that it accepts content that looks like valid NT structured data but, because of a mistake, is not, and interprets it in a way that completely differs from what may have been intended.

Am I missing anything?

KenKundert commented 2 months ago

The restriction to one paragraph seems a bit heavy. We could reduce the impact of that restriction by introducing a new tag, say '+', that combines adjacent tag-less multiline strings into a single string. For example:

umbrellas:
    +     Harry the Dog and Mabel the Cat were having an impassioned
      argument about umbrellas: are umbrellas properly to be used only for
      rain?
    +     Mabel the Cat was adamant that Harry recognize the usefulness of
      umbrellas for all wet weather: as protection against rain, sleet, and
      snow.
    +     "But why limit it, then, to wet weather?" Harry wanted to know.
      "Sun too beats down: is not an umbrella also appropriate protection
      against sun?"
    +
    +     Mabel was having none of it: she remained unmoved.

would be equivalent to:

    >     Harry the Dog and Mabel the Cat were having an impassioned
    > argument about umbrellas: are umbrellas properly to be used only
    > for rain?
    >     Mabel the Cat was adamant that Harry recognize the usefulness of
    > umbrellas for all wet weather: as protection against rain, sleet, and
    > snow.
    >     "But why limit it, then, to wet weather?" Harry wanted to know.
    > "Sun too beats down: is not an umbrella also appropriate protection
    > against sun?"
    >
    >     Mabel was having none of it: she remained unmoved.

But now that I have said it, doing so would make traditional multiline strings redundant. Anything that could be expressed in a traditional multiline string could be expressed with tagless multiline strings and this new joining tag.
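The redundancy claim can be illustrated with a small round trip in Python. This is a sketch under the assumption that each line of a string is written as its own '+' item and that a bare '+' stands for an empty line; the '+' tag is hypothetical, and so are the helper names `to_join_form` and `from_join_form`.

```python
def to_join_form(value: str, indent: str = "") -> str:
    """Write every line of value as its own hypothetical '+' item."""
    return "\n".join(
        f"{indent}+ {line}" if line else f"{indent}+"
        for line in value.split("\n")
    )


def from_join_form(text: str) -> str:
    """Join a run of '+' items back into a single string."""
    parts = []
    for line in text.split("\n"):
        stripped = line.lstrip(" ")  # drop the block's own indentation
        if stripped == "+":
            parts.append("")            # bare tag: empty line
        elif stripped.startswith("+ "):
            parts.append(stripped[2:])  # everything after the tag, verbatim
        else:
            raise ValueError(f"not a '+' item: {line!r}")
    return "\n".join(parts)
```

Because everything after `+ ` is kept verbatim, leading spaces, empty lines, and trailing spaces all survive the round trip, which is exactly what the `>` form already provides.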

MichalMarsalek commented 2 months ago

Am I missing anything?

I believe you provided a complete summary.

MichalMarsalek commented 2 months ago

Your counterproposal looks interesting, but I don't fully understand it.

Is it possible to use the + tag to extend a value which starts as an end-of-line string? If not, then this is not just reducing the impact, but rather a completely new and different syntax and I don't like it, as finding a way to allow the current end-of-line values to continue on the next lines is the main goal of this whole proposal. If yes, then I'm curious how, I don't see how it could work. The example
```
- Lorem ipsum dolor sit amet, consectetur adipiscing elit.
- Duis sed condimentum ex. Cras eleifend est ante.
Suspendisse gravida a enim a imperdiet.
In vel venenatis purus.
- Vivamus ultrices eleifend maximus.
Aenean egestas ligula vitae eleifend scelerisque.
```
should remain unaffected.
What are the conditions in which one needs to start the line with a + rather than just an indentation? It looks like whenever the line starts with a space, but also, I think the first line always needs it?

MichalMarsalek commented 2 months ago

I can imagine using + like this:

- First line
  Second line
+ 
  Fourth line

means

-
  > First line
  > Second line
  >
  > Fourth line

It is used at the previous level of indentation to explicitly mark the line as a part of the value. This solves representing multiple paragraphs, but it doesn't solve leading spaces.

KenKundert commented 2 months ago

It is not a proposal at all, just an observation that I find interesting.

You are correct, this idea does not extend the previous tagless string, it is a way of combining an adjacent list of strings into a larger string.

The current tagless string proposal only allows the specification of a single paragraph. One could specify multiple paragraphs by combining several tagless strings in a list, but then you get a list of paragraphs, not a block of text with multiple paragraphs. Hence the idea of a new join tag that allows one to specify a list of paragraphs and have them combined into a single block of text.

I am not suggesting that we add this new tag because while it can do everything the traditional multiline string can do, in general it is more awkward. The only time it may be preferred is when one is entering long simple paragraphs and the editor does not support the automatic entry of the leading '> ' on each line.

MichalMarsalek commented 2 months ago

One could specify multiple paragraphs by combining several tagless strings in a list, but then you get a list of paragraphs, not a block of text with multiple paragraphs. Hence the idea of a new join tag that allows one to specify a list of paragraphs and have them combined into a single block of text.

Ah, you should have said that right away, it didn't really make sense to me but now that you said how you got there, it makes perfect sense. But I agree that it is awkward.

MichalMarsalek commented 2 months ago

I don't think the goal should necessarily be to allow representation of as many strings as possible. There's always the general > syntax one can fall back to. Single-paragraph strings in which only the first line can be indented feel like an extension that causes minimal confusion to the human parsers while not needing any (explicit) new syntax. It seems that anything else automatically either has problems with ambiguity for the human parsers regarding whitespace or requires more explicit different syntax elements/modes which leads to the format becoming YAML...

KenKundert commented 2 months ago

The dreaded YAML ... We don't want that. ;-)

MichalMarsalek commented 2 months ago

The dreaded YAML ... ;-)

Yeah... we don't want that... that's why I think that if we think supporting whitespace is ambiguous to the users, the only way we can extend at all is to stop at the single paragraph strings without indented continuation lines.

KenKundert commented 2 months ago

Let me take some time on this. I'll get back to you within a couple of weeks at the latest. That should give plenty of time for others to comment if they feel the need.

MichalMarsalek commented 2 months ago

Alright, I'll work on the C# library (and then maybe Nim) in the meantime. Whatever the outcome, thank you both for the discussion, it's been very interesting!

LewisGaul commented 2 months ago

To be honest I'm a bit concerned that this would even be considered - it breaks some fundamental properties/design principles that led to my interest in nestedtext.

KenKundert commented 2 months ago

Can you be specific?

AndydeCleyre commented 2 months ago

While not the person you're asking, the one that sticks out to me is being able to identify the type of any given line without referring to its context.
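To make that property concrete, here is a rough Python sketch of a context-free line classifier. It is a simplified approximation of the NestedText tag rules written for illustration, not the reference implementation, and the function name `line_type` is hypothetical.

```python
def line_type(line: str) -> str:
    """Classify a single line by its leading tag alone, ignoring neighbours."""
    stripped = line.lstrip(" ")
    if stripped.strip() == "":
        return "blank"
    if stripped.startswith("#"):
        return "comment"
    # A tag is the character alone on the line or followed by a space.
    for tag, kind in (("-", "list item"), (">", "string item"), (":", "key item")):
        if stripped == tag or stripped.startswith(tag + " "):
            return kind
    if stripped[0] == "[":
        return "inline list"
    if stripped[0] == "{":
        return "inline dict"
    if ": " in stripped or stripped.endswith(":"):
        return "dict item"
    return "unrecognized"
```

Today a line like `    def` is simply unrecognized, and `    - def` is always a list item. Under the proposal, both could instead be continuation text, and which one applies can only be decided by looking at the preceding lines.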

LewisGaul commented 2 months ago

Can you be specific?

I've been explicit and detailed about my concerns in https://github.com/KenKundert/nestedtext/issues/49#issuecomment-2343318162 and subsequent comments, I don't really have anything else to say.

KenKundert commented 2 months ago

Okay, thanks.

Here is another issue that just occurred to me.

If someone is editing an existing NT document and needs to add indentation or an empty line to a value specified as a tagless string, then they will need to convert the whole string to the traditional multiline string form. Thus, a small change would result in a disproportionate amount of effort.

KenKundert / nestedtext

Tagless multiline strings #49