cbor-wg / edn-literal

Application-oriented literals for CBOR extended diagnostic notation

Major change: modify RFC 8610 G.4 (concatenated strings) with an explicit concatenation operator (`+`)? #42

Closed cabo closed 2 months ago

cabo commented 2 months ago

When we discussed the general use of EDN for human input, one desire that came up was to get rid of required commas, maybe the way CDDL does.

This is generally doable (not without a little pain). However, it is incompatible with RFC 8610 G.4 (concatenated strings).
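A minimal sketch of why the two features collide, using a toy tokenizer (double-quoted strings without escapes, plus commas) rather than the real EDN grammar. The function names are made up for illustration. The same input yields different items depending on whether juxtaposed strings are concatenated (RFC 8610 G.4) or treated as separate items (optional commas):

```python
import re

# Toy tokens: double-quoted strings without escapes, and commas.
# This is an illustration only, not the EDN grammar.
TOKEN = re.compile(r'\s*(?:"([^"]*)"|(,))')

def tokens(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        out.append(("comma", ",") if m.group(2) else ("str", m.group(1)))
        pos = m.end()
    return out

def read_implicit_concat(text):
    """G.4-style reading: adjacent strings merge, commas separate items."""
    items, current = [], None
    for kind, value in tokens(text):
        if kind == "str":
            current = value if current is None else current + value
        elif current is None:
            raise ValueError("comma without a preceding item")
        else:
            items.append(current)
            current = None
    if current is not None:
        items.append(current)
    return items

def read_optional_commas(text):
    """Comma-free reading: every string is its own item, commas are ignored."""
    return [value for kind, value in tokens(text) if kind == "str"]

sample = '"abcd" "efgh", "x"'
print(read_implicit_concat(sample))   # ['abcdefgh', 'x']
print(read_optional_commas(sample))   # ['abcd', 'efgh', 'x']
```

A grammar cannot offer both readings for the same juxtaposition, which is the incompatibility referred to above.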

Can we get rid of that feature? (It is not implemented in cbor.me, so I'm too biased to answer this.) Of course, we'd do this to make commas optional right away as well.

If yes, what do we put in instead? "abcd" + "efgh" maybe? (Or any other recognizable "cat" operator.)

Any other surgery needed?

foo 'bar' might become distinct from foo'bar' at least for the barewords we support:

%s"Infinity"
%s"NaN"
%s"false"
%s"true"
%s"null"
%s"undefined"
%s"simple(" S item S ")"

Or maybe we just actually reserve those and make them unavailable for app-prefix.

Ah, the tree of temptation...

chrysn commented 2 months ago

If I disregard any chair hat and procedural considerations: Yes please, that brings things closer to CDDL; it's not like people expect C style concatenation, and it's not the most widely used/supported feature.

(I might even throw in a | or || into the concatenation operator pool, with a nod to cryptography people using it).

Marking the barewords as reserved for app literals sounds doable (especially since the floaty ones are mixed case and thus ineligible anyway; sadly we can't capitalize the others without losing JSON interoperability).

If you decide to make a PR out of that, I think I can create a branch of my implementation that follows.


Then there is the aspect of roadmap -- sure this puts us back into the WGLC-required phase. I'd consider it worth it, but that's eventually for the ADs (for the push back into the WG) and the WG to decide.

chrysn commented 2 months ago

Before going all-space-is-space, is there any merit to the middle ground where we do introduce a concatenation operator but keep the commas mandatory for now? That way, a grammar update could still later make the commas optional without hitting this particular obstacle, and non-validating consumers can use a comma-free grammar.

As long as we do have mandatory commas, this also simplifies implementations that handle comments, because rather than having (for some t) S t S t S and S t S "," S t S chains we only have S t S delim S t S style chains (with the delim from [,+] or whatever you pick as concatenation operator).

cabo commented 2 months ago

I'm not sure I understand your approach, but I have one observation: There needs to be an operator precedence between "," and "+" (or whatever character we use for that), at least if we want the AST to be useful (which helps implementers immensely).
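To make the precedence point concrete, here is a toy recursive-descent sketch in which "+" binds tighter than ",". It assumes mandatory commas and double-quoted strings without escapes; the node names ("seq", "cat", "str") and function names are invented for this illustration and are not taken from any EDN implementation:

```python
import re

# Toy tokens: double-quoted strings without escapes, "," and "+".
TOKEN = re.compile(r'\s*(?:"([^"]*)"|([,+]))')

def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        out.append(("op", m.group(2)) if m.group(2) else ("str", m.group(1)))
        pos = m.end()
    return out

def parse_concat(toks, i):
    """concat = string *("+" string); binds tighter than the comma."""
    parts = []
    while True:
        if i >= len(toks) or toks[i][0] != "str":
            raise ValueError("expected a string")
        parts.append(toks[i][1])
        i += 1
        if i < len(toks) and toks[i] == ("op", "+"):
            i += 1
            continue
        break
    node = ("str", parts[0]) if len(parts) == 1 else ("cat", parts)
    return node, i

def parse_sequence(text):
    """sequence = concat *("," concat)"""
    toks = tokenize(text)
    node, i = parse_concat(toks, 0)
    items = [node]
    while i < len(toks) and toks[i] == ("op", ","):
        node, i = parse_concat(toks, i + 1)
        items.append(node)
    if i != len(toks):
        raise ValueError("unexpected token")
    return ("seq", items)

print(parse_sequence('"a", "b" + "c"'))
# ('seq', [('str', 'a'), ('cat', ['b', 'c'])])
```

With this ordering, each "cat" node sits below the sequence node, so the AST keeps the concatenation visible as a unit inside its item.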

cabo commented 2 months ago

I made a rough prototype of edn-abnf with explicit concatenation (and optional commas everywhere). You can see it in the edn-abnf PoC. Install the ec variant with:

gem install edn-abnf-ec

(ec stands for explicit concatenation).

You can now compare the output of edn-abnf-ec against that of (unchanged) edn-abnf.

The five changes (four to allow optional comma (OC), one for ec) can be seen here:

https://github.com/cabo/edn-abnf/pull/1/files#diff-bc1c8602a

(There are some intermediate compilation results in the repo, these result from the changes in the actual attributed grammar source file .abnftt.)

Which concatenation operator?

I chose + as a separator. This has a slightly weird interaction with the leading "+" we allow in numbers:

$ echo "'a''b'+'c'+1'd'1(0)" | edn-abnf-ec -tdiag - | diag2diag.rb -et
'a', 'bc', 1, 'd', 1(0)

(You would normally write this with spaces to make it readable, like in the output; this would make 'a' + 'b' stand out from +1.)
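One rule that is consistent with the output shown above is: a "+" counts as concatenation only when it directly joins two strings, and is otherwise the sign of a number. The sketch below implements that rule for a toy subset. The rule itself is an assumption, not something read out of the edn-abnf-ec grammar; the function name is hypothetical, single-quoted strings are modelled as plain Python strings, and the tagged item 1(0) from the example is left out:

```python
import re

STRING = re.compile(r"'([^']*)'")   # single-quoted string, no escapes
NUMBER = re.compile(r"[+-]?\d+")    # integer with optional sign

def parse_items(text):
    n = len(text)

    def skip_ws(p):
        while p < n and text[p].isspace():
            p += 1
        return p

    def skip_separators(p):
        # Commas are optional in this toy grammar: treat them like whitespace.
        while p < n and (text[p].isspace() or text[p] == ","):
            p += 1
        return p

    items, pos = [], 0
    while True:
        pos = skip_separators(pos)
        if pos >= n:
            break
        m = STRING.match(text, pos)
        if m:
            value, pos = m.group(1), m.end()
            # Absorb "+ 'string'" only while the plus is followed by a string.
            while True:
                q = skip_ws(pos)
                if q < n and text[q] == "+":
                    m2 = STRING.match(text, skip_ws(q + 1))
                    if m2:
                        value += m2.group(1)
                        pos = m2.end()
                        continue
                break
            items.append(value)
            continue
        m = NUMBER.match(text, pos)
        if m:
            items.append(int(m.group()))
            pos = m.end()
            continue
        raise ValueError(f"unexpected input at offset {pos}")
    return items

print(parse_items("'a''b'+'c'+1'd'"))   # ['a', 'bc', 1, 'd']
```

The second "+" is not followed by a string, so it falls through to the number rule and becomes the sign of 1.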

Testing

I did not test this a lot yet. Against a corpus of examples in RFCs and I-Ds, I find:

Comment about string concatenation from SauerkrautLM:

Here are some common string concatenation operators in various programming languages:

Note that these operators might not be as widely used as the + (plus) operator for string concatenation, but they are still valid and commonly used in their respective languages.

Please vote now ;-)

OR13 commented 2 months ago

I must have stepped out for this part...

I don't love this:

"Herewith I buy" + ... + "gned: Alice & Bob"

I'm not sure if it's possible, but could ... have implicit concatenation for generic partials defined?

For example

But then explicit concatenation for non-elided instances?

I can imagine choosing .. as concatenation operator would be a nightmare given that ... is used for elision, but just for fun:

"Herewith I buy" .. ... .. "gned: Alice & Bob"

cabo commented 2 months ago

  • "Herewith I buy" ... "gned: Alice & Bob" implicit string concatenation with a string elision.
  • [ 0, 1 ... 8, 9] implicit list concatenation with a list elision.

We could give "..." some additional syntactic sugar. The array example already works, anyway:

$ echo '[ 0, 1 ... 8, 9]' | edn-abnf-ec -tdiag -
[0, 1, 888(null), 8, 9]

However, mixing the syntaxes gets complicated quickly, e.g., with ["a", "f" ... "m", "q"]

$ echo '[ "a", "f" ... "m", "q"]' | edn-abnf-ec -tdiag -
["a", "f", 888(null), "m", "q"]

Today, the ellipsis attaches to the string:

$ echo '[ "a", "f" ... "m", "q"]' | edn-abnf -tdiag -
["a", 888(["f", 888(null), "m"]), "q"]

Which of these is "right"?
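The two results can be written down as data to compare their shapes. This sketch models a tag as a small dataclass and takes as input the comma-separated groups of the example; the class and function names are made up for illustration, and the code only mirrors the two outputs shown above rather than either tool's parser:

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Tag:
    number: int
    content: Any

# The outputs above render a bare "..." as 888(null).
ELLIPSIS = Tag(888, None)

def flat_reading(groups):
    """edn-abnf-ec style: the ellipsis is just another array item."""
    return [item for group in groups for item in group]

def attached_reading(groups):
    """edn-abnf style: an elided run between commas is wrapped in tag 888."""
    out = []
    for group in groups:
        if len(group) == 1:
            out.append(group[0])
        else:
            out.append(Tag(888, list(group)))
    return out

# [ "a", "f" ... "m", "q"] split at its commas
groups = [["a"], ["f", ELLIPSIS, "m"], ["q"]]

print(flat_reading(groups))      # five items, the third being 888(null)
print(attached_reading(groups))  # three items, the second being 888([...])
```

The flat reading yields a five-element array, while the attached reading yields three elements with the elided string as a single item, so a consumer counting array positions would see different results.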

OR13 commented 2 months ago

I'm biased by my experience with the "spread operator" ... https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/Spread_syntax

And its support for typescript partials ... https://www.executeprogram.com/courses/everyday-typescript/lessons/partial-in-practice

cabo commented 2 months ago

> I'm biased by my experience with the "spread operator"

Right, Ruby has had these for a while as * (arrays and positional parameter lists) and ** (hashes and keyword parameter lists), and of course Scheme has had unquote-splicing ,@ since the dark ages.

But these always have a variable-like thing that provides (or receives) the spread. Maybe we need to separate splicing ellipses from free-standing ones, just like Scheme does. But then, neither unquote (, in Scheme) nor unquote-splicing (,@) attach to neighboring syntactic features. And we don't really want to say how this resolves.

To me, this is mainly about preserving the beauty of "Herewith I buy" ... "gned: Alice & Bob"; this doesn't really generalize (or already does, as with arrays and to a limited extent maps).

chrysn commented 2 months ago

One data point: Section 3.5 of RFC 9529 pioneered the comma-free version of EDN in the line that says 5c47bf16df96660a41298cb4307f7eb6' /x/ and is followed by the y coordinate without any comma ;-) (Holding off on reporting that as an erratum there because, while the then-current version of EDN had commas, it was not really formal.)

OR13 commented 2 months ago

@chrysn this issue seems to be about string concatenation, but your comment is about optional commas.

Is there some implication or interaction you are suggesting? I don't follow.

Edit: your point is obvious, now that I have had a single sip of coffee.

chrysn commented 2 months ago

Making commas optional is the motivating driver for doing string concatenation differently (Carsten pointed this out in the top-most item): As long as we have implicit concatenation, commas can't be made optional.

My impression of this issue is that if we really go that way this late in the process, then the commas would become optional in a second change in the same PR.

rohanmahy commented 2 months ago

>   • "Herewith I buy" ... "gned: Alice & Bob" implicit string concatenation with a string elision.
>   • [ 0, 1 ... 8, 9] implicit list concatenation with a list elision.
>
> We could give "..." some additional syntactic sugar. The array example already works, anyway:

I don't think we want to be making the rules for elision more complicated. Another option is to make elision only work inside h'' and b64''.

"Herewith I buy" + h'...' + "gned: Alice & Bob"

Also, I want to point out that "..." is a map key for selective disclosure JWTs and possibly for selective disclosure CWTs as well. That could make misreading really ugly.

cabo commented 2 months ago

> My impression of this issue is that if we really go that way this late in the process, then the commas would become optional in a second change in the same PR.

Well, the change is near trivial.

Of the five small changes in https://github.com/cabo/edn-abnf/pull/1/files#diff-bc1c860

(The attributed grammar in edngrammar.abnftt needs one more change, which is about picking up the right subtree for AST building after inserting elements that need to be counted -- my abnftt grammar does not currently have labels.)