cbor-wg / edn-literal

Application-oriented literals for CBOR extended diagnostic notation

ABNF approach: Special casing h'' and b64'' to be single-layer? #41

Open cabo opened 3 months ago

cabo commented 3 months ago

Rohan proposes that:

The ABNF have all known app strings parsed in a single pass with the rest of the EDN document

Archived-At: <https://mailarchive.ietf.org/arch/msg/cbor/yLhPGvXKC4uBmCEwAb3Kq3ExGGM>

"All known app strings" sounds like an extensibility nightmare -- we put in a registry to make this easy to extend. It is also much easier to build a mental model where all foo'' constructs are treated the same way.

More discussion in: Archived-At: https://mailarchive.ietf.org/arch/msg/cbor/iDsBOW-2nKSPPfiIgGWtAmSVWTw

chrysn commented 3 months ago

For one data point, I found this construction conveniently usable. In particular, it spares the downstream parser the hassle of dealing with escaped characters, which admittedly are not required anyway in the existing app strings.

There is no good reason to explicitly allow h'0\u0030', but unless we want to allow concrete app strings to alter the parsing rules (which I'd strongly discourage), that's better than the alternative of special handling per app string.

[edit: Should probably have gone into mail to the thread.]

cabo commented 3 months ago

Note that, unless the intention is to make two-layer implementations non-conforming, a one-layer implementation can be built from the current ABNF. Munched-up ABNF for this could be recorded on a wiki page or in a reference implementation.

chrysn commented 3 months ago

I'm not sure I understand the implications of a 1-layer approach correctly:

Currently, h'00 / ' / 11' is a syntax error, and h'00 / \' / 11' is allowed (and the comment contains no backslash), and h'00 # foo\n 11' is two bytes because the comment terminates. Would a single-pass grammar change anything about that? If so, can a processor still transform 999("unknown", "a ' b") into whichever is the legal form, both if it knows and if it does not know the rules of unknown?
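To make the three cases above concrete, here is a minimal Python sketch of the two-layer reading for h'' only. It is not taken from the draft or from any reference implementation: the function names (`unescape_sq`, `decode_hex_body`, `parse_h`) are hypothetical, the escape table is an assumed JSON-like subset, and surrogate pairs in \u escapes are not handled. The outer layer only finds the closing quote and resolves escapes; the inner layer only sees the unescaped text and knows about hex digits, whitespace, and the two comment forms.

```python
# Assumed escape set (JSON-like); the real EDN rules may differ in detail.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f",
                  "'": "'", '"': '"', "\\": "\\", "/": "/"}

def unescape_sq(src: str, start: int) -> tuple[str, int]:
    """Outer layer: read a single-quoted string whose opening quote is at
    src[start]; return (unescaped content, index just past the closing quote)."""
    if src[start] != "'":
        raise ValueError("expected opening single quote")
    out, i = [], start + 1
    while i < len(src):
        c = src[i]
        if c == "'":                      # a bare quote always ends the string
            return "".join(out), i + 1
        if c == "\\":
            e = src[i + 1:i + 2]
            if e == "u":
                hex4 = src[i + 2:i + 6]
                if len(hex4) != 4:
                    raise ValueError("truncated \\u escape")
                out.append(chr(int(hex4, 16)))
                i += 6
            elif e in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[e])
                i += 2
            else:
                raise ValueError("unsupported escape")
        else:
            out.append(c)
            i += 1
    raise ValueError("unterminated single-quoted string")

def decode_hex_body(body: str) -> bytes:
    """Inner layer: hex digits, whitespace, / ... / and # ... newline comments."""
    digits, i = [], 0
    while i < len(body):
        c = body[i]
        if c == "/":
            end = body.find("/", i + 1)
            if end < 0:
                raise ValueError("unterminated / comment")
            i = end + 1
        elif c == "#":
            end = body.find("\n", i + 1)
            i = len(body) if end < 0 else end + 1
        elif c.isspace():
            i += 1
        elif c in "0123456789abcdefABCDEF":
            digits.append(c)
            i += 1
        else:
            raise ValueError(f"unexpected character {c!r} in hex body")
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex("".join(digits))

def parse_h(src: str) -> bytes:
    """Parse one complete h'...' literal using the two layers above."""
    if not src.startswith("h'"):
        raise ValueError("expected h'")
    content, end = unescape_sq(src, 1)
    if end != len(src):
        raise ValueError("trailing text after closing quote")
    return decode_hex_body(content)
```

Under these assumptions, `parse_h(r"h'00 / \' / 11'")` and `parse_h(r"h'00 # foo\n 11'")` both yield the two bytes 00 11, while `parse_h(r"h'00 / ' / 11'")` raises because the bare quote ends the string at the outer layer.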

cabo commented 3 months ago

The problem here is that it is not that hard to write a single-level grammar that captures most of the cases. It is harder to do this in a correct way, and I am not aware of ready-made tools that would support this. Replacing the two-level grammar by an authoritative single-level grammar would create risk that I think we do not need. I'd rather take up the formulation of a single-level grammar as a desirable implementation project (destination: a github repo or another draft) than slow down this specification while trying to manage that risk.
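For illustration only, a hand-written single-pass scanner for h'' might look like the Python sketch below. It is not a grammar and does not settle the question of whether an authoritative single-level ABNF can be made correct. The name `parse_h_one_pass`, the escape table, and the error messages are all assumptions made for this example. The idea is to resolve each escape as it is encountered and feed the resulting logical character to a small state machine, so that a bare single quote always ends the literal while an escaped one is ordinary content.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"
# Assumed escape set (JSON-like); the real EDN rules may differ in detail.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "b": "\b", "f": "\f",
           "'": "'", '"': '"', "\\": "\\", "/": "/"}

def parse_h_one_pass(src: str) -> bytes:
    """Scan one complete h'...' literal in a single left-to-right pass."""
    if not src.startswith("h'"):
        raise ValueError("expected h'")
    digits = []
    mode = "hex"          # "hex", "slash" (in / ... /) or "hash" (in # ... newline)
    i = 2
    while i < len(src):
        c = src[i]
        if c == "'":      # bare quote: end of literal, whatever the mode
            if mode == "slash":
                raise ValueError("unterminated / comment")
            if i + 1 != len(src):
                raise ValueError("trailing text after closing quote")
            if len(digits) % 2:
                raise ValueError("odd number of hex digits")
            return bytes.fromhex("".join(digits))
        if c == "\\":     # resolve the escape into one logical character
            e = src[i + 1:i + 2]
            if e == "u":
                hex4 = src[i + 2:i + 6]
                if len(hex4) != 4:
                    raise ValueError("truncated \\u escape")
                c, i = chr(int(hex4, 16)), i + 6
            elif e in ESCAPES:
                c, i = ESCAPES[e], i + 2
            else:
                raise ValueError("unsupported escape")
        else:
            i += 1
        # From here on, c is the logical (already unescaped) character.
        if mode == "slash":
            if c == "/":
                mode = "hex"
        elif mode == "hash":
            if c == "\n":
                mode = "hex"
        elif c == "/":
            mode = "slash"
        elif c == "#":
            mode = "hash"
        elif c in HEX_DIGITS:
            digits.append(c)
        elif not c.isspace():
            raise ValueError(f"unexpected character {c!r}")
    raise ValueError("unterminated literal")
```

On the examples discussed in this thread, the sketch gives the same answers as a two-layer reading: `h'00 / \' / 11'` and `h'00 # foo\n 11'` are two bytes, and `h'00 / ' / 11'` is rejected. Agreement on a handful of inputs is of course not evidence of equivalence in general.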

rohanmahy commented 3 months ago

> Rohan proposes that:

FYI, this is proposed in #49

rohanmahy commented 1 month ago

> I'm not sure I understand the implications of a 1-layer approach correctly:
>
> Currently, h'00 / ' / 11' is a syntax error, and h'00 / \' / 11' is allowed (and the comment contains no backslash), and h'00 # foo\n 11' is two bytes because the comment terminates. Would a single-pass grammar change anything about that?

As the ABNF has been written it does not. I recently added a commit after Joe brought up that bare single quotes in comments had not been allowed even outside of single-quoted strings. That has been fixed.

> If so, can a processor still transform 999("unknown", "a ' b") into whichever is the legal form, both if it knows and if it does not know the rules of unknown?

Depending on the rules for quoting a 999-tagged value, an implementation could still turn various tagged CBOR back into app-strings. So I don't think your question is quite correct. The CBOR tstr

    78 22 # text(34)
       74657374696E67202274657374225C277465737427202F746573742F202374657374

is literally the string

    testing "test"\'test' /test/ #test

To encode that in EDN as a double-quoted string, you would need to write:

    999("unknown","testing \"test\"\\'test' /test/ #test")

or, as an app-string:

    unknown'testing "test"\\\'test\' /test/ #test'
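The escaping in that example can be checked mechanically. The Python sketch below decodes the CBOR text string from the hex dump and re-emits it in both EDN spellings. The helper names (`decode_tstr`, `edn_double_quoted`, `edn_app_string`) are hypothetical, only the short and one-byte length forms of CBOR are handled, and the escaping rules are assumed to be "backslash-escape the backslash and the delimiting quote", which is all this particular string needs.

```python
def decode_tstr(data: bytes) -> str:
    """Decode a single definite-length CBOR text string (major type 3)."""
    major, info = data[0] >> 5, data[0] & 0x1F
    if major != 3:
        raise ValueError("not a text string")
    if info < 24:
        length, offset = info, 1          # length is in the initial byte
    elif info == 24:
        length, offset = data[1], 2       # one-byte length follows
    else:
        raise ValueError("longer length forms omitted from this sketch")
    body = data[offset:offset + length]
    if len(body) != length or offset + length != len(data):
        raise ValueError("length does not match the data")
    return body.decode("utf-8")

def edn_double_quoted(text: str) -> str:
    """Escape backslash and double quote, then wrap in double quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def edn_app_string(prefix: str, text: str) -> str:
    """Escape backslash and single quote, then wrap as prefix'...'."""
    return prefix + "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"

encoded = bytes.fromhex(
    "7822"
    "74657374696E67202274657374225C277465737427202F746573742F202374657374"
)
text = decode_tstr(encoded)
print(text)
print('999("unknown",' + edn_double_quoted(text) + ")")
print(edn_app_string("unknown", text))
```

Note that the double-quoted form escapes the double quotes and leaves the single quotes alone, while the app-string form does the opposite; the backslash is doubled in both.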

rohanmahy commented 1 month ago

> The problem here is that it is not that hard to write a single-level grammar that captures most of the cases. It is harder to do this in a correct way, and I am not aware of ready-made tools that would support this. Replacing the two-level grammar by an authoritative single-level grammar would create risk that I think we do not need. I'd rather take up the formulation of a single-level grammar as a desirable implementation project (destination: a github repo or another draft) than slow down this specification while trying to manage that risk.

You are saying that the convenience of the writer of extension specs of EDN is more important than the usability of the grammar for implementers. I think this is a terrible trade-off.

I also happen to disagree with your assessment. There are currently only 2 more proposed app strings that I am aware of (e and ref). Neither had a proposed inner ABNF until I took 15 minutes to write one for the single-layer ABNF in edn-e-ref issue #1. The effort required to make this compatible was to add a single boilerplate known-app-str =/ e / ref line and to take the working ABNF for the inside and remove bare single quotes from the allowable characters. The bulk of the work was actually fixing the ABNF to remove unused productions, which would have been required regardless of how many layers of ABNF were required.

In summary, there is currently exactly one data point available and it contradicts your argument.