cbor-wg / edn-literal

Application-oriented literals for CBOR extended diagnostic notation

Encoding indicators #11

Closed chrysn closed 10 months ago

chrysn commented 1 year ago

My mail at https://mailarchive.ietf.org/arch/msg/cbor/x9xl2lqqSNBK_wtApzo6H6ak8N4 got a bit lost, moving it here to keep track.

Rephrasing what is in there:

EDN without EDN literals (notably also without <<>> embedded CBOR) allows round-tripping from CBOR to DN and back to CBOR, even when the CBOR is not ideally (which is also deterministically) encoded, provided the CBOR->DN conversion annotates the explicit encoding indicators (at least where the encoding is not using the size the encoder would use).
EDN literals cannot (currently) be annotated with bit widths, and that cannot be fixed trivially.
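To make the round-tripping point concrete, here is a minimal Python sketch of how the RFC 8949 encoding indicators _0 to _3 select the argument width of an unsigned integer. The function name encode_uint and the WIDTHS table are made up for this illustration; they are not part of any existing tool.

```python
import struct

# Encoding indicator -> (additional information, struct format).
# _0.._3 correspond to ai = 24..27, i.e. 1+1, 1+2, 1+4 and 1+8 byte heads.
WIDTHS = {
    "_0": (24, ">B"),
    "_1": (25, ">H"),
    "_2": (26, ">I"),
    "_3": (27, ">Q"),
}


def encode_uint(value, indicator=None):
    """Encode a CBOR unsigned integer (major type 0).

    indicator=None picks the preferred (shortest) encoding; otherwise the
    width named by the indicator is forced, even if it is longer than needed.
    """
    if indicator is None:
        if value < 24:
            return bytes([value])  # 1+0: value lives in the initial byte
        for ai, fmt in WIDTHS.values():
            if value < 1 << (8 * struct.calcsize(fmt)):
                return bytes([ai]) + struct.pack(fmt, value)
        raise ValueError("value does not fit into 64 bits")
    ai, fmt = WIDTHS[indicator]
    return bytes([ai]) + struct.pack(fmt, value)


print(encode_uint(1).hex())        # preferred encoding of 1
print(encode_uint(1, "_1").hex())  # 1_1: same value, 1+2 byte head
```

A CBOR->DN converter that sees the second encoding has to emit 1_1 rather than 1 if the bytes are to survive the trip back.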

Actions which I think would be good are:

State which EDN literals can carry bit width annotations and which cannot. (For example, I expect that for CRIs bit width annotations have no meaningful interpretation, whereas they do for hex or base64 strings).
If the topic of round-tripping ever comes up, point out that not all EDN expressions allow bit-perfect round-tripping by using simpler literals and encoding indicators.

Looking at the details of 8949 encoding indicators, I also found that chunked strings can be expressed by using prefixed strings on every single chunk. Does that capability stay limited to the pseudo-EDN-literals h/b32/h32/b64/base64url, or can new literals go in there if they expand to a string? (That question is not actually new ... was (_ <<1>>, <<2, 3>>) a valid way to write (_ h'01', h'0203') aka 5F4101420203FF?) The most straightforward way here is probably to just allow the pseudo-EDN-literals there and be done with it; it's not like we can't still allow it later (it still isn't an interchange format).
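As a sanity check on the chunked example, the following Python sketch builds the indefinite-length byte string from its two chunks. The helper names are hypothetical, and for brevity the sketch only handles chunks shorter than 24 bytes.

```python
def encode_bstr(chunk: bytes) -> bytes:
    """Definite-length byte string (major type 2), short chunks only."""
    if len(chunk) >= 24:
        raise ValueError("sketch only handles chunks shorter than 24 bytes")
    return bytes([0x40 | len(chunk)]) + chunk


def encode_chunked_bstr(chunks) -> bytes:
    """Indefinite-length byte string: 0x5F, the chunks, then the 0xFF break."""
    return b"\x5f" + b"".join(encode_bstr(c) for c in chunks) + b"\xff"


# (_ h'01', h'0203'); the chunks are also what <<1>> and <<2, 3>> expand to.
encoded = encode_chunked_bstr([bytes.fromhex("01"), bytes.fromhex("0203")])
print(encoded.hex().upper())  # 5F4101420203FF
```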

cabo commented 11 months ago

On the "streamstrings": I already updated the byte string branch to "bstr" in –04 (was sqstr, and that is too narrow). This syntactically includes app-prefix constructs, but not "embedded" -- the easiest fix would be to add a third alternative "embedded" to bstr.

We don't have much implementation experience with encoding indicators...

cabo commented 11 months ago

Uh oh. I hadn't looked at encoding indicators much for about a decade.

We don't seem to have a way to indicate 1+0 encoding (ai = 0..23). That is a surprising omission. Of course, with preferred encoding, you would never have to say this as the case would be selected automatically. Still...
Arrays and maps are a bit of an exception. I think the encoding indicator was meant to be coming after the opening bracket/brace, but the text requires a bit of inference to get there. So it's [_0 1], not [1]_0.
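For illustration, a small Python sketch of what the three array spellings would encode to, under the reading that the indicator follows the opening bracket. The function encode_array and its indicator convention (None, "" or "0".."3") are assumptions made for this sketch only.

```python
import struct

# Indicator digit -> struct format for the length argument (ai = 24 + digit).
FORMATS = {"0": ">B", "1": ">H", "2": ">I", "3": ">Q"}


def encode_array(items, indicator=None):
    """Encode an array (major type 4) from already-encoded items.

    indicator=None -> 1+0 head, as preferred encoding picks for short arrays
    indicator=""   -> indefinite length, written [_ ...]
    indicator="0".."3" -> head with ai = 24..27, written [_0 ...] to [_3 ...]
    """
    body = b"".join(items)
    count = len(items)
    if indicator == "":
        return b"\x9f" + body + b"\xff"
    if indicator is None:
        if count >= 24:
            raise ValueError("sketch only handles fewer than 24 items here")
        return bytes([0x80 | count]) + body
    ai = 24 + int(indicator)
    return bytes([0x80 | ai]) + struct.pack(FORMATS[indicator], count) + body


one = b"\x01"                          # the integer 1
print(encode_array([one]).hex())       # [1]
print(encode_array([one], "0").hex())  # [_0 1]
print(encode_array([one], "").hex())   # [_ 1]
```

Note that the None case is exactly the 1+0 encoding that currently has no explicit indicator of its own.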

chrysn commented 11 months ago

If "it's too broken, it needs to go" is your conclusion, I'll be a bit sad, but with me lacking sufficient time to provide fixes right now, that may be an outcome.

cabo commented 11 months ago

The "_" can be followed by any \w -- we could simply pull a new convention for 1+0 out of our hats, say, "__" or "_1plus0" :-)

cabo commented 11 months ago

We need examples for tagged — 1(_3 5)? 1_3(5)? 1(5)_3? Fortunately not for simple() -- that is deterministic

chrysn commented 11 months ago

For tagged, the cbor-diag crate interprets the existing text to lead to 32_0("https://cbor.nemo157.com")
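A rough Python sketch of the bytes that spelling would stand for, assuming the indicator after the tag number applies to the tag head (ai = 24, one-byte tag number). The helper names are invented for this example and do not mirror the cbor-diag crate's API.

```python
def encode_tstr(text: str) -> bytes:
    """Text string (major type 3) with preferred head, up to 255 bytes."""
    raw = text.encode("utf-8")
    if len(raw) < 24:
        return bytes([0x60 | len(raw)]) + raw
    if len(raw) < 256:
        return bytes([0x78, len(raw)]) + raw
    raise ValueError("sketch only handles strings shorter than 256 bytes")


def encode_tag_1plus1(tag: int, content: bytes) -> bytes:
    """Tag (major type 6) with a 1+1 head, i.e. what the _0 indicator selects."""
    if not 0 <= tag < 256:
        raise ValueError("a 1+1 head can only carry tag numbers 0..255")
    return bytes([0xC0 | 24, tag]) + content


item = encode_tag_1plus1(32, encode_tstr("https://cbor.nemo157.com"))
print(item[:4].hex())  # tag head followed by the text string head
```

For tag 32 the 1+1 head is also the preferred encoding, so a converter that only annotates non-preferred encodings would print this item without the _0.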

cabo commented 11 months ago

I copied that and completed the set (e.g., <<1>>_0) in PR #15

cabo commented 11 months ago

PR #15 is complete on the ABNF side, with considerable latitude given in the ABNF to what values the *wordchar in spec can take. I'd like to merge this first, and then:

new text should explain what values the *wordchar in spec can take (empty string for indefinite length on array and map; 0 to 3; and the new value, a second _??, for ai=0..23) and...
how this applies to application-extensions (essentially: the same way it would apply to their output items, but as an opt-in to be defined in detail by each application-extension, which we then need to do).

* We also should define something like a tag 999 for unimplemented application-extensions, as in 999(["dt", "4711"]), as proposed in #13. (Now PR #16)

I don't think we want to have extensive text about round-tripping, but

we could mention that creating non-basic diagnostic notation (e.g., b64 or application-extensions) requires additional information. If CDDL is used for that, ~time does this for dt''; but how to decide '' vs h'' vs b64''?
We could also mention that the preferred [sic] way of implementing encoding indicators in cbor-to-diag is to put in encoding indicators only where the encoding is not already preferred encoding.
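A minimal Python sketch of that rule for unsigned integers: decode the head, work out which additional information preferred encoding would have used, and append an indicator only when the two differ. The names preferred_ai and uint_to_diag are hypothetical, and a real cbor-to-diag tool would cover all major types.

```python
def preferred_ai(value: int) -> int:
    """Additional information that preferred encoding uses for this argument."""
    if value < 24:
        return value
    for n in range(4):
        if value < 1 << (8 << n):  # 8, 16, 32, 64 bits
            return 24 + n
    raise ValueError("value does not fit into 64 bits")


def uint_to_diag(data: bytes) -> str:
    """Render one unsigned integer, annotating only non-preferred encodings."""
    initial = data[0]
    if initial >> 5 != 0:
        raise ValueError("sketch only handles major type 0")
    ai = initial & 0x1F
    if ai < 24:
        return str(ai)  # 1+0 is always preferred for 0..23
    if ai > 27:
        raise ValueError("reserved or indefinite additional information")
    width = 1 << (ai - 24)
    if len(data) < 1 + width:
        raise ValueError("truncated input")
    value = int.from_bytes(data[1:1 + width], "big")
    if ai == preferred_ai(value):
        return str(value)
    return f"{value}_{ai - 24}"


for hexstr in ("01", "1801", "1901f4", "1a000001f4"):
    print(hexstr, "->", uint_to_diag(bytes.fromhex(hexstr)))
```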

chrysn commented 11 months ago

I think that that'd be a good "preferred way". Note that for indefinite length encoding, as it's never preferred, it'd mean that it's always rendered explicitly (as (_ 'foo' 'bar') or [_ ...] etc), and that's good. (For <<>> parts and application literals that contain them and have no described structure for their insides, that may mean that they are not used unless the cbor-to-diag tool is configured to ignore those lengths.)

On that additional notation, being ~time is a good indicator (I don't suppose we want DT"" to mean that it carries a tag too). For b64 it could be a ~'d tag 21, but I don't know where the CDDL for it would be best described.