Diagnostic notation wrong for floating-point numbers

PJK commented 7 years ago

Hi, I came across the following inconsistency:

cbor2diag ignores width of floating-point numbers and thus is not bijective, as one would expect.

The RFC specifies that the width should be encoded using _n suffix, but it's not, therefore the information is lost, as demonstrated by the following example:

perl -e' print "\xfb\x00\x00\x00\x00\x00\x00\x00\x00"' | cbor2diag.rb | diag2cbor.rb | hexdump

prints

0000000 f9 00 00                                       
0000003

instead of \xfb\x00\x00\x00\x00\x00\x00\x00\x00.

Moreover, the parser doesn't seem to support parsing the _ width suffixes at all:

pavel@dt3:~|⇒  echo 1.0_2 | diag2cbor.rb
*** can't parse 1.0_2
*** Expected one of [0-9], [Ee], [ \t\n\r], "/" at line 1, column 4 (byte 4) after

This in general goes back to the design of Diagnostic Notation which doesn't specify the exact rules for encoding 'logical' values to CBOR items. The same problem occurs with e.g. lengths of all items, where this implementation also chooses the 'canonical' encoding over preserving the full information:

pavel@dt3:~|⇒  perl -e' print "\x98\x00"' | cbor2diag.rb | diag2cbor.rb | hexdump  
0000000 80                                             
0000001

If these are the intended semantics of DN, then the standard should be amended with unambiguous specification of 'Canonical' CBOR, which would render section 6.1 unnecessary.

Otherwise the implementation should be changed to at least allow parsing all valid DN inputs. In this case, cbor2diag 'composed' with diag2cbor can never be an identity unless DN is extended to allow specifying width in all constructs where multiple choices produce valid and logically equivalent item.

jyasskin commented 7 years ago

cbor2diag.rb also loses the fact that integers are encoded with a non-minimal encoding. This comes at least partially from the fact that the cbor decoder represents integers and floating numbers as native Ruby types, rather than including their size information.

cabo commented 7 years ago

Indeed, cbor-diag is based on cbor-pure, which maps the CBOR data model to Ruby data structures. Adding meta information such as encoding sizes is not easy in the latter (I have a partial solution, but there are some Ruby bugs around floating point numbers). Maybe we need to go away from actual Ruby data here and treat integers and primitives the same way we already treat unknown simples.

So far, I haven't needed these capabilities -- what is your use case that requires full roundtripping?

jyasskin commented 7 years ago

In https://github.com/dimich-g/webpackage/pull/50/files#diff-04c6e90faac2675aa89e2176d2eec7d8R128, I'm requiring that the trailing length be encoded in 8 bytes, so that tools parsing from the end of the file can tell whether h'1b0000001a00191817' encodes 111670794263, 1644567, 6167, or 23. I'd like the diagnostic format to be able to diagnose mistakes in that encoding and to produce that encoding.

cabo commented 7 years ago

I see. You would get the same effect from using any CBOR encoder, including those that don't give you control over encoding lengths, by just describing that field as a byte string of eight bytes and then just saying at the semantic level that this is a length encoded in network byte order. Small cognitive dissonance, much better leverage of the existing CBOR ecosystem.

jyasskin commented 7 years ago

Yeah, that's fair. Another use, which I don't need yet, is for cases that require Canonical CBOR. If an item fails to parse due to being non-canonical, the diagnostic form should make it possible to distinguish that item from its canonical form.

cabo commented 7 years ago

The above thread notwithstanding, it should indeed be possible to use this gem to get more representation information. I will try to extract some code from the cddl gem to use here, but it will probably take a week or two until we have something to play with.

cabo / cbor-diag

Diagnostic notation wrong for floating-point numbers #6