Preferred Serialization and Canonical encoding in CBOR

Sekenre commented 6 years ago

The CBOR mailing list has been discussing the definition of Canonical in the standard and have been making changes. I wanted to document these and maybe discuss how they might be implemented in cbor2.

Updated draft standard: https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis

What constitutes a true canonical encoding can be redefined at will by protocol implementers, CBOR standard provides guidelines.

A general encoder/decoder like cbor2 will need to support a number of variations and validate them.

Constraints which have been discussed on the ietf mailing list and in the updated draft:

Shortest possible float representations
Fixed float representations
Fixed integer representations
Maps sorted by lexicographical ordering of encoded value (DRAFT)
Maps sorted by ascending length then lexicographic (RFC7049)

There may be tagged arrays created for fixed length binary encodings of float values. (Tag values TBD)

See: https://datatracker.ietf.org/doc/draft-ietf-cbor-array-tags/

Decoders may need to validate these by raising errors if the following conditions are met:

Indefinite length types
Floating point values not in shortest form
Floating point values not in fixed representation
Integers not in shortest form
Integers not in Fixed form
Unsorted maps
Maps sorted with the wrong algorithm
Maps with duplicate keys
Incorrect tag type

Instead of a single canonical=True argument there needs to be separate flags for each potential constraint.

For example, if a device expects only 16bit floating point data you could create the encoder like this:

encoder = CBOREncoder(f, float_format="binary16")
encoder.encode(data)

Or for a minimal float encoding and sorted maps using the encoded length

encoder = CBOREncoder(f, float_format="minimal", sort_maps=True, sort_by_length=True)
encoder.encode(data)

On the decoding side:

decoder = CBORDecoder(f, validate_floats_as="binary16")
result = decoder.decode()

decoder = CBORDecoder(f, validate_floats_as="minimal",
    validate_map_order=True,
    ordered_by_length=True,
    ignore_duplicate_keys=False)
result = decoder.decode()

Of course these argument names and the way they are set up are just intended as an example.

agronholm commented 6 years ago

At this point I think you should become the maintainer of cbor2. How about it?

Sekenre commented 6 years ago

I'd be honoured! I am inexperienced but very keen :grinning:

fsssosei commented 6 years ago

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

agronholm commented 6 years ago

What code where?

Sekenre commented 6 years ago

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

Hi @fsssosei could you open a new issue for this and I will look at it when I get the chance?

henryk commented 2 years ago

To add some food for thought: Here is an article discussing the different "canonical" encodings in CBOR: https://www.imperialviolet.org/2022/04/17/canonsofcbor.html It also proposes a naming scheme. RFC 7049 seems to describe "three-step" ordering (but could be read ambigously), RFC 8949 describes "one-step" ordering.

Best as I can tell, cbor2 currently implements three-step ordering. For starters, the documentation could point out the different ways a "canonical" CBOR can be canonical, and document the current state in the library.

Sekenre commented 2 years ago

Absolutely. Thanks @henryk for bringing that article to my attention. I have been playing around with splitting the canonical settings into their own options, i.e. the 3 options for map ordering, fixed or variable-sized floating point. Etc. Then have a backwards compatible default.

Of course it's easy in python, harder in C. I don't think I will implement validating whether something is canonical, but will document how someone could do so.

agronholm / cbor2

Preferred Serialization and Canonical encoding in CBOR #28