agronholm / cbor2

Python CBOR (de)serializer with extensive tag support
MIT License
225 stars 58 forks source link

Preferred Serialization and Canonical encoding in CBOR #28

Open Sekenre opened 6 years ago

Sekenre commented 6 years ago

The CBOR mailing list has been discussing the definition of Canonical in the standard and have been making changes. I wanted to document these and maybe discuss how they might be implemented in cbor2.

Updated draft standard: https://datatracker.ietf.org/doc/draft-ietf-cbor-7049bis

What constitutes a true canonical encoding can be redefined at will by protocol implementers, CBOR standard provides guidelines.

A general encoder/decoder like cbor2 will need to support a number of variations and validate them.

Constraints which have been discussed on the ietf mailing list and in the updated draft:

There may be tagged arrays created for fixed length binary encodings of float values. (Tag values TBD)

See: https://datatracker.ietf.org/doc/draft-ietf-cbor-array-tags/

Decoders may need to validate these by raising errors if the following conditions are met:

Instead of a single canonical=True argument there needs to be separate flags for each potential constraint.

For example, if a device expects only 16bit floating point data you could create the encoder like this:

encoder = CBOREncoder(f, float_format="binary16")
encoder.encode(data)

Or for a minimal float encoding and sorted maps using the encoded length

encoder = CBOREncoder(f, float_format="minimal", sort_maps=True, sort_by_length=True)
encoder.encode(data)

On the decoding side:

decoder = CBORDecoder(f, validate_floats_as="binary16")
result = decoder.decode()
decoder = CBORDecoder(f, validate_floats_as="minimal",
    validate_map_order=True,
    ordered_by_length=True,
    ignore_duplicate_keys=False)
result = decoder.decode()

Of course these argument names and the way they are set up are just intended as an example.

agronholm commented 6 years ago

At this point I think you should become the maintainer of cbor2. How about it?

Sekenre commented 6 years ago

I'd be honoured! I am inexperienced but very keen :grinning:

fsssosei commented 6 years ago

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

agronholm commented 6 years ago

What code where?

Sekenre commented 6 years ago

The code seems to be a long integer time complexity O (n ^ 2)? Let's say I have a 10,000 bit integer

Hi @fsssosei could you open a new issue for this and I will look at it when I get the chance?

henryk commented 2 years ago

To add some food for thought: Here is an article discussing the different "canonical" encodings in CBOR: https://www.imperialviolet.org/2022/04/17/canonsofcbor.html It also proposes a naming scheme. RFC 7049 seems to describe "three-step" ordering (but could be read ambigously), RFC 8949 describes "one-step" ordering.

Best as I can tell, cbor2 currently implements three-step ordering. For starters, the documentation could point out the different ways a "canonical" CBOR can be canonical, and document the current state in the library.

Sekenre commented 2 years ago

Absolutely. Thanks @henryk for bringing that article to my attention. I have been playing around with splitting the canonical settings into their own options, i.e. the 3 options for map ordering, fixed or variable-sized floating point. Etc. Then have a backwards compatible default.

Of course it's easy in python, harder in C. I don't think I will implement validating whether something is canonical, but will document how someone could do so.