cbor / cbor.github.io

cbor.io web site

Indicate what implementations support canonical #21

Closed: jkbbwr closed this issue 2 years ago

jkbbwr commented 7 years ago

It would be really handy to know which implementations support canonical encoding, and how to use it in each of them.

cabo commented 7 years ago

Very few use cases are publicly known that would need canonical encoding, so little attention has been given to that in implementations. Can you talk about yours?

greglook commented 7 years ago

My primary use-case for canonical encoding is in content-addressed storage systems. For example, I want to encode some values into blocks, which are identified by the SHA2-256 of their byte content. If I serialize an equivalent value again in the future, I want it to map to the same hash identifier so that the data is deduplicated.
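The deduplication idea above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: `encode_head`, `encode`, and `block_id` are hypothetical names, only integers, text strings, and maps are handled, and map keys are sorted using the RFC 7049 Section 3.9 rule (shorter encoded key first, then bytewise).

```python
import hashlib
import struct


def encode_head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus the shortest possible argument."""
    if value < 24:
        return bytes([(major << 5) | value])
    if value < 2**8:
        return bytes([(major << 5) | 24, value])
    if value < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", value)


def encode(value) -> bytes:
    """Canonically encode a small subset of CBOR (hypothetical helper)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, int):
        if value >= 0:
            return encode_head(0, value)      # major type 0: unsigned int
        return encode_head(1, -1 - value)     # major type 1: negative int
    if isinstance(value, str):
        data = value.encode("utf-8")
        return encode_head(3, len(data)) + data
    if isinstance(value, dict):
        pairs = [(encode(k), encode(v)) for k, v in value.items()]
        # RFC 7049 canonical order: shorter encoded key first, then bytewise.
        pairs.sort(key=lambda kv: (len(kv[0]), kv[0]))
        return encode_head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def block_id(value) -> str:
    """Content address of a value: SHA-256 over its canonical encoding."""
    return hashlib.sha256(encode(value)).hexdigest()


# Two equal maps built in different insertion orders.
first = block_id({"a": 1, "b": 2})
second = block_id({"b": 2, "a": 1})
```

Because the keys are sorted before writing, `first` and `second` are the same identifier, so the second write would deduplicate against the first.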

jkbbwr commented 7 years ago

Same as listed above: SHA-256 hashes of mixed-type maps.

ugorji commented 7 years ago

The Go library github.com/ugorji/go/codec supports canonical encoding.

cabo commented 7 years ago

Thank you all. I haven't quite figured out how to add this information to this site. However, the IETF CBOR WG will soon embark on an implementation report in preparation for moving CBOR along on the Internet standards track, and that will contain information of this kind. So I'm leaving this as a future enhancement for now.

jkbbwr commented 7 years ago

If you have any say, can canonical formatting be part of the specification itself? Not every implementation has to support it, but a canonical serialization format for hashing would be perfect.

greglook commented 7 years ago

The clj-cbor implementation supports canonical encoding, for what it's worth.

cabo commented 7 years ago

@jkbbwr Section 3.9, "Canonical CBOR", of RFC 7049 is already part of the specification. Maybe it gives a little too much leeway on e.g. floating point encoding. There have also been discussions on whether there should be a canonical encoding with a different sort order for map keys. All this points to having multiple "canonical" encodings, which may violate the POLS (principle of least surprise).
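To make the sort-order concern concrete, here is a small sketch comparing two candidate orderings over already-encoded map keys. The first follows the RFC 7049 Section 3.9 rule (shorter encoding first, then bytewise). The second, a plain bytewise comparison, is assumed here as the alternative; the comment above does not say which alternative was discussed. The function names and the hand-encoded key bytes are illustrative only.

```python
# Hand-encoded CBOR forms of four map keys: 10, 100, -1 and "z".
KEY_10 = bytes.fromhex("0a")
KEY_100 = bytes.fromhex("1864")
KEY_MINUS_1 = bytes.fromhex("20")
KEY_Z = bytes.fromhex("617a")

keys = [KEY_Z, KEY_MINUS_1, KEY_100, KEY_10]


def length_first_order(encoded_keys):
    """RFC 7049 Section 3.9: shorter keys first, ties broken bytewise."""
    return sorted(encoded_keys, key=lambda k: (len(k), k))


def bytewise_order(encoded_keys):
    """Assumed alternative: plain bytewise lexicographic comparison."""
    return sorted(encoded_keys)


by_length = length_first_order(keys)   # 10, -1, 100, "z"
by_bytes = bytewise_order(keys)        # 10, 100, -1, "z"
orders_differ = by_length != by_bytes
```

The two rules place 100 and -1 in opposite order, so the same map would serialize to different bytes, and hash differently, depending on which "canonical" form an implementation picked.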

(For spec issues, please go to https://github.com/cbor-wg/CBORbis/issues -- we are doing another run now in the CBOR WG at clearing out editorial issues and other things that need to be done to get "Internet Standard" status.)

greglook commented 7 years ago

Another note on canonical mode that came up while I was implementing the spec: Clojure has sets (unordered collections of distinct elements) as a basic data type, but they don't appear in the CBOR spec. When encoded canonically, set elements should be treated the same way map keys are; that is, written in sorted order so that a set with the same elements is rendered the same each time. This turned out to be cleaner to do by promoting sets to a more first-class treatment than using tagged value handlers.

I think that minimally, the 'set' type should have a standard single-byte tag value (< 24) given the prevalence in language data types. It would also be good to specify the canonical treatment of certain tagged values like sets.
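A minimal sketch of the set treatment described above: the elements are encoded, sorted the same way canonical map keys are, and written as a tagged array. `SET_TAG`, `head`, `encode_item`, and `encode_set` are hypothetical names; the tag number 258 is an assumption for illustration (to my knowledge it is the number later registered for finite sets, but nothing in this thread fixes it), and only non-negative integers and text strings are handled.

```python
import struct

SET_TAG = 258  # assumption: illustrative tag number for "set"


def head(major: int, value: int) -> bytes:
    """Encode a CBOR initial byte plus the shortest possible argument."""
    if value < 24:
        return bytes([(major << 5) | value])
    if value < 2**8:
        return bytes([(major << 5) | 24, value])
    if value < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", value)
    if value < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", value)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", value)


def encode_item(value) -> bytes:
    """Encode one set element (non-negative ints and text strings only)."""
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return head(0, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return head(3, len(data)) + data
    raise TypeError(f"unsupported element type: {type(value).__name__}")


def encode_set(items) -> bytes:
    """Write a set as a tagged array whose elements are in sorted order."""
    # Same rule as canonical map keys: shorter encoding first, then bytewise.
    encoded = sorted((encode_item(i) for i in items), key=lambda b: (len(b), b))
    return head(6, SET_TAG) + head(4, len(encoded)) + b"".join(encoded)


# Equal sets produce identical bytes regardless of iteration order.
same = encode_set({3, 1, 2}) == encode_set([2, 3, 1])
```

Sorting the encoded elements rather than the native values avoids having to define an ordering across mixed element types.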

cabo commented 7 years ago

On Apr 14, 2017, at 17:29, Greg Look <notifications@github.com> wrote:

> Another note on canonical mode that came up while I was implementing the spec: Clojure has sets (unordered collections of distinct elements) as a basic data type, but they don't appear in the CBOR spec. When encoded canonically, set elements should be treated the same way map keys are; that is, written in sorted order so that a set with the same elements is rendered the same each time. This turned out to be cleaner to do by promoting sets to a more first-class treatment than using tagged value handlers.

Right. We didn’t include sets in CBOR’s base set of containers because they aren’t in JSON either. Tagging an array as the representation of a set makes a lot of sense for those environments that have these as first-class containers.

> I think that minimally, the 'set' type should have a standard single-byte tag value (< 24) given the prevalence in language data types.

Well, we have very few of those single-byte tags left, and since sets are typically large already, a two-byte tag sounds more appropriate.
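The size difference between single-byte and two-byte tags comes from how CBOR encodes the tag number in the head. The sketch below uses a hypothetical helper, `tag_head`, to show the head length for a few tag numbers: values below 24 fit in the initial byte, values up to 255 need one extra byte, and larger values need more.

```python
import struct


def tag_head(tag: int) -> bytes:
    """Encode the head of a CBOR tag (major type 6) in its shortest form."""
    initial = 6 << 5  # 0xc0
    if tag < 24:
        return bytes([initial | tag])
    if tag < 2**8:
        return bytes([initial | 24, tag])
    if tag < 2**16:
        return bytes([initial | 25]) + struct.pack(">H", tag)
    if tag < 2**32:
        return bytes([initial | 26]) + struct.pack(">I", tag)
    return bytes([initial | 27]) + struct.pack(">Q", tag)


# Head length in bytes for tag numbers on either side of each boundary.
sizes = {tag: len(tag_head(tag)) for tag in (23, 24, 255, 256)}
```

For a set with many elements, the one extra byte of a two-byte tag is small relative to the payload, which is the trade-off being pointed at here.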

> It would also be good to specify the canonical treatment of certain tagged values like sets.

Yes… all tags would need to define their canonical representation — we didn’t open that can of worms in RFC 7049, but probably would have to for more extensive use of canonical.

Grüße, Carsten