cancergenetrust / cgtd

Decentralized distributed database of genomic and clinical data.
http://www.cancergenetrust.org
Apache License 2.0

Ensure submissions JSON is canonical so hashes are consistent #16

Open rcurrie opened 7 years ago

rcurrie commented 7 years ago

After GA4GH Vancouver, Ravi suggested making doubly sure that the JSON we hash is canonical. We currently sort the keys so that the same data always produces the same hash, but there may still be several implicit dependencies on how the underlying Python generates JSON from a dictionary:

http://stackoverflow.com/questions/4670494/how-to-cryptographically-hash-a-json-object
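
A minimal sketch of what pinning down those implicit dependencies could look like with the standard library alone. The function names `canonical_json` and `submission_hash` are hypothetical, and SHA-256 is an assumption used only for illustration; this is not the code currently in cgtd. It fixes key order, separators, and output encoding, and rejects NaN/Infinity, but it does not address float formatting or Unicode normalization.

```python
import hashlib
import json


def canonical_json(obj):
    """Serialize obj to bytes with as few implicit choices as possible.

    Hypothetical helper, not part of cgtd.
    """
    text = json.dumps(
        obj,
        sort_keys=True,          # key order no longer depends on dict order
        separators=(",", ":"),   # no optional whitespace
        ensure_ascii=False,      # emit non-ASCII characters as-is
        allow_nan=False,         # NaN/Infinity are not valid JSON, so refuse them
    )
    return text.encode("utf-8")  # hash bytes, not a str


def submission_hash(obj):
    """Hex digest of the canonical serialization (SHA-256 is an assumption)."""
    return hashlib.sha256(canonical_json(obj)).hexdigest()


# Two dictionaries with the same content but different insertion order
first = {"b": 1, "a": [1, 2]}
second = {"a": [1, 2], "b": 1}
same_hash = submission_hash(first) == submission_hash(second)
```

With these settings both dictionaries serialize to the same bytes, so `same_hash` is true regardless of the order in which the keys were inserted.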

rht commented 7 years ago

Actually, this should be in the http://ipld.io/ spec:

The IPLD Canonical format is canonicalized CBOR with tags. The canonical CBOR format must follow rules defined in RFC 7049 section 3.9 in addition to the rules defined here. ...

rht commented 7 years ago

@rcurrie I believe the line in question is https://github.com/ga4gh/cgtd/blob/f90a50672a2d3abf3132e8069f791e1a599432ae/cgtd/cgtd.py#L259 (the question is whether json.dumps sorts the keys there, and it does not).
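
For reference, a small sketch of why the key sort matters, assuming Python 3.7+ where dictionaries keep insertion order. The variable names are made up for illustration and this is not the cgtd code at the linked line. Without `sort_keys=True`, `json.dumps` writes keys in whatever order the dictionary holds them, so equal data can serialize differently.

```python
import json

# Same content, different insertion order
record = {"b": 1, "a": 2}
same_record = {"a": 2, "b": 1}

# Default json.dumps follows the dictionary's own key order
default_differs = json.dumps(record) != json.dumps(same_record)

# With sort_keys=True the serialized strings match
sorted_matches = (
    json.dumps(record, sort_keys=True)
    == json.dumps(same_record, sort_keys=True)
)
```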

This can be solved in one of three ways:

1. (Fastest for now?) Use jsonld.normalize from https://github.com/digitalbazaar/pyld. To avoid confusion: 'normalization' in JSON-LD nomenclature actually refers to 'canonicalization' (see https://github.com/json-ld/normalization/issues/2).
2. Build an IPLD object directly, then serialize it.
3. Check the solution devised by mediachain (see https://github.com/mediachain/aleph).