bigchaindb / py-ipld

Python implementation of the IPLD specification.
MIT License
14 stars 7 forks source link

IPLD Review #5

Closed dignifiedquire closed 8 years ago

dignifiedquire commented 8 years ago

As mentioned in #1 I've reviewed the code for correctness regarding the IPLD spec.

Currently, all keys are being sorted. It is yet to be determined if this is compatible with other implementations.

This is not compatible with the JavaScript implementation and should not be done, key ordering should be preserved.

https://github.com/ipfs/js-ipld/blob/master/src/cbor.js#L81 Am I right in assuming that this is not supported in IPLD anymore, as links cannot have properties themselves?

Correct, this is a fragment from the earlier version of the spec.

Not sure why all IPLD implementations use sha2-256

The reason for this is that currently all of IPFS uses SHA2 256 as the default hash function.

TimDaub commented 8 years ago

Thanks for your review :+1:

This is not compatible with the JavaScript implementation and should not be done, key ordering should be preserved.

OK, this is a problem then. Try the following. Invert the sort_keys keyword-parameter in https://github.com/bigchaindb/py-ipld/blob/master/ipld/ipld.py#L32 and run the tests multiple times and you'll see that:

Performing list(d.keys()) on a dictionary returns a list of all the keys used in the dictionary, in arbitrary order [...].

I guess a Python expert can give better information here, but as far as I can tell there is no key order in a Python dictionary. It's always completely arbitrary.

/cc Summoning Python experts: @sbellem @vrde @r-marques @diminator

r-marques commented 8 years ago

Yes in python dictionaries have no order. If you want to preserve the order you can use an OrderedDict

TimDaub commented 8 years ago

Yes in python dictionaries have no order. If you want to preserve the order you can use an OrderedDict

Yeah, this would mean though that the user of py-ipld would always have to interact with the API using OrderedDict. I'd like to avoid that.

@dignifiedquire @jbenet Is there a reason why keys are supposed to be cbor'ed by their insertion-order? To me it would actually make sense to change the IPLD specification in that regard, as more programming languages might not care for the insertion-order. Alphabetical order however is something rather unambiguous and easy to implement.

dignifiedquire commented 8 years ago

I think the main goal is to avoid to force developers into using anything they do not need to do for their application. It should be fine if you sort the keys as long as you don't expect them to be sorted as there might be other languages that do not sort them.

I would really like to avoid adding to the spec any kind of enforcement about key ordering as we want to be compatible with as many JSON documents as possible and this would be a major breakage in this regard.

TimDaub commented 8 years ago

I think I'm kind of confused:

Reading section 3.9 of the CBOR spec it says:

  The keys in every map must be sorted lowest value to highest.
  Sorting is performed on the bytes of the representation of the key
  data items without paying attention to the 3/5 bit splitting for
  major types.  (Note that this rule allows maps that have keys of
  different types, even though that is probably a bad practice that
  could lead to errors in some canonicalization implementations.)
  The sorting rules are:

  *  If two keys have different lengths, the shorter one sorts
     earlier;

  *  If two keys have the same length, the one with the lower value
     in (byte-wise) lexical order sorts earlier.

I looked into the cbor node (https://github.com/hildjj/node-cbor) implementation (which is in coffee script :scream:) but wasn't able to conclude if it complies with the section quoted above (- it doesn't look like it: https://github.com/hildjj/node-cbor/blob/master/src/encoder.coffee#L256).

So @dignifiedquire how does the js-ipld reference implementation make sure to be compliant with the IPLD specification (referring to this: https://github.com/ipfs/specs/tree/master/ipld#canonical-format).

For this implementation though, we'll have to make sure that https://bitbucket.org/bodhisnarkva/cbor/src complies with the quoted section above to make py-ipld work, right?

dignifiedquire commented 8 years ago

:blush: well I probably should have read the cbor spec more careful cough

crawls back under his rock

I will check the code cbor code if it sorts, but yes according to the spec you should not need to sort before encoding using the cbor library as it will sort by byte representation.

TimDaub commented 8 years ago

Wikipedia says (super credible source, I know - best I could find for now though):

Sorting a set of UTF-8 encoded strings as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by codepoint.

So by assuming that this statement is true and since we're retrieving UTF-8 encoded strings from a dictionary (in Python 3 all keys of a dictionary are UTF8), py-ipld is compliant with the cbor rfc section 3.9, hence also with the IPLD spec:

The keys in every map must be sorted lowest value to highest.

Please correct me if I'm wrong!

TimDaub commented 8 years ago

@dignifiedquire Any updates?

dignifiedquire commented 8 years ago

sorry @TimDaub I was on holidays and just starting to work through my backlog. I will get to this tomorrow

TimDaub commented 8 years ago

Hope you had fun :)

The PRs #10 and #11 might also be interesting for you.

dignifiedquire commented 8 years ago

Okay, so it looks like the JavaScript implementation is not sorting the key, nor does it expose a way to do it, this will need to be fixed: https://github.com/ipfs/js-ipld/issues/19

Looking at the python implementation, all things should be a okay when you pass sort_keys=true.

To ensure compatibility between different implementations we should start having a set of ipld objects with expected hashes as reference in a repository so we can pull them down and check against exactly these issues.

TimDaub commented 8 years ago

To ensure compatibility between different implementations we should start having a set of ipld objects with expected hashes as reference in a repository so we can pull them down and check against exactly these issues.

👍

Also, thanks so much for resolving this :)

TimDaub commented 8 years ago

All good here.

Closing this.