decentralized-identity / sidetree

Sidetree Specification and Reference Implementation
https://identity.foundation/sidetree/spec
Apache License 2.0
437 stars 112 forks source link

Consider JSON Stringify Deterministic #105

Closed OR13 closed 5 years ago

OR13 commented 5 years ago

JSON.stringify() is not safe to use with CAS when the object in question is deep, it is possible to get a different hash for an object that is deep equivalent.

https://www.npmjs.com/package/json-stringify-deterministic

thehenrytsai commented 5 years ago

Hi @OR13, please ignore the incorrect referenced bug fix referenced above.

I do not understand the issue. Currently the CAS layer operates on Buffer and makes no assumptions on how the buffer is constructed. The content (e.g. anchor or batch file) is written only once to CAS by a rooter, no other instances/entities ever need to reproduce the exact serialization/file without fetching the file itself directly from CAS which would yield the same hash.

OR13 commented 5 years ago

https://github.com/decentralized-identity/sidetree-core/blob/761cfd6277a31f16d802ec240e0b16e7b09b72fd/src/BatchFile.ts#L31

Its true that sha256(Buffer.from(JSON.stringify(someDeepObject))) will generate a contentAddress, and so long as some other system does not try and derive the same address from their version of someDeepObject, it will work fine.

I have encountered issues with this in other projects, where a client constructs an object and then attempts to verify its existence in another system by generating a contentAddress for the object.

https://stackoverflow.com/questions/42491226/is-json-stringify-deterministic-in-v8

It may not be an issue with the way sidetree is being used today, I just worry about this sort of thing in protocols / standards.

JSON-LD has a similar challenge, and uses a much more complicated way to achieve a canonical representation of a json document: https://github.com/digitalbazaar/jsonld.js/#canonize-normalize

These things matter more when you have protocol implementations in multiple languages, for example mastodon uses ActivityPub and JSON-LD Signatures, and canonizes their objects using the method above before signing them:

https://github.com/tootsuite/mastodon/blob/cabdbb7f9c1df8007749d07a2e186bb3ad35f62b/app/lib/activitypub/linked_data_signature.rb#L55

See also: https://stackoverflow.com/questions/4670494/how-to-cryptographically-hash-a-json-object

In the future, when sidetree supports Swarm, Storj, Sia, Madesafe, subspace or any of the other content addressing storage systems out there, how you construct the object, will determine the buffer, which will determine what content address you get.

Its possible this is not going to be an issue for sidetree, I'm just sensitive to this issue having been burned by it before, its also one of the main reasons I like JSON-LD (there is standard way of normalizing json documents with a context)

thehenrytsai commented 5 years ago

Thanks @OR13 for the useful info.

In principle, I do like the idea of having a standardized deterministic JSON serializer, but only if it is based on an industry standardized deterministic JSON serialization specification (if there is such a thing); depending a library without such spec can lead to variance/divergence if ported to different languages, or if the library breaks backwards compatibility for in future version, both would result in non-deterministic serialization. I do not want to tie Sidetree implementation to a particular version of a library indefinitely.

All is to say that I would rather not impose serialization rules that are unnecessary and keep things simple if possible. Since it is a non-issue for Sidetree in the foreseeable future, I will won't-fix this for now until it becomes an issue.

Will wait for your comments before closing.

OR13 commented 5 years ago

sorry for my delay. I think its totally fair to mark as won't fix. It may in practice never actually be an issue if users can not reconstructing objects locally, and are always working backwards from the contentID.