[Question] Reproducibility and ordering of fields

amazon-ion / ion-go

A Go implementation of Amazon Ion.

https://amazon-ion.github.io/ion-docs/

Apache License 2.0


[Question] Reproducibility and ordering of fields #200

Closed ivanjaros closed 2 months ago

ivanjaros commented 2 months ago

JSON gives no guarantees for the ordering of maps and objects, only arrays. This can cause non-reproducible outputs for storage and hashing purposes.

Since Ion is used in Amazon Quantum Ledger Database, I would expect ordering to be handled in a way that results in reproducible results.

Is that the case?

popematt commented 2 months ago

Ion defines equivalence in an order-independent way, and the Ion Hash algorithm is order-agnostic when it comes to struct fields.

When comparing results using the Ion data model, the results are always reproducible. The result of Ion Hash is always reproducible. So, QLDB guarantees reproducible results in that the results are equivalent (not identical).

The distinction between "identical" and "equivalent" is important here. Two serialized values may be equivalent but not identical because there are a variety of valid encodings for many values (e.g. 1.0 is equivalent to 10d-1). In-memory values (i.e. Ion data model objects) are agnostic of the serialized form, and direct comparison between them is effectively the same as checking for equivalence.
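To make the "equivalent but not identical" point concrete, here is a minimal Go sketch that hashes two serialized forms of the same struct as raw bytes, with no Ion parsing at all. It uses only the standard library; `rawDigest` is a hypothetical helper name, not part of ion-go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rawDigest (hypothetical helper) hashes the serialized text exactly as
// given. It knows nothing about Ion, so whitespace, field order, and the
// spelling of a decimal all affect the result.
func rawDigest(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Two texts that are equivalent in the Ion data model but differ
	// byte for byte.
	a := rawDigest("{foo:1.0,bar:2}")
	b := rawDigest("{ bar: 2, foo: 10d-1 }")

	fmt.Println(a)
	fmt.Println(b)
	fmt.Println("raw digests match:", a == b) // prints "raw digests match: false"
}
```

The two digests differ because the byte sequences differ, even though both texts describe the same Ion value.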

ivanjaros commented 2 months ago

Hm, that is what I thought (since it states to be a superset of JSON). This is problematic for using ion, or json, as storage format because you cannot guarantee the hash of the content will match, even if no data has changed, only the language/interpreter that produces it.

Thanks for the confirmation.

popematt commented 2 months ago

> This is problematic for using ion, or json, as storage format because you cannot guarantee the hash of the content will match, even if no data has changed, only the language/interpreter that produces it.

This is why we developed the Ion Hash algorithm that I mentioned. Using the Ion CLI, we can see, for example, how Ion Hash can produce a hash that is independent of the ordering of struct fields, whitespace, and the specific encoding of e.g. a decimal value.

$> echo "{foo:1.0,bar:2}" | ion -X hash sha-256
3272f0d8ca9c0252d0f218d516da65cff2129fff413d2c2a8ca8e7d9f13080c3

$> echo "{ bar: 2, foo: 10d-1 }" | ion -X hash sha-256
3272f0d8ca9c0252d0f218d516da65cff2129fff413d2c2a8ca8e7d9f13080c3

If you use the tooling developed for Ion, you can guarantee that the hash of the content will match.
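The order-independence shown above can be illustrated with a deliberately simplified Go sketch. This is not the Ion Hash algorithm and will not reproduce the digests printed by the Ion CLI. It only demonstrates one general way to make a digest insensitive to field order: hash each field on its own, sort the per-field digests, then hash the sorted sequence. The `field` type and `structDigest` function are hypothetical names, and values are assumed to be already normalized strings (so it does not treat 1.0 and 10d-1 as equal).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// field is a hypothetical flat name/value pair standing in for a struct field.
type field struct {
	name  string
	value string
}

// structDigest (hypothetical) returns a digest that does not depend on the
// order in which fields are supplied.
func structDigest(fields []field) string {
	digests := make([][]byte, 0, len(fields))
	for _, f := range fields {
		h := sha256.New()
		// Length-prefix the name so that the name/value boundary is
		// unambiguous.
		fmt.Fprintf(h, "%d:%s=%s", len(f.name), f.name, f.value)
		digests = append(digests, h.Sum(nil))
	}

	// Sorting the per-field digests removes any dependence on input order.
	sort.Slice(digests, func(i, j int) bool {
		return bytes.Compare(digests[i], digests[j]) < 0
	})

	outer := sha256.New()
	for _, d := range digests {
		outer.Write(d)
	}
	return hex.EncodeToString(outer.Sum(nil))
}

func main() {
	a := structDigest([]field{{"foo", "1.0"}, {"bar", "2"}})
	b := structDigest([]field{{"bar", "2"}, {"foo", "1.0"}})
	fmt.Println("digests match:", a == b) // prints "digests match: true"
}
```

A real implementation also has to canonicalize each value and handle nesting, annotations, and type information, which is what the Ion tooling takes care of.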

ivanjaros commented 2 months ago

That is very good, but that also means it has to parse the data before hashing it, which imposes a significant performance penalty compared to merely hashing raw bytes.