amazon-ion / ion-docs

Source for the GitHub Pages for Ion.
https://amazon-ion.github.io/ion-docs/
Apache License 2.0
23 stars 23 forks source link

Require consistent negative zero encoding for ion-hash's sake #133

Open zslayton opened 4 years ago

zslayton commented 4 years ago

Originally from: https://github.com/amzn/ion-js/issues/474

Surfacing this here since we haven't discussed it recently and are adding more implementations of ion-hash.

While ion-js's output does differ from ion-java's, both libraries are writing valid representations of equivalent Ion values. See this example from the binary format section of the Decimals spec:

The following binary encodings of decimal values are equivalent to -0d0 (but not to 0d0).

+-----------------+------------+-------------+
| type descriptor |  exponent  | coefficient |
|                 |  (VarInt)  |    (Int)    |
+-----------------+------------+-------------+

Explicit encoding of (negative)0d0
+-----------------+------------+-------------+
:      0x52       :    0x80    :    0x80     |
+-----------------+------------+-------------+

Explicit encoding of (negative)0d(negative)0
+-----------------+------------+-------------+
:      0x52       :    0xC0    :    0x80     |
+-----------------+------------+-------------+

While we can/should decide on a canonical format for our own libraries to emit, we should also clarify in the spec that writers MUST emit that same format if that's required to enable ion-hash.

In the meantime, I'll look into adding special cases for ion-js.

Originally posted by @zslayton in https://github.com/amzn/ion-js/issues/474#issuecomment-546487916

PeytonT commented 3 years ago

I get the impression that the issue of inconsistent valid encodings also exists for positive zero.

The spec for binary decimals gives

The subfield [coefficient] should not be present (that is, it has zero length) when the coefficient’s value is (positive) zero.

If the value is 0. (aka 0d0) then L is zero, there are no length or representation fields, and the entire value is encoded as the single byte 0x50.

Since the first is a should rather than a must guidance, writers are permitted to write positive zero with or without a coefficient subfield.

The binary documentation makes this explicit later on in the Timestamps section, and even uses 0d0 as the example of a decimal value being valid with implicit or explicit zero coefficient.

The fraction_exponent and fraction_coefficient denote the fractional seconds of the timestamp as a decimal value... A missing coefficient defaults to zero... The following hex encoded timestamps are equivalent: ...

69 80 0F D0 81 81 80 80 80 80    // The same instant with 0d0 fractional seconds and implicit zero coefficient
6A 80 0F D0 81 81 80 80 80 80 00 // The same instant with 0d0 fractional seconds and explicit zero coefficient

The Ion-Hash decimal spec indicates that the coefficient subfield is required to not be present for positive zero values.

The coefficient subfield shall not be present (that is, it has zero length) when the coefficient’s value is (positive) zero.

And requires the special compact encoding of 0d0.

If the value is 0. (aka 0d0) there is no representation subfield.


While we can/should decide on a canonical format for our own libraries to emit, we should also clarify in the spec that writers MUST emit that same format if that's required to enable ion-hash.

Notably, changing to MUST guidance was not the approach taken for the substantially similar issue https://github.com/amzn/ion-docs/issues/93.

It does not seem here that changes to the Ion specification are necessary to enable ion-hash. For both positive and negative zero, the issue of inconsistent valid representations could be resolved by requiring conversion to a particular canonical form before hashing, in the same way that the ion-hash spec resolves the issue of padded UInt/Int/VarUInt/VarInt fields.

Any UInt, Int, VarUInt, or VarInt fields are encoded in the manner defined in the Amazon Ion Binary Encoding specification with one difference: such fields shall encoded in minimal fashion – no padding is allowed.

tgregg commented 3 years ago

the issue of inconsistent valid representations could be resolved by requiring conversion to a particular canonical form before hashing, in the same way that the ion-hash spec resolves the issue of padded UInt/Int/VarUInt/VarInt fields.

This sounds right to me.