cognitect / transit-format

A data interchange format.

Please clarify integer encoding rules #15

Closed jlouis closed 10 years ago

jlouis commented 10 years ago

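There are three ways to represent an integer in Transit. Taking 1234 as the example: as a plain JSON number (1234), as a string tagged with ~i ("~i1234"), or as a string tagged with ~n ("~n1234").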

Certain limitations govern which representation can be used. In a string context, for instance, I have to use the ~i string variant, and for a large number only the ~n variant suffices as per the rules. These situations are easy, because there is only one possible representation.

But say I want to represent the above number, 1234. There are now three possible ways to represent it, namely the above three. Which one is canonical? Should a parser be ready to accept any representation?

The reason I raise this question is a parsing problem in ints_interesting.json. The really interesting values are:

[...,
 "~i4611686018427387906", // A
 "~n9223372036854775806", // B
 "~n9223372036854775807", // C
 "~n9223372036854775808", // D
 "~n9223372036854775809", // E
 ...]

Note that 2^63 - 1 = 9223372036854775807 is the largest integer representable as a signed 64-bit value (one bit goes to the sign). This means that the values D and E don't fit in a signed 64-bit value, whereas the values A, B, and C do.
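The range claim can be checked with a few lines of Python, whose integers are arbitrary precision. This is only a sketch: the helper name fits_int64 is made up for illustration and is not part of any Transit kit.

```python
# Signed 64-bit range: one bit is the sign, leaving 63 bits of magnitude.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1  # 9223372036854775807


def fits_int64(n: int) -> bool:
    """Hypothetical helper: True if n is representable as a signed 64-bit value."""
    return INT64_MIN <= n <= INT64_MAX


# The five values from ints_interesting.json, labeled as in the listing above.
values = {
    "A": 4611686018427387906,
    "B": 9223372036854775806,
    "C": 9223372036854775807,
    "D": 9223372036854775808,
    "E": 9223372036854775809,
}

fits = {label: fits_int64(v) for label, v in values.items()}
print(fits)
```

A, B, and C come out as fitting; D and E do not.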

So, why does the current example file represent them as ~n values when ~i would do? Should parsers be ready to accept this or is there a canonical least representation which they should pick?

timewald commented 10 years ago

Transit has two integer types - integer and big integer.

Integer is a signed 64-bit value. In msgpack it is represented as the smallest integer type that can store the value (possibly just a byte). In JSON, it is represented as a number unless it requires more than 53 bits, in which case it is encoded as a string tagged with i.
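A minimal Python sketch of the JSON rule for Transit integers might look like the following. The function name encode_integer_json is hypothetical, and the exact cutoff used here (magnitude at most 2^53 - 1) is an assumption about what "requires more than 53 bits" means. It is not taken from any particular kit.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1

# Assumption: "fits in 53 bits" is read as |n| <= 2**53 - 1, the range in
# which an IEEE 754 double (the usual JSON number) is exact.
JSON_INT_MAX = 2**53 - 1
JSON_INT_MIN = -JSON_INT_MAX


def encode_integer_json(n: int):
    """Hypothetical writer for a Transit integer (not a big integer) in JSON."""
    if not INT64_MIN <= n <= INT64_MAX:
        raise ValueError("does not fit in a signed 64-bit Transit integer")
    if JSON_INT_MIN <= n <= JSON_INT_MAX:
        return n              # emitted as a plain JSON number
    return "~i" + str(n)      # emitted as a string tagged with i


print(encode_integer_json(1234))
print(encode_integer_json(4611686018427387906))
```

Under these assumptions 1234 stays a plain number, while value A from the question is written as "~i4611686018427387906".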

Big integer is an arbitrary precision value and must be used for anything that won't fit in integer. It is always tagged with n.

Given the number 1234, it can be represented in any of these forms. In Java and Clojure, the choice is based on the type of the value passed in - if it is a long (or short, int, or byte) it maps to a Transit integer. If it is a java.lang.BigInteger or clojure.lang.BigInt, it maps to a Transit big integer. It is legal in these languages to represent 1234 as a big integer, and Transit will marshal it marked with n.

In languages that use dynamically-sized integers, like Ruby and Python, where the application programmer does not control the type, the mapping is based on the value of the integer. In this case, 1234 would always go as an integer and, because it fits in 53 bits, it would never be encoded with i.
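For a dynamically-sized integer language, the value-based mapping could be sketched roughly as below. The names transit_type_for and encode_by_value are invented for this example, and the 53-bit cutoff is the same assumption as before (magnitude at most 2^53 - 1). The actual Ruby and Python kits may draw the boundaries differently.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
JSON_SAFE_MAX = 2**53 - 1  # assumed cutoff for a plain JSON number


def transit_type_for(n: int) -> str:
    """Hypothetical: choose the Transit type from the value alone."""
    return "integer" if INT64_MIN <= n <= INT64_MAX else "big integer"


def encode_by_value(n: int):
    """Hypothetical JSON writer that picks the representation by value."""
    if transit_type_for(n) == "big integer":
        return "~n" + str(n)   # big integer: always tagged with n
    if abs(n) <= JSON_SAFE_MAX:
        return n               # small integer: plain JSON number
    return "~i" + str(n)       # 64-bit integer too wide for a JSON number


for n in (1234, 2**63 - 1, 2**63):
    print(n, "->", repr(encode_by_value(n)))
```

With this mapping each value has exactly one output form, so 1234 is never written with i or n. That differs from the Java and Clojure approach described above, where the type of the value passed in decides.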

Readers should support any integer passed unencoded, or encoded with i or n. That is, "~i8" should decode correctly - even though there is never a case (today) where a writer will send that. There is no reason not to support it, and indeed it complicates things to do otherwise.
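On the reading side, accepting all three forms is short. The sketch below assumes the JSON has already been parsed into Python values, and the name decode_integer is hypothetical. Because Python integers are arbitrary precision, both tags decode to the same int type here. A kit for a language with fixed-width integers would need to pick a target type instead.

```python
def decode_integer(value):
    """Hypothetical reader: accept a plain number, "~i..." or "~n..."."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("expected an integer, got a boolean")
    if isinstance(value, int):
        return value                      # unencoded JSON number
    if isinstance(value, str) and value[:2] in ("~i", "~n"):
        return int(value[2:])             # tagged string, any magnitude
    raise ValueError(f"not a Transit integer: {value!r}")


print(decode_integer(8))
print(decode_integer("~i8"))
print(decode_integer("~n9223372036854775807"))
```

All three calls succeed, including "~i8", which no writer is expected to produce today.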

The file you mention contains different encodings of numbers at the edge of what can be represented with a given Transit type. It is not intended to demonstrate the best way or required way to send something, but only to test that readers handle these different possibilities.

If you are implementing a kit, pick a mapping to integer and big integer that makes sense for your language. As noted, the Java and Clojure Transit kits work fundamentally differently from the Ruby and Python Transit kits in this regard.