adobe / xdm

Experience Data Model
Creative Commons Attribution 4.0 International

Recommendation on supporting int64/long values #427

Open hiteshs opened 6 years ago

hiteshs commented 6 years ago

Discussions in the json-schema repository cover why a JSON Schema integer cannot be used to represent int64 or long numbers.

Is there a recommendation on how to define a schema that needs to support integers spanning the full range of int64/uint64?

The approach discussed in the json-schema repo has been to use strings and extend the format keyword to cover such integer/decimal types. A reference approach taken by Google (https://developers.google.com/discovery/v1/type-format) follows a similar line; however, the format values used there are not yet part of the standard.

Until a new draft of the spec is published to introduce these new formats, it would be good to have a recommendation in place.

fmeschbe commented 6 years ago

Well, JSON Schema is not the real problem, actually. JSON Schema has a type integer, which just says the Number has no fractional part. JSON itself is not a problem from a syntactic perspective either: the Number production merely defines how a number must be formatted.

The real problem is interoperability, and that stems mostly from how JavaScript is defined; see for example the June 2018 revision of the JavaScript language specification's Number Type, which is based on the double-precision 64-bit IEEE 754-2008 format. So in fact only integers of up to 53 bits can be represented precisely in any compliant JavaScript implementation.
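The 53-bit limit is easy to demonstrate in any JavaScript engine (a minimal sketch; `Number.MAX_SAFE_INTEGER` is 2^53 − 1):

```javascript
// Every JS Number is an IEEE 754 binary64 double, so integers above
// 2^53 - 1 (Number.MAX_SAFE_INTEGER) can no longer be represented exactly.
const max = Number.MAX_SAFE_INTEGER;         // 9007199254740991 (2^53 - 1)

console.log(max + 1 === max + 2);            // true: both round to 2^53
console.log(Number.isSafeInteger(max));      // true
console.log(Number.isSafeInteger(max + 1));  // false

// JSON.parse inherits the same limitation:
console.log(JSON.parse("9007199254740993")); // 9007199254740992
```

This is why an int64 value serialized as a plain JSON number can silently lose precision the moment it passes through a JavaScript-based consumer.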

IMHO the discussion in JSON Schema issue 361 provides the best approach to this problem: Using a string type with a format specification.

For example extending the JSON Schema Validation specification with new sections as follows:


6.2.X precision

The value of "precision" MUST be a number, representing an inclusive upper limit on the number of places for the fraction part of a numeric value.

If the instance is a number, or a string with format decimal, then this keyword validates only if the number of places in the fraction part of the instance's numeric value is less than or equal to the value of "precision".

7.3.X. Decimals

These attributes apply to string instances.

decimal : A string instance is valid against this attribute if it is a valid JSON string representation of a JSON Number according to the number production of RFC 8259, Section 6, without the exp part.

integer : A string instance is valid against this attribute if it is a valid JSON string representation of a JSON Number according to the number production of RFC 8259, Section 6, without the frac and exp parts.

The intent of the decimal and integer formats is the ability to represent exact decimal (and integer) values exceeding the limitations imposed by the double-precision IEEE 754 numbers underlying number and integer typed values.

The validation keywords of Schema Validation section 6.2 apply to the numeric value of the string.

Note: Technically an integer is a decimal with a zero precision. But in the interest of readability and ease of use, the integer format is also defined.
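To illustrate, under these hypothetical keywords (not part of any published JSON Schema draft) a monetary amount field could be declared as follows; the field name and bounds are chosen purely for illustration:

```json
{
  "price": {
    "type": [ "number", "string" ],
    "format": "decimal",
    "precision": 2,
    "minimum": 0,
    "maximum": "99999999999999999999.99"
  }
}
```

A value such as "12345678901234567890.55" would then validate exactly, even though it cannot be represented precisely as a binary64 double.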


Likewise the JSON Schema meta schema would need to be extended:

"nonNegativeInteger": {
    "type": [ "integer", "string" ],
    "format": "integer",
    "minimum": 0
},
...
"multipleOf": {
    "type": [ "number", "string" ],
    "format": "decimal",
    "exclusiveMinimum": 0
},
"maximum": {
    "type": [ "number", "string" ],
    "format": "decimal"
},
"exclusiveMaximum": {
    "type": [ "number", "string" ],
    "format": "decimal"
},
"minimum": {
    "type": [ "number", "string" ],
    "format": "decimal"
},
"exclusiveMinimum": {
    "type": [ "number", "string" ],
    "format": "decimal"
},
"precision": { "$ref": "#/definitions/nonNegativeInteger" },

We could already add this to our meta schemas today as an extension to the standard JSON Schema meta schema.

cmathis commented 6 years ago

@fmeschbe - What about just copying Google's approach and defining int64 (or long) values as type:string with a new format value of "int64"? We would update our existing data validators to know how to interpret this new format value.
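A sketch of that convention (hedged, since "int64" is not a standard JSON Schema format and the field name is illustrative):

```json
{
  "properties": {
    "viewCount": {
      "type": "string",
      "format": "int64",
      "description": "64-bit signed integer carried as a decimal string, e.g. \"9223372036854775807\""
    }
  }
}
```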

Whatever we decide, we need to update the table here ASAP so users can define the fields correctly.

fmeschbe commented 6 years ago

Technically speaking, the value range of int64 compared to a 53-bit integer is only about 1000 times larger (roughly 9E18 versus 9E15). I don't think format=int64 buys us much.

Compared to that, with format=decimal we get arbitrary-precision, exact decimals (BigDecimal in Java), and with format=integer we get arbitrarily sized exact integers (BigInteger in Java).

So it would be easy to just define a subschema for int64 with appropriate minimum and maximum values.
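Assuming the format=integer proposal above, such an int64 subschema might look like the following sketch; note that the bounds themselves are given as strings, since ±2^63 cannot be represented exactly as a binary64 number:

```json
{
  "definitions": {
    "int64": {
      "type": [ "integer", "string" ],
      "format": "integer",
      "minimum": "-9223372036854775808",
      "maximum": "9223372036854775807"
    }
  }
}
```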

All in all, I am not sure the story will end with int64, but with the format=(decimal|integer) proposal we are open to all needs.

lrosenthol commented 6 years ago

I'm a bit confused here...

The JSON Schema spec, as referenced, is very clear that NUMBER/Integer is arbitrary precision - which means that there is no need to specify if it is a "short" or a "long" integer. Clients will handle it as they are able to (though we should, of course, be sure to update our docs to make that clear to clients).

Using strings for numbers is just bad/wrong (esp. for security concerns) and we shouldn't be supporting that, except as a last resort. I am going through this same issue with the AEM Forms team and their desire to add "big decimal" support to PDF forms.

kstreeter commented 6 years ago

Hi @hiteshs do we have a specific case where a 64-bit integer is required, and a 53-bit integer is not sufficient?

As you and others have noted, there simply isn't an interoperable way to represent the full range of a 64-bit integer in Javascript/JSON. Anything we define is going to be limited to processors that understand our proprietary extensions, and have the capability to operate on the extended range. So we need to consider the implications.

A proprietary 64-bit integer format is only going to be useful in cases where integer values are used, 53 bits is too narrow, but the values never exceed 64 bits. (This also assumes a signed value... is there a need for an unsigned value?) I'm not sure what cases this applies to.

We have discussed support for BigDecimal numbers (which @lrosenthol mentions). Do we need both a 64-bit integer and a BigDecimal? Or if we supported BigDecimal would that cover the cases that we think require a 64-bit integer?

I definitely think we should outline a concrete use case before introducing a new, proprietary data type. That would help answer all of these questions.

hiteshs commented 6 years ago

I don't have a specific use-case on where the 53-bit space is insufficient. My main concern is that existing users already use long/bigint/int64 in various systems (Hadoop, relational DBs).

Using a decimal or big decimal also has potential performance overheads (both compute and memory).

Do we plan to make the transition easier for existing users of long/int64? An explicit int64 type also makes life simpler for application developers. Having only decimal places a heavy burden on applications to be smart about which primitive types to use for the necessary performance optimizations.

kstreeter commented 6 years ago

@hiteshs, I believe that we already have an approach: in XDM the equivalent of a long/int64 in other systems is an "integer" type. The limitation is that this type is only 53-bits wide. Users of int64 in other systems should use "integer" in XDM, being aware of the narrower range.

The only reason this approach would not work is if there are cases where values exceed the 53-bit space. But we haven't identified any such cases.

As has been described in this thread, JSON simply doesn't interoperably support a full 64-bit integer. That means the only way we can define one is to create an encoding into a type that JSON does support, which will necessarily put burden on applications and tools to handle this proprietary extension. So we shouldn't do it without careful consideration, and it is difficult to do that without a use case.
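Under that approach, a schema could make the interoperable range explicit by pinning the field to the safe-integer bounds (a sketch; stating these bounds is an illustrative choice, not an XDM requirement):

```json
{
  "type": "integer",
  "minimum": -9007199254740991,
  "maximum": 9007199254740991
}
```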

lrosenthol commented 6 years ago

@kstreeter The limitation of 53 bits exists only in specific implementations (e.g. JavaScript); it is not systemic to XDM or JSON. JSON itself is format and implementation agnostic (as noted in its spec).

fmeschbe commented 6 years ago

Technically @lrosenthol is right that ideally there is no limit on the size and precision of number fields. In practice, though, this does not hold true. In particular, RFC 7493 suggests in Section 2.2, Numbers, assuming that numbers are IEEE 754 binary64. This in practice limits the precise representation of integer values to a range of roughly 2**53.

I agree with you @lrosenthol that using string is suboptimal, but this is just biting the bullet of reality.

So we are left with three options for this issue:

  1. do nothing, and assume type=integer provides enough of a hint that we are dealing with an integer of potentially arbitrary size.
  2. keep it simple and use type=string with format=int64, which solves the short-term issue but requires a different approach once we want to indicate arbitrary precision.
  3. go all-in with a new format=decimal that allows describing the desired precision. This is more involved but provides the most flexibility, even the ability to define in the schema that numbers have limits, which might be set such that implementations can optimize to int64.

I have a preference for the flexible approach which, of course, gives power and thus responsibility.

lrosenthol commented 6 years ago

@fmeschbe you are reading the wrong spec. The core JSON spec is RFC 7159 and the relevant section is 6. It very clearly says:

This specification allows implementations to set limits on the range and precision of numbers accepted.

Of course, it also takes the same point you and others have taken - that 2^53 is a best practice.

Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide

All that said, regardless of anything else, we need to do #1 to avoid any confusion in the future. As far as what else to do, I would be willing to consider #3 if we picked a very specific standard (eg. BigDecimal from Java) that it meant. Leaving it vague is no better than the problem that got us here in the first place.

fmeschbe commented 6 years ago

> As far as what else to do, I would be willing to consider #3 if we picked a very specific standard (eg. BigDecimal from Java) that it meant. Leaving it vague is no better than the problem that got us here in the first place.

I think my proposal for #3 is pretty well defined without going into implementation detail for a single implementation:

The intent of the decimal and integer formats is the ability to represent exact decimal (and integer) values exceeding the limitations imposed by double precision IEEE754 numbers underlying number and integer typed values.

We could certainly go for better wording.

But if we'd go for #3 I would be inclined to actually push this into the JSON Schema Validation spec.