Add Constraints for Smaller IEEE-754 Binary Types

amazon-ion / ion-schema

The Ion Schema Specification. This specification is licensed under the Apache 2.0 License.

https://amazon-ion.github.io/ion-schema/

Apache License 2.0

13 stars 10 forks source link

Add Constraints for Smaller IEEE-754 Binary Types #18

Closed almann closed 1 year ago

almann commented 4 years ago

If a user wants to model binary32 or binary16 ranges, it cannot be done due to precision/scale aspects of binary floating-point.

My straw man suggestion here is to have a constraint around float_type that can be one of binary16 or binary32.

popematt commented 3 years ago

Fixed by https://github.com/amzn/ion-schema-kotlin/pull/164

popematt commented 2 years ago

For the 1.2.0 release (https://github.com/amzn/ion-schema-kotlin/pull/182), it was discovered that no change was made to the spec for this (and that the implementation was incomplete), so https://github.com/amzn/ion-schema-kotlin/pull/164 was reverted.

We need to update the spec before re-introducing this feature.

We should consider allowing this constraint to be applied to Decimals and Ints as well as Floats, since there may be use cases where the schema author would allow any Ion number type in an input (that is validated by a schema), but in some other part of the application they want to represent the value as a float.

As an implementation consideration, we may want to see if we can come up with an algebraic solution so that we do not have to rely on converting to a specific format of floating point number. An algebraic solution (rather than actually converting) would make it possible to implement this constraint so that it supports all IEEE floating point interchange formats (binary16, binary32, binary64, binary128, binary256, decimal32, decimal64, and decimal128) regardless of whether the underlying platform supports floating point numbers in that specific format.

The formula in this CS StackExchange answer might be a good starting point.

Another useful resource might be this ieee754lib that supports all of the binaryN formats.

popematt commented 2 years ago

What if we took this concept in a completely different direction and added a new type of argument for valid_values that represents pre-defined sets of values? (Something vaguely similar to character classes/sets in Regex.)

This approach is nice because it can be applied to any number type (not just floats). It can also be extended to support future use cases (see some hypothetical examples below). This approach won't conflict with any user-defined content, and is backwards compatible with Ion Schema 1.0. The only disadvantage I see right now is that it adds more syntax to the valid_values constraint—possibly making it too complicated.

Here are some examples using a straw man syntax. (I don't really like the use of the word "class" because it's so overloaded, but I can't think of a better term for it right now.)

// TypeA can be any number (int, float, or decimal) that can be represented 
// exactly (ie. without rounding) by the IEEE-754 binary32 interchange format.
type::{
  name: TypeA,
  valid_values: class::ieee754_binary32,
}

// TypeB can be any number (int, float, or decimal) that can be represented 
// exactly (ie. without rounding) by the IEEE-754 decimal64 interchange format.
type::{
  name: TypeB,
  valid_values: class::ieee754_decimal64,
}

// TypeC can be any number (int, float, or decimal) with no fractional part
type::{
  name: TypeC,
  valid_values: class::integral_numbers
}

// TypeD can be any timestamp that can be represented as 
// a 32 bit integer unix time (regardless of the offset)
type::{
  name: TypeD,
  valid_values: class::int32_unix_time
}

@almann and @alancai98, what do you think of this?

almann commented 2 years ago

I am not opposed to this, but why overload valid_values and not create a different constraint all together called value_class or sormething?

I also agree with the thinking that this is over complicating what should be a relatively simple constraint. What is the generalization really buying us here--we can make any constraint applicable to any type we want in future versions, so really what this is buying us is some level of orthogonality for the concept of value class, what does this orthogonality buy us?

In other words, what are the advantages to having integral: true vs valid_values: class::integral?

alancai98 commented 2 years ago

Also not opposed to the idea. I think your last example TypeD is a pretty interesting use case.

You mentioned that the approach "can be applied to any number type", but what if an ISL user wanted to restrict the type to just one type (e.g. float)? For example, as float_size was previously implemented, it restricted the valid values to just floats, so the integer value 1 would not be valid.

Regarding modeling this approach through another valid_values, I'm also not sure of the advantages of using valid_values over another constraint as Almann had mentioned.

Out of curiosity, would these "value classes" live as part of ISL's default implementation? Could they be imported and/or extended?

popematt commented 2 years ago

@almann

why overload valid_values and not create a different constraint all together called value_class or something?

Originally, I was seeing some tension between "minimal set", "simple", and "orthogonal" here. Value classes seemed like they have some overlap with valid_values. However, since we both think that this overcomplicates valid_values, then we should drop the idea of extending valid_values.

we can make any constraint applicable to any type we want in future versions

In theory, yes we can. However, if we have a float_type constraint that only applies to float, but then we make it applicable to decimal in future, that would be a breaking change. It would be nice to figure out something that seems a little more future proof.

what this is buying us is some level of orthogonality for the concept of value class, what does this orthogonality buy us?

"Orthogonal", "Simple", and "Minimal Set" are stated as the guiding principles for the constraints (in the Rationale section of the spec), so orthogonality buys us alignment with the design principles of Ion Schema. Orthogonality of constraints is/was important enough that it was one of the reasons that Ion Schema is not an extension of Json Schema.

I think there's potential to have too many constraints that answer the question "can this value be represented in a particular encoding"? That's what the float_type constraint is trying to test, if I'm understanding correctly. Testing whether something could be converted to another representation seems like a common use case. If we end up with many constraints like float_type, is_integral, fits_in_32_bit_unix_timestamp, then we might be doing poorly on the "minimal set" and the "orthogonal" goals.

What do you think? Is it worth trying to generalize the "can this value be represented in some other encoding" concept?

@alancai98

You mentioned that the approach "can be applied to any number type", but what if an ISL user wanted to restrict the type to just one type (e.g. float)?

You can do that with the type constraint:

type::{
  name: foo,
  type: float,           //  <-- this will restrict the value to be an Ion float
  float_type: binary32,  //  <-- this would restrict the numeric value to be something 
                         //      that can be represented by ieee754 binary32
}

Since we have a type constraint, I think that we should make the other constraints broadly applicable when possible. In this particular case, we're not testing if a value "is" a certain type of float, but rather if it "could be represented as" a certain type of float, so I don't think there should be a requirement that the value is already a float. However, I realize that the name float_type implies that the value is a float, so I think there's still some work to be done in order to make sure that the constraint is not confusing to users.

Out of curiosity, would these "value classes" live as part of ISL's default implementation? Could they be imported and/or extended?

This is a good question. I'm sure it's possible that they could be created in an extensible way somehow. I didn't look into that at all because I was just looking for some early feedback on the idea.

pbcornell commented 2 years ago

I'm probably missing context, but chiming in in case the following is of help:

it's tempting to reuse the precision constraint for this, but binary32 is a far more encoding-oriented term than precision: 5
have you considered modeling as type: binary32::float? perhaps this reads more like binary32 is an encoding detail of the type

In the TypeC example, encoding a float with no fractional part presents some interesting challenges. And a decimal with no fractional part sounds like scale: 0.

Also, perhaps the TypeD example is already covered by timestamp_precision: second?

popematt commented 2 years ago

Hey, thanks for the input @pbcornell! Even if you are missing some context, I think I agree with everything you've brought up.

Based on everyone's feedback, I have come to see that my suggested generalization was an interesting idea, but not a good fit for Ion Schema.

Here's where I think we are at right now with the float_type constraint. We can all agree that it is useful and should be introduced to Ion Schema. We agree that it should be able to detect not just when a number would overflow or underflow a particular float interchange format (as in the original implementation), but also when a that float interchange format does not have the necessary precision to represent the number (ie. a binary64 float that must be rounded if stored as binary32). I think we all agree that the constraint should not be checking the actual text/binary encoding or DOM representation of the value.

Here are the things I think we still need to address.

Naming

It would be nice to have some name other than float_type, since type is such an overloaded term. Other possibilities might be float_format or float_size, but these names seem to imply something about the backing representation of the number. Ideally, we would have a name that does not imply something about the backing representation of the number. However, if we can't think of anything better than float_type, I don't think this should be a blocker.

Applicability

I think there are two use cases for making this constraint apply to all number types—not just floats.

Ion can interpret equivalent Json numbers as ints, floats, or decimals (eg. 1, 1e0, 1.0 are all the same Json number, but are Ion values with different types). I believe that the most customer-friendly experience for developers who must deal with Json/Ion inter-op is to make the constraint applicable to all number types.

Making this constraint applicable to all number types will also set us up to enable use cases for IEEE-754 interchange formats that are not part of the Ion specification. For example, MongoDB and Amazon DocumentDB use decimal128 (by way of BSON), and if one were to create an application that uses Ion as the wire format and DocumentDB for the storage, one may wish to create a schema that defines a subset of Ion that is compatible with BSON. Another use case would be modeling floating point numbers in Ion that are larger than 64 bits. They cannot be represented as an Ion float (since Ion limits floats to 64 bits), so a natural choice would be to store them as decimals, and so users may wish to to use this constraint for testing decimals against binary128 or binary256.

I believe the amount of extra work to enable the constraint for all number types should be trivial as most (if not all) languages with Ion libraries have some way to convert between integers and floating point numbers and between arbitrary precision numbers and floating point numbers.

Is anyone opposed to making this constraint applicable to all number types?

almann commented 2 years ago

Really nice to hear from you @pbcornell, thanks for the input.

I am +0 on making this applicable to to other number types, I don't explicitly oppose it, but I think it is just more complexity and I can't think of really compelling use cases to accept some other types that could fit in a different types encoding subset.

I do like the idea of supporting other IEEE-754 constraints in the futures, but I wonder if those should just be flat out different constraints (some spitball spellings):

ieee754_float: binary32
ieee754_decimal: decimal128

Whether or not the constraint is applicable across the type boundary, is orthogonal. If we allow it this means you could have both constraints above :woman_shrugging: without having to make a more general constraint take a list.

pbcornell commented 2 years ago

I keep wanting binary32 to be some kind of alias for precision: x (bits), scale: y (bits), but without digging deeper I suspect this idea presents some different challenges. ;)

I think we all agree that the constraint should not be checking the actual text/binary encoding or DOM representation of the value.

If I'm following, Ion Schema would perform a best guess as to whether a value will by lossy or hit a boundary, but it would not catch all issues.

Ideally, we would have a name that does not imply something about the backing representation of the number.

As soon as I see binary32 (and/or ieee*), it's difficult not to read that as encoding. The idea seems to be to leverage some aspects of the sizing associated with an interchange format (binary32) without strictly guaranteeing a particular value will fit. Use of such terms seems ok (for lack of less encoding-oriented terms), but of course the spec/docs would need to be clear about the checks the constraint actually performs, and what it won't catch.

And for me, @almann's ieee754 prefix provides helpful context for binary32 (i.e., it's not Ion binary/decimal, it's an IEEE reference).

Two cents. :)

popematt commented 2 years ago

I do like the idea of supporting other IEEE-754 constraints in the future

Sure, and I agree that we can punt that to the future.

I can't think of really compelling use cases to accept some other types that could fit in a different types encoding subset

I respectfully disagree, but I've realized that this particular concern can be decoupled from the constraint itself, and we can deal with it later in a backwards compatible way if/when the need arises.

And for me, @ almann's ieee754 prefix provides helpful context for binary32 (i.e., it's not Ion binary/decimal, it's an IEEE reference).

I agree, and I don't think a better name exists.

Thanks for all of the feedback you have provided @pbcornell and @almann. I'm going to draft a proposal for the next version of the Ion Schema spec and share it here.

popematt commented 2 years ago

Syntax

<IEEE754_FLOAT> ::= ieee754_float: binary16
                  | ieee754_float: binary32
                  | ieee754_float: binary64

The binary64 value is not strictly necessary, but...

it feels more complete to include it.
it does slightly simplify use cases for programmatic construction of types
it could have some value as documentation

It could go either way, though.

Evaluation

any value that is not an Ion float is invalid
null.float is invalid
nan, +inf, and -inf are always valid
for numeric values, a value is valid if and only if it can be losslessly converted from the in-memory representation to the specified float representation, and back to the original type of in-memory representation.

Sample Implementation

This example assumes we have a Half data type that is an implementation of a 16 bit floating point number (with similar functions as Float and Double).

enum class Ieee754InterchangeFormat {
    Binary16,
    Binary32,
    Binary64
}

private fun test(value: IonValue, interchangeFormat: Ieee754InterchangeFormat): Boolean {
    if (value !is IonFloat || value.isNullValue) return false

    // If +inf, -inf, or nan...
    if (!value.doubleValue().isFinite()) return true

    return when (interchangeFormat) {
        Binary16 -> value.doubleValue().toHalf().toDouble() == value.doubleValue()
        Binary32 -> value.doubleValue().toFloat().toDouble() == value.doubleValue()
        Binary64 -> true // All Ion Floats are 64 bits or smaller.
    }
}

popematt commented 1 year ago

Resolved by #76