eclipse-esmf / esmf-semantic-aspect-meta-model

Formal and textual specification of the Semantic Aspect Meta Model (SAMM)
https://eclipse-esmf.github.io/samm-specification/snapshot/index.html
Mozilla Public License 2.0

[Task] Improve datatype mapping from RDF/XSD to JSON #175

Open atextor opened 2 years ago

atextor commented 2 years ago

Is your task related to a problem? Please describe. The section Payloads of the specification describes how values with the datatypes defined in an Aspect Model are to be serialized in JSON in an Aspect's payload. The mapping generally works as follows: xsd:boolean is turned into a JSON boolean, numeric types are turned into JSON numbers, and everything else is turned into a JSON string. This mapping, however, can be problematic due to loss of information: in particular, large values of unbounded numeric types such as xsd:decimal, xsd:integer and xsd:positiveInteger are affected. Although JSON by definition does not limit the length of numbers (see the "number" production rule in the JSON grammar), effective use is limited by the restrictions imposed by ECMAScript. This is also referred to in the Data type mappings subsection of SAMM's Payload section.
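
For illustration, a minimal Python sketch of the information loss; the concrete value is just an example, assuming the consumer parses JSON numbers as IEEE 754 doubles, as ECMAScript does:

import json

value = 2**64 - 1                                # a valid xsd:unsignedLong value
assert json.loads(json.dumps(value)) == value    # Python round-trips it exactly...
# ...but a consumer that parses JSON numbers as IEEE 754 doubles sees a rounded value:
assert float(value) == 2.0**64                   # 18446744073709551616.0, off by one
assert float(value) != value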

Furthermore, the Payload mapping must specify how the "special" numeric values INF, -INF and NaN are handled, which are valid for xsd:float and xsd:double but cannot be represented in JSON.

Describe the solution you'd like The Payload mapping should be changed so that the JSON data type used to represent values of the unbounded XSD types is string instead of number. This way any large number can be represented without loss of information, while keeping the "principle of least surprise" for users by sticking to native number types for xsd:int, xsd:long, xsd:short as well as xsd:boolean.
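
A sketch of the proposed rule in Python; the type sets shown here are illustrative, not the normative mapping:

UNBOUNDED_NUMERIC = {"xsd:decimal", "xsd:integer", "xsd:positiveInteger"}  # illustrative
BOUNDED_NUMERIC = {"xsd:int", "xsd:long", "xsd:short", "xsd:byte"}         # illustrative

def to_json_value(xsd_type: str, value):
    if xsd_type == "xsd:boolean":
        return value              # JSON boolean, as before
    if xsd_type in UNBOUNDED_NUMERIC:
        return str(value)         # JSON string: arbitrary size without loss
    if xsd_type in BOUNDED_NUMERIC:
        return value              # JSON number: principle of least surprise
    return str(value)             # everything else stays a JSON string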

Regarding "special" numeric values, the JSON number type should be kept for xsd:float and xsd:double values to not add major inconveniences (and inconsistency with existing REST APIs) to handle these seldomly occuring corner cases. I would propose to define null to stand for NaN (as is customary in purely JSON-based applications) and to no try to fix JSON's shortcomings regarding infinity values (i.e. define that those values can not be represented in an Aspect payload).

atextor commented 2 years ago

WG discussion 2022-09-08:

Topic large values for unbounded numeric values:

Topic INF/-INF:

Resolution: Postpone decision for now, gather more input

BirgitBoss commented 2 years ago

In https://github.com/admin-shell-io/aas-specs/pull/236/#discussion_r966685795 the statement is that for xs:long, too, JSON number is not a valid mapping; it should be mapped to string as well.
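
The underlying issue in Python terms (2^53 is the limit of exact integer representation in an IEEE 754 double):

n = 2**53 + 1                     # a valid xs:long value
assert float(n) == float(n - 1)   # both round to the same double
assert float(n) != n              # so the value cannot survive as a JSON-number double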

BirgitBoss commented 2 years ago

See here for arguments from https://github.com/mristin on why to use ONLY string in JSON serializations: https://github.com/admin-shell-io/aas-specs/blob/9c342305f04ecb35baa050a17cad6928ba0ba519/schemas/json/README.md

BirgitBoss commented 2 years ago

For xsd:base64Binary, please add that it is not just mapped to string, but that the string additionally carries the base64-encoded value. Thanks, @mristin for the hint.
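
A minimal sketch of what that looks like in Python:

import base64
import json

raw = b"\x00\x01\x02"                            # the binary value
lexical = base64.b64encode(raw).decode("ascii")  # "AAEC"
print(json.dumps({"data": lexical}))             # the JSON string carries the base64 text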

mristin commented 2 years ago

@atextor (I'm the author of the comments on https://github.com/admin-shell-io/aas-specs/pull/236.)

I'm sharing here a couple more edge cases that you might want to consider. There must be more, but these were the ones apparent when I looked into the XSD specification.

A couple more remarks:

A note about JSON with regard to ECMA: I'd recommend ignoring JavaScript/ECMA and considering only RFC 8259, unless you want to tie your implementation to JavaScript. Other languages have different conventions. For example, Python parses integers of arbitrary size (2^1234 can be parsed and serialized in Python without problem, see below).

In Section 6 "Numbers" of RFC 8259, they are not specific about the number limits:

This specification allows implementations to set limits on the range and precision of numbers accepted.

They do note, however, that IEEE 754 is widespread and should be considered:

Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

If you read from JSON, then you have to be careful! From Section 6 "Numbers" of RFC 8259:

A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

As far as I know, you cannot easily check whether you lose precision when you parse a number from a string. For example, C# and Python give you infinity if the number is too large for an IEEE 754 double, but you obtain only a rounded number if it is too precise.

Example in Python:

>>> import json

>>> json.loads(
...     '1e123456789123456789123456789123456789123456789'
... )
inf

Note: e indicates a floating point number! When the number is an integer (without e and .), it works differently:

>>> json.loads(
...     '123456789123456789123456789123456789123456789'
... )
123456789123456789123456789123456789123456789

The precision is silently lost in Python (notice the loss of precision after roughly 17 significant digits):

>>> "{:.50f}".format(
...    json.loads("0.123456789123456789123456789123456789123456789"))
'0.12345678912345678379658409085095627233386039733887'
# No exception

This is expected, as IEEE 754 doubles can only represent about 17 significant decimal digits exactly. However, the library does not notify you that your JSON had higher precision.

Here is a similar example with large numbers (notice again the loss of precision roughly after the 17th digit):

>>> "{:.50f}".format(
...    json.loads(
...        "123456789123456789123456789123456789123456789.0"
...    )
... )
'123456789123456789439311560846449175093575680.00000000000000000000000000000000000000000000000000'

You have to check the behavior language by language and library by library.

mristin commented 2 years ago

@atextor I just remembered one more point against null as a symbol for NaN.

Namely, when you use reflection-based JSON libraries to parse the data, you cannot distinguish between properties whose values were set to null and properties that were not defined at all. Notably, C#, Java and Golang libraries come to mind.
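
For illustration, the same effect in Python, with a hypothetical dataclass binding standing in for the reflection-based libraries:

import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reading:                      # hypothetical typed binding
    measurement: Optional[float] = None

# The raw JSON still distinguishes the two cases...
assert "measurement" in json.loads('{"measurement": null}')
assert "measurement" not in json.loads('{}')

# ...but the typed binding maps both to the same object:
assert Reading(**json.loads('{"measurement": null}')) == Reading(**json.loads('{}'))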

While in many cases this does not matter, there are scenarios where you want to distinguish the two.

Here's an admittedly contrived example so you get an idea. In some situations the measurement was not performed at all (e.g., the sensor was turned off -> the property is not defined). In other situations the sensor was turned on but could not measure precisely (the property is set to NaN). Finally, you'd like to see the stats over the two (# of situations where the sensor was turned off, # of situations where the sensor was imprecise). If you used null, both situations would be represented the same.

I don't know if this projects well to your use case, but it is definitely a good test to check whether such distinctions exist in the semantics of your model. Of course, you can always use additional properties, but having NaN + undefined at your disposal is sometimes a nice tool to encode values in a single property succinctly.

atextor commented 1 year ago

Hi @mristin, thank you for your valuable input!

I think the most important question needs to be clarified first: Is there an explicit requirement to be able to represent every valid XSD value in the JSON serialization without information loss? At least in the context of BAMM, such a requirement does not exist as of now. Although it might seem counterintuitive to even question this, there are other requirements that actually contradict a purely technical solution that would fulfil it (e.g., just putting everything in strings). Having BAMM and its tooling be a helpful tool for "regular" developers (i.e., developers without background knowledge in semantic systems) is also a requirement, and this includes the principle of least surprise (e.g., "Why would you put my int values into quotes? This doesn't make any sense. Now I have to work around my framework's automatic type deserialization.").

In short, the payloads should consider the following:

That being said, we should of course try to cover as much ground as possible. In particular, we must make sure to not have undefined behaviour (i.e., if we do end up with different value ranges compared to XSD, they must be properly documented), but IMO it's not our duty to solve JSON's shortcomings at all costs. This might warrant some more discussion.

* `+0` and `-0` are often disregarded in JSON representation (but are valid in XSD). It makes a world of difference for the arctan (atan2) and divmod functions. See this [StackOverflow answer](https://stackoverflow.com/a/4083431/1600678).

Is there an agreed-upon solution for this other than "always send this particular numeric property as a string"?
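
For reference, the signed-zero behaviour in Python:

import json
import math

# +0.0 and -0.0 compare equal, but atan2 distinguishes them:
assert math.atan2(0.0, -1.0) == math.pi
assert math.atan2(-0.0, -1.0) == -math.pi

# JSON can carry the sign on a floating-point zero...
assert math.copysign(1.0, json.loads("-0.0")) == -1.0
# ...but the integer "-0" parses to a plain int 0, losing the sign:
assert isinstance(json.loads("-0"), int)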

* XSD allows leading zeros, JSON does not (`0001` is a valid `xs:int`).

* XSD allows `1` and `0` as `xs:boolean`.

I think we must not confuse a value with its lexical representation. The values of xsd:int "1" and xsd:int "01" are identical and can both be represented in JSON as 1. There are no semantics in different lexical representations of the same value, and it would be wrong to handle those values differently. The same holds for 1 and 0 as boolean values: they happen to map to true and false, but this must not imply different semantics for the value; i.e., parsing the XSD value and then serializing it in JSON must lead to the same canonical representation (which in JSON is of course true or false).
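
In Python terms:

import json

# Different lexical forms, one value, one canonical JSON serialization:
assert int("01") == int("1") == 1
assert json.dumps(int("01")) == "1"

# The xsd:boolean lexical forms "1"/"0" denote the values true/false,
# which JSON serializes canonically:
assert json.dumps(True) == "true"
assert json.dumps(False) == "false"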

* (as @BirgitBoss noted) `xs:long` is 64-bit, so it exceeds the 53 bits of exact integer representation in IEEE 754.

Birgit and you are correct, and this is an important omission; xsd:long should be handled in the same way as xsd:integer etc.

* `xs:float` would have a different precision than `xs:double`. Using the decimal representation would lead to lossy conversions (this is probably not critical, but would still prevent a round-trip and cause equalities and tests to fail).

If the model specifies xsd:float as a type, this defines the contract: the Aspect sending data should adhere to the contract and should not send data with an allegedly higher precision. Neither the sender (Aspect) nor the receiver (client) should assume anything else.

A couple more remarks:

* I wouldn't consider `INF` really an edge case. It shows up, say, for sensor values when the sensor overflows. It is also important to use such values in further computations. If a sensor reports a value too large, the checks should also pass if we check against a threshold within the sensor specification.
  Think of the scenario: sound an alarm if temperature over 300°C. You know that your sensor range is up to 1000°C and you observe an `INF`. The program should ring the alarm bell, not throw an exception.

This is easily solved on the model level and needs no solution on the level of data serialization. You even mentioned it yourself: if the range specified in the model is "up to 1000°C", then every value larger than that, including INF, is by definition an error. This means the Aspect must not send such a value, because doing so would violate the contract. If it does anyway, it is undefined behaviour, and the client program should definitely not ring the alarm bell.

* `-0` needs to be distinguished from `+0` as XSD distinguishes them as well in `xs:nonPositiveInteger`, `xs:negativeInteger` _etc._ Please see the lexical representation for all the edge cases. This is important if you copy the value from one type to the other.

This should be addressed (see my question above), but not because of the lexical representations.

* `NaN` are not always represented as `null`. This is language-specific. For example, Python represents `NaN` as a JSON string `"NaN"`, and `+INF` as a JSON string `"Infinity"`.

* I don't know if you target machine learning with your specification. It is really wide-spread in that field to use `NaN` for missing values, and most algorithms working with missing values are implemented by expecting the missing values to be marked with `NaN`.

Yes, there is no standard serialization, but we could define it like this for Aspect payloads. This is what I meant by "cover as much ground as possible": we can easily define null to stand for NaN for properties of type xsd:float/xsd:double without violating any of the requirements; it would still be a perfectly valid and reasonable JSON payload from the point of view of non-semantic developers, as long as it is documented properly. This would just require changing the payload rule that a null value for a property marked as optional is equivalent to the property not being present.
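
On the consuming side, the rule would look roughly like this (Python sketch; names are illustrative):

import json
import math

def decode_float_property(payload: dict, name: str) -> float:
    raw = payload[name]
    # per the proposed rule: null stands for NaN for xsd:float/xsd:double
    return float("nan") if raw is None else float(raw)

value = decode_float_property(json.loads('{"measurement": null}'), "measurement")
assert math.isnan(value)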

A note about JSON with regard to ECMA: I'd recommend ignoring JavaScript/ECMA and considering only RFC 8259, unless you want to tie your implementation to JavaScript. Other languages have different conventions. For example, Python parses integers of arbitrary size (2^1234 can be parsed and serialized in Python without problem, see below).

In Section 6 "Numbers" of RFC 8259, they are not specific about the number limits:

This specification allows implementations to set limits on the range and precision of numbers accepted.

They do note, however, that IEEE 754 is widespread and should be considered:

Since software that implements IEEE 754 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.

Since one of the main places where Aspect data is consumed is in web apps, we certainly cannot ignore JavaScript/ECMAScript. Known limitations of the JSON spec itself, but also of well-known languages and frameworks that will likely be used to consume the data (this includes Python, of course), should be taken into account where reasonable. We will certainly not find great solutions for all details (for that, JSON and its ecosystem are too broken tbh 😜), so we need to find some sweet spots. For example, following the JSON spec to the letter and allowing huge numbers that work in Python but break in JavaScript is not helpful.

If you read from JSON, then you have to be careful! From Section 6 "Numbers" of RFC 8259:

A JSON number such as 1E400 or 3.141592653589793238462643383279 may indicate potential interoperability problems, since it suggests that the software that created it expects receiving software to have greater capabilities for numeric magnitude and precision than is widely available.

As far as I know, you cannot easily check whether you lose precision when you parse a number from a string. For example, C# and Python give you infinity if the number is too large for an IEEE 754 double, but you obtain only a rounded number if it is too precise.

Example in Python:

>>> import json

>>> json.loads(
...     '1e123456789123456789123456789123456789123456789'
... )
inf

Note: e indicates a floating point number! When the number is an integer (without e and .), it works differently:

>>> json.loads(
...     '123456789123456789123456789123456789123456789'
... )
123456789123456789123456789123456789123456789

The precision is silently lost in Python (notice the loss of precision after roughly 17 significant digits):

>>> "{:.50f}".format(
...    json.loads("0.123456789123456789123456789123456789123456789"))
'0.12345678912345678379658409085095627233386039733887'
# No exception

This is expected, as IEEE 754 doubles can only represent about 17 significant decimal digits exactly. However, the library does not notify you that your JSON had higher precision.

Here is a similar example with large numbers (notice again the loss of precision roughly after the 17th digit):

>>> "{:.50f}".format(
...    json.loads(
...        "123456789123456789123456789123456789123456789.0"
...    )
... )
'123456789123456789439311560846449175093575680.00000000000000000000000000000000000000000000000000'

You have to check the behavior language by language and library by library.

Yes, this differs from language to language; this is to be expected with IEEE 754. This is why you would use xsd:decimal instead of xsd:float/xsd:double if you require such precision. In the sds-sdk, where Java code is generated for properties with such types, java.math.BigDecimal is used, which can represent such values without losing precision because it is not based on IEEE 754. I think on the side of the spec, (1) the corresponding limits need to be documented, and (2) a modeling best practices document could address such cases with explanations and examples (i.e., "don't use float/double but decimal instead if..."). Then a deserialization implementation (sds-sdk for Java, or the soon-to-be-released sds-sdk-py for Python) can take care of this for you.
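
For comparison, Python's standard library can parse JSON numbers losslessly when instructed to, analogous to java.math.BigDecimal on the Java side:

import json
from decimal import Decimal

d = json.loads("3.141592653589793238462643383279", parse_float=Decimal)
assert d == Decimal("3.141592653589793238462643383279")   # no precision lost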

I just remembered one more point against null as a symbol for NaN. Namely, when you use reflection-based JSON libraries to parse the data, you cannot distinguish between properties whose values were set to null and properties that were not defined at all. Notably, C#, Java and Golang libraries come to mind.

This is why the Aspect payload mapping forbids null values (except for properties marked as optional, where null is equivalent to the property not being present; as mentioned above, this might have to change if we want to accommodate NaN using null). So either the property is mandatory in the model, in which case it must be present in the JSON data with a valid value in the corresponding range (never null); or the property is optional, in which case it can be missing from the payload (indicating an empty/nonexistent value) or be present with a corresponding value.
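
As a sketch, this rule amounts to a check like the following (illustrative only, not SDK code):

def check_property(payload: dict, name: str, optional: bool) -> None:
    # Sketch of the payload rule described above (without the proposed
    # NaN/null exception): null is never a valid value.
    if name not in payload:
        if optional:
            return                  # absent optional property: empty value
        raise ValueError(f"mandatory property {name!r} is missing")
    if payload[name] is None:
        raise ValueError(f"property {name!r} must not be null")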

Here's an admittedly contrived example so you get an idea. In some situations the measurement was not performed at all (e.g., the sensor was turned off -> the property is not defined). In other situations the sensor was turned on but could not measure precisely (the property is set to NaN). Finally, you'd like to see the stats over the two (# of situations where the sensor was turned off, # of situations where the sensor was imprecise). If you used null, both situations would be represented the same.

This could (and should!) also be cleanly modeled in the Aspect model instead of trying to cram everything into the JSON serialization, for example using an Entity in combination with a corresponding enumeration of the possible states:

:sensorValue a bamm:Property ;
  bamm:characteristic [
    a bamm-c:SingleEntity ;
    bamm:dataType :SensorValue ;
  ] .

:SensorValue a bamm:Entity ;
  bamm:properties ( :sensorState [ bamm:property :sensorMeasurement ; bamm:optional true ] ) .

:sensorState a bamm:Property ;
  bamm:characteristic [
    a bamm-c:Enumeration ;
    bamm:dataType :SensorState ;
    bamm-c:values ( :SensorStateOffline :SensorStateMeasurementFailure :SensorStateMeasurementSuccess ) ;
  ] .

:SensorState a bamm:Entity ;
  bamm:properties ( :stateCode [ bamm:property :stateDescription ; bamm:notInPayload true ] ) .

:stateCode a bamm:Property ;
  bamm:characteristic [
    a bamm-c:Code ;
    bamm:dataType xsd:string ;
  ] .

:stateDescription a bamm:Property ;
  bamm:characteristic bamm-c:Text .

:SensorStateOffline a :SensorState ;
  :stateCode "OFFLINE" ;
  :stateDescription "The sensor is offline" .

:SensorStateMeasurementFailure a :SensorState ;
  :stateCode "FAILURE" ;
  :stateDescription "The sensor is online, but reading a measurement failed" .

:SensorStateMeasurementSuccess a :SensorState ;
  :stateCode "SUCCESS" ;
  :stateDescription "The sensor is online and reading a measurement succeeded" .

:sensorMeasurement a bamm:Property ;
  bamm:characteristic [
    a bamm-c:Measurement ;
    bamm:dataType xsd:float ;
    bamm-c:unit unit:degreeCelsius ;
  ] .

And then the valid payloads would look like the following:

{
  "sensorValue": {
    "sensorState": { "stateCode": "OFFLINE" }
  }
}

or

{
  "sensorValue": {
    "sensorState": { "stateCode": "FAILURE" }
  }
}

or

{
  "sensorValue": {
    "sensorState": { "stateCode": "SUCCESS" },
    "sensorMeasurement": "23.5"
  }
}

Making the intention and the context semantics explicit is better than encoding them as null or NaN.

mristin commented 1 year ago

@atextor thanks for the replies!

Just a small clarification: I didn't mean to ignore JavaScript in general, just that it is a poor choice for the north star of the design. Your solution needs to live at the intersection of all major languages: if you only considered JavaScript, you would miss a lot of intricacies and issues in other languages.

All that said, I'd advise you to take a step back and think about the data types you use in your model in a more holistic manner. They are the very foundation, so the typing system needs to be a solid one.

Instead of making a Frankenstein where you insist on XML data types but then introduce tons of unnecessary leaky abstractions depending on which serialization is used, I'd recommend determining the core set of primitive types that you need and supporting only that set.

That way you will avoid the problems with leaky abstractions and also keep round-trips sane, which is super important for testing (fuzzing) and correctness (static and runtime invariants). A core set of types would also be much easier to digest and reason about, avoiding confusion for the reader.

mristin commented 1 year ago

P.S. Sorry, I forgot to clarify the point about floats and doubles. Please take into account that the chain double -> string -> double often does not guarantee a round-trip (the g17 representation might help) unless you always use the same double/string de/serialization (often not a given!). For example, serializing in C# and deserializing in Python might give you a different number at the end.

The problem is exacerbated in the chain float -> string -> double -> float.
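
To illustrate the fragility, a minimal Python sketch; the float32 rounding helper is just for demonstration:

import struct

def to_float32(x: float) -> float:
    # Round a double to the nearest IEEE 754 binary32 value
    return struct.unpack("f", struct.pack("f", x))[0]

x = 1 / 3
assert float(repr(x)) == x            # shortest-repr round-trip holds...
assert float("%.6g" % x) != x         # ...but a producer printing fewer digits breaks it

f32 = to_float32(1 / 3)               # the chain float -> string -> double -> float:
assert to_float32(float("%.6g" % f32)) != f32   # truncated text loses the float32 too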