Support NaN, INF, -INF number types in ValidateInstance

filipGG commented 1 day ago

Using ValidateInstance on DTDoubleInfo with either NaN, Infinity or -Infinity currently results in a violation, since this check is being made:

            // In my case when using JsonSerializer.Serialize with 
            // NumberHandling = JsonNumberHandling.AllowNamedFloatingPointLiterals
            // double.NaN becomes a JSON string "NaN"

            if (instanceElt.ValueKind != JsonValueKind.Number)
            {
                violations.Add($">>{instanceElt.GetRawText()}<< is not a numeric value");
                return false;
            }

I have a few cases where some of my doubles can be double.NaN or double.PositiveInfinity according to W3 NaN and infinity values are allowed: https://www.w3.org/TR/xmlschema-2/#double

I guess my suggestion would be to change the ValidateInstance methods to something like:

        private bool ValidateInstanceV2(JsonElement instanceElt, string instanceName, List<string> violations)
        {
            if (!double.TryParse(instanceElt.GetRawText(), out double val))
            {
                violations.Add($"{instanceElt.GetRawText()} does not conform to the XSD definition of 'double'");
                return false;
            }

            return true;
        }

Do you see any issues with making this change?

jrdouceur commented 1 day ago

I'm afraid there are a number of issues with such a code change.

First off, the code for ValidateInstanceV2 (and most of the code in the DTDLParser) is code-generated from a DTDL metamodel. In particular, the check if (instanceElt.ValueKind != JsonValueKind.Number) is generated because of this line in the DTDL v2 metamodel. The only disciplined way to make this change would be to modify the DTDL v2 metamodel to remove this line, and then let the ParserGenerator produce updated validation code.

Which brings up the next issue: The DTDL v2 language definition (which is codified in the metamodel) has been locked down for quite a few years, as has the DTDL v3 language definition. In theory, we could omit the "dtmm:jsonType": "dtmm:jsonNumber" property from the definition of dtmi:dtdl:class:Double;4 when we release DTDL v4; however, this would be a fundamental change to a primitive schema type, which is a much deeper change than we normally wish to impose when versioning.

However, I can see a potential way to address this use case. We could define a new DTDL language extension that provides an adjunct type that could be used to indicate that non-numeric values are permitted for specific float or double schema values. (The ParserGenerator would need to understand the extension and update the DTDLParser validation code accordingly.) In your model, you could include the extension context and add the co-type to any schema for which you wish non-numeric values to be accepted as valid. There are a number of ramifications that need to be investigated to assess the tenability of this approach, but the basic idea seems to be sound.

I need to be clear that I'm not in a position wherein I can promise to investigate, define, design, or implement such an extension, nor can I realistically assess a timeframe for when such work could be done. But it would be helpful for me to know whether you think an extension could address your use case sufficiently. If so, I can propose it to the DTDL steering committee, and we can see where the idea goes.

filipGG commented 13 hours ago

Thanks for the detailed explanation!

I'm stuck having to use DTDL v2 for the moment so language extensions won't work for me. Also my current approach is perhaps not optimal anyways, as i have all my instances in defined types in memory. I just serialize them for the validation step:

    private List<string> ValidateProperty(object value, DTPropertyInfo dtPropertyInfo)
    {
        var serialized = JsonSerializer.Serialize(value, _serializerOptions);
        return [.. dtPropertyInfo.Schema.ValidateInstance(serialized)];
    }

So for now I think I will continue with a custom solution instead, or at least handle double and float separately.

Perhaps for future versions of DTDL it could be worthwhile to support the valid non-numeric values for float/double since they are valid according to the standard, but i understand that it would be fundamental change. This feels more right than adding an extension, but this is just my amateur opinion :)

jrdouceur commented 10 hours ago

Well, technically, DTDL double is defined as, "a finite numeric value that is expressible in IEEE 754 double-precision floating point format, conformant with xsd:double". NaN and infinity are not finite numeric values. So, I think we would be extending the expressible range of DTDL double by allowing these non-finite values.

Moreover, there does not seem to be a formal standard for expressing NaN or +/-Infinity in JSON. There are JSON implementations that use de facto string representations, but these are not consistent across libraries. Support is also inconsistent across other serialization formats. For example, Protobuf permits Nan and Infinity for double fields but not for value fields that are doubles. The safest, most general approach seems to be disallowing non-finite values for the core double schema.

The idea of an extension intrigued me enough that I spent a little time yesterday sketching out an extension metamodel, which turned out to be pretty straightforward. I do intend to raise the issue with the DTDL Steering Committe, since I can see the benefit in supporting non-finite values for some use cases.

filipGG commented 8 hours ago

Hmm I'm a bit confused... searching for xsd:double definition lands me here: https://www.w3.org/TR/xmlschema11-2/#double and here: http://www.datypic.com/sc/xsd/t-xsd_double.html

and both these sources says that NaN and Inf are valid values for xsd:double or I'm I looking at the wrong thing?

If the sources are correct then the definition of a DTDL double is kind of misleading since it's not conforming to xsd:double as it claims in the documentation: https://github.com/Azure/opendigitaltwins-dtdl/blob/master/DTDL/v2/DTDL.v2.md#primitive-schema

The safest, most general approach seems to be disallowing non-finite values for the core double schema.

I understand why. Does not seem to be any great ways to solve this with JSON

jrdouceur commented 8 hours ago

You are correct about the definition of xsd:double. I think the phrasing of the DTDL double description is perhaps ambiguous. The description states three conditions:

finite numeric value
expressible in IEEE 754 double-precision floating point format
conformant with xsd:double

The intent was that all three conditions must hold, but the phrase structure does not make this plain.

filipGG commented 7 hours ago

Not sure it's ambiguous... does not 1 and 3 just contradict each other? 1 says NaN and Inf are invalid but 3 says they are valid

jrdouceur commented 7 hours ago

I think we can agree that the phraseology needs reworking.

digitaltwinconsortium / DTDLParser

Support NaN, INF, -INF number types in ValidateInstance #142