RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

Problem when validating xsd:float #140

Closed tobiasschweizer closed 2 years ago

tobiasschweizer commented 2 years ago

Hi there,

Validating an xsd:float gives me an unexpected validation report. I am using "PySHACL Version: 0.19.0".

Example:

shapes graph "shapes.json":

{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "prov": "http://www.w3.org/ns/prov#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "sh": "http://www.w3.org/ns/shacl#",
    "shsh": "http://www.w3.org/ns/shacl-shacl#",
    "dcterms": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
    "rescs": "http://rescs.org/"
  },
  "@graph": [
    {
      "@id": "rescs:dash/monetaryamount/MonetaryAmountShape",
      "@type": "sh:NodeShape",
      "rdfs:comment": {
        "@type": "xsd:string",
        "@value": "A monetary value or range. This type can be used to describe an amount of money such as $50 USD, or a range as in describing a bank account being suitable for a balance between £1,000 and £1,000,000 GBP, or the value of a salary, etc. It is recommended to use [[PriceSpecification]] Types to describe the price of an Offer, Invoice, etc."
      },
      "rdfs:label": {
        "@type": "xsd:string",
        "@value": "Monetary amount"
      },
      "sh:property": {
        "sh:datatype": {
          "@id": "xsd:float"
        },
        "sh:description": "The value of the quantitative value or property value node.\\\\n\\\\n* For [[QuantitativeValue]] and [[MonetaryAmount]], the recommended type for values is 'Number'.\\\\n* For [[PropertyValue]], it can be 'Text;', 'Number', 'Boolean', or 'StructuredValue'.\\\\n* Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similiar Unicode symbols.\\\\n* Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.",
        "sh:maxCount": {
          "@type": "xsd:integer",
          "@value": 1
        },
        "sh:minCount": {
          "@type": "xsd:integer",
          "@value": 1
        },
        "sh:minExclusive": 0,
        "sh:name": "value",
        "sh:path": {
          "@id": "schema:value"
        }
      },
      "sh:targetClass": {
        "@id": "schema:MonetaryAmount"
      }
    }
  ]
}

data sample "monetaryamount.json":

{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": 100000
  }
}

pyshacl -sf json-ld -s shapes.json -df json-ld monetaryamount.json gives me:

Validation Report
Conforms: False
Results (1):
Constraint Violation in DatatypeConstraintComponent (http://www.w3.org/ns/shacl#DatatypeConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:float ; sh:description Literal("The value of the quantitative value or property value node.\n\n For [[QuantitativeValue]] and [[MonetaryAmount]], the recommended type for values is 'Number'.\n For [[PropertyValue]], it can be 'Text;', 'Number', 'Boolean', or 'StructuredValue'.\n Use values from 0123456789 (Unicode 'DIGIT ZERO' (U+0030) to 'DIGIT NINE' (U+0039)) rather than superficially similiar Unicode symbols.\n Use '.' (Unicode 'FULL STOP' (U+002E)) rather than ',' to indicate a decimal point. Avoid using these symbols as a readability separator.") ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:minCount Literal("1", datatype=xsd:integer) ; sh:minExclusive Literal("0", datatype=xsd:integer) ; sh:name Literal("value") ; sh:path schema1:value ]
    Focus Node: [ :value Literal("100000", datatype=xsd:float) ; rdf:type :MonetaryAmount ]
    Value Node: Literal("100000", datatype=xsd:float)
    Result Path: schema1:value
    Message: Value is not Literal with datatype xsd:float

Changing the @value to 100000.0 or "100000" makes it pass. However, I think all three variants should be valid, no?

I tried the example above on https://shacl.org/playground/ which worked fine.

Could you tell me whether I am doing something wrong or this is a bug?

Thanks a lot!

tobiasschweizer commented 2 years ago

For example, -1E4, 1267.43233E12, 12.78e-2, 12, -0, 0 and INF are all legal literals for float.

https://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#float

My assumption is that 100000 is implicitly 100000.0 when typed as xsd:float.

Could it be that 100000 is actually represented as an int in Python?

https://github.com/RDFLib/pySHACL/blob/3d677893199124de51d13de6f373d1d35c4a6cdf/pyshacl/constraints/core/value_constraints.py#L214-L215
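
The suspicion above can be checked in plain Python, independent of pySHACL. This is a minimal sketch that only assumes the linked lines perform a strict isinstance test against float; the helper name `is_strict_float` is hypothetical and not part of pySHACL.

```python
# Hypothetical helper, not pySHACL code: it mimics a strict
# "is this value a Python float?" test.
def is_strict_float(value):
    return isinstance(value, float)


# A whole number such as 100000 is an int in Python, so the strict test fails.
print(is_strict_float(100000))           # False
print(is_strict_float(100000.0))         # True
print(is_strict_float(float("100000")))  # True (string parsed to float)
```

If the check does work this way, it would explain why 100000.0 and "100000" pass while 100000 does not.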

ashleysommer commented 2 years ago

Hi @tobiasschweizer Thanks for the bug report. I think this is a bug in the RDFLib JSON-LD parser. Is it possible for you to test the same example but encoded in Turtle format, to see if the issue remains?

tobiasschweizer commented 2 years ago

Sure, I will try this and come back to you asap.

tobiasschweizer commented 2 years ago

I tried the following which worked fine:

"monetaryamount.ttl"

<http://www.example.com/1> a <http://schema.org/MonetaryAmount> ;
  <http://schema.org/value> "100000"^^<http://www.w3.org/2001/XMLSchema#float> .

pyshacl -sf json-ld -s shapes.json -df turtle monetaryamount.ttl

Validation Report
Conforms: True

ashleysommer commented 2 years ago

Ok, great. Thanks, that confirms the bug lies in the JSON-LD parser. I'll create a corresponding bug in the RDFLib bug tracker.

tobiasschweizer commented 2 years ago

Ok, thanks. Let me know if I can be of further assistance to substantiate the report.

ashleysommer commented 2 years ago

Hi @tobiasschweizer I finally got a chance to do some testing on this. A simple test:

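import rdflib

# The same data sample as in the original report, with a numeric @value.
my_json = """
{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": 100000
  }
}
"""

# Parse the JSON-LD and print the resulting graph as Turtle.
g = rdflib.Graph()
g.parse(data=my_json, format="json-ld")
g.print()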

This prints

@prefix : <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[] a :MonetaryAmount ;
    :value "100000"^^xsd:float .

So it appears there is no bug in the JSON-LD parser: it parses the amount as an xsd:float, and when serializing back into Turtle, it remains an xsd:float. So the issue must lie elsewhere. I'll look into it further.

ashleysommer commented 2 years ago

Ok, I've worked out one key difference between the JSON-LD example and the Turtle example.

Even though the datatype of both is xsd:float, the "lexical value" of the data in the Turtle version is a string ("100000"), and the lexical value of the JSON-LD version is an integer (100000).

When setting up a Literal value, RDFLib has the ability to parse a lexical string into a real value matching the datatype, but only when the lexical value is a string.

This has never been an issue in the past, because in Turtle and other RDF data formats the lexical value of any typed value is always a string. But in JSON-LD, it can clearly be something other than a string.
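
The point about JSON-LD can be seen with Python's standard json module alone, before any RDF library is involved. A small sketch (the keys "a", "b" and "c" are made up for illustration):

```python
import json

# Three ways of writing the same amount as a JSON-LD value object.
doc = json.loads(
    '{"a": {"@type": "xsd:float", "@value": 100000},'
    ' "b": {"@type": "xsd:float", "@value": 100000.0},'
    ' "c": {"@type": "xsd:float", "@value": "100000"}}'
)

# Record the Python type that the JSON decoder produced for each @value.
value_types = {key: type(node["@value"]).__name__ for key, node in doc.items()}
print(value_types)  # {'a': 'int', 'b': 'float', 'c': 'str'}
```

Only the first variant reaches the parser as a Python int, and it is the one that fails validation in this thread.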

As a simple example, replace the numeric @value in your JSON-LD with a string:

{
  "@context": {
    "@vocab": "http://schema.org/",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@type": "MonetaryAmount",
  "value": {
    "@type": "xsd:float",
    "@value": "100000"
  }
}

You will see your example now passes as expected.

So now I understand that this issue lies somewhere between the JSON-LD parser and RDFLib's handling of Literal lexical values. It could possibly be fixed by adding an extra translation step in the JSON-LD parser, or by adding an extra conversion of non-string lexicals in RDFLib. It might also be easier to fix it at the PySHACL level by modifying how the datatype constraint works, allowing more kinds of values for xsd:float and xsd:double.

ajnelson-nist commented 2 years ago

Hi @ashleysommer ,

Apologies for butting in, but I saw a notice for this fly by and remembered an issue with default datatypes I'd encountered a while ago. Some data from my community was getting flagged after we had an "All non-integer numbers are now xsd:decimal" decision. The standards-section citations are in this commit:

https://github.com/casework/CASE-Examples/commit/af9d622ec5e693ce0a19627199baaaef0bbc5f27

ashleysommer commented 2 years ago

Thanks @ajnelson-nist, that's great to see. Personally, I too always try to use xsd:decimal wherever possible rather than xsd:float or xsd:double.

Floats and doubles are plagued by implementation issues: they are treated differently in different programming languages, and it is easy to run into the issue we see in this thread. E.g., should the lexical value 100000 be converted to a float? A float in Python is actually a double. So given that the datatype is xsd:float, should it still fail validation? Should it really be xsd:double?

I believe the current way that RDFLib handles it is probably fine. After all, there's nothing stopping you from writing "cat"^^xsd:float. RDFLib will happily accept that as a real Literal value, because that's what you've specified: the value will still be "cat", and the datatype will still be xsd:float. But it would fail the PySHACL datatype constraint of xsd:float.

Similarly, as per the issue described above, the lexical value is an int but the datatype is xsd:float. RDFLib doesn't care: the value is still an int and the datatype is still xsd:float, but as we see, it does fail the datatype constraint.

Given that xsd:decimal will always have a string as its lexical form (because there are some decimals that cannot be represented as an int, float, or double), and RDFLib will parse it to a Python Decimal when loaded, adopting this practice will solve the class of issues seen here.
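
Why a string lexical form matters for decimals can be shown with Python's standard decimal module. This sketch does not involve RDFLib; it only illustrates the difference between building a Decimal from a string and from a binary float.

```python
from decimal import Decimal

from_string = Decimal("0.1")  # exact: built directly from the lexical form
from_float = Decimal(0.1)     # inexact: inherits the binary float rounding error

print(from_string == from_float)  # False
print(Decimal("100000.10"))       # 100000.10 (trailing zero is preserved)
print(Decimal("100000") == 100000)  # True (whole numbers compare equal to ints)
```

A value such as 0.1 has no exact binary floating-point representation, so only the string route carries it through unchanged.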

tobiasschweizer commented 2 years ago

Thanks @ashleysommer for looking into this. So if I understand correctly, instead of "@value": 100000 we could simply write "@value": "100000" to sidestep the problem.

So maybe the source of the problem lies in the isinstance check as mentioned above? 100000 is represented as an int in Python which is not an instance of float.

Maybe the relations between numeric types need to be taken into account here. I am no Python expert, but I remember that in Java you can assign an int to a variable of type double but not the other way around. So wouldn't the solution be to accept both int and float when doing the check for xsd:float?
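
As a rough sketch of what such a relaxed check might look like in Python: the function name `is_acceptable_float` is hypothetical, and this is not necessarily how pySHACL implements or would implement the constraint.

```python
# Hypothetical relaxed check, not pySHACL code: accept ints as well as floats.
def is_acceptable_float(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    return isinstance(value, (int, float))


print(is_acceptable_float(100000))    # True
print(is_acceptable_float(100000.0))  # True
print(is_acceptable_float("cat"))     # False
print(is_acceptable_float(True))      # False
```

The explicit bool exclusion is a design choice of this sketch: without it, True and False would also count as numbers.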

tobiasschweizer commented 2 years ago

@ajnelson-nist this is somewhat off-topic, but aren't you working on https://github.com/lambdamusic/Ontospy/pull/107? :-)

ashleysommer commented 2 years ago

So if I understand correctly, instead of "@value": 100000 we could simply write "@value": "100000" to sidestep the problem.

That's right. If it is possible to do that in your data files, that is the easiest way forward.

It works because when RDFLib processes a new Literal object, it has special rules for the case where the lexical value is a string. When it is a string and the literal has a known XSD datatype attached, RDFLib will attempt to parse the string into that format, so the value of the literal will be 100000 as a Python float. On the other hand, when the lexical value is an int, RDFLib doesn't know it can convert it, so it keeps the value as an int.
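
The behaviour described above can be pictured with a toy model. This is a simplified sketch, not RDFLib's actual implementation: the function name `to_python_value`, the prefixed datatype strings and the tiny parser table are all assumptions made for illustration.

```python
# Toy model only (not RDFLib code): strings are parsed according to a known
# datatype, while any other Python object is kept exactly as it was given.
PARSERS = {"xsd:float": float, "xsd:integer": int}


def to_python_value(lexical, datatype):
    if isinstance(lexical, str) and datatype in PARSERS:
        try:
            return PARSERS[datatype](lexical)
        except ValueError:
            # Toy choice: leave an unparseable string unchanged.
            return lexical
    return lexical


print(type(to_python_value("100000", "xsd:float")).__name__)  # float
print(type(to_python_value(100000, "xsd:float")).__name__)    # int
```

Under this model, a strict float check succeeds for the string input and fails for the int input, matching what was observed in the thread.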