RDFLib / pySHACL

A Python validator for SHACL
Apache License 2.0

Validating xsd:date and xsd:dateTime #151

Open tobiasschweizer opened 2 years ago

tobiasschweizer commented 2 years ago

Hi there,

I have a question regarding validation of xsd:date and xsd:dateTime. I am using pyshacl version 0.19.1.

Given the following shapes:

{
  "@context": {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "sh": "http://www.w3.org/ns/shacl#",
    "schema": "http://schema.org/",
    "rescs": "http://rescs.org/"
  },
  "@graph": [
    {
      "@id": "rescs:dash/creativework/CreativeWorkShape",
      "@type": "sh:NodeShape",
      "rdfs:comment": {
        "@type": "xsd:string",
        "@value": "The most generic kind of creative work, including books, movies, photographs, software programs, etc."
      },
      "rdfs:label": {
        "@type": "xsd:string",
        "@value": "CreativeWork"
      },
      "sh:property": [
        {
          "sh:datatype": {
            "@id": "xsd:date"
          },
          "sh:description": "The date on which the CreativeWork was created or the item was added to a DataFeed.",
          "sh:maxCount": {
            "@type": "xsd:integer",
            "@value": 1
          },
          "sh:name": "dateCreated",
          "sh:path": {
            "@id": "schema:dateCreated"
          }
        }
      ],
      "sh:targetClass": {
        "@id": "schema:CreativeWork"
      }
    }
  ]
}

I noticed that schema:dateCreated passes validation as long as it carries the correct type annotation and its value is a string; the lexical form of the value does not seem to be checked.

So the following also passes validation, although the value is not a valid xsd:date but an xsd:dateTime:

"schema:dateCreated": {
  "@type": "xsd:date",
  "@value": "2022-07-08T06:48:22.159262"
}

Does pyshacl actually check if the given value string is a valid date or is this somehow out of scope?

Thanks for your feedback!
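
For readers following along, the kind of lexical check being asked about can be sketched in plain Python. This is only an illustration and not what pyshacl or rdflib do internally: the function name is_valid_xsd_date is hypothetical, and the pattern is a simplified form of the xsd:date lexical space (four-digit years only, timezone offsets not range-checked).

```python
import re
from datetime import date

# Simplified pattern for the xsd:date lexical form: YYYY-MM-DD followed by an
# optional timezone ("Z" or +hh:mm / -hh:mm). Negative years and years with
# more than four digits are deliberately left out of this sketch.
_XSD_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})(Z|[+-][0-9]{2}:[0-9]{2})?")


def is_valid_xsd_date(lexical: str) -> bool:
    """Return True if `lexical` looks like a valid xsd:date (simplified check)."""
    match = _XSD_DATE.fullmatch(lexical)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups()[:3])
    try:
        date(year, month, day)  # rejects e.g. month 13 or 30 February
    except ValueError:
        return False
    return True


print(is_valid_xsd_date("2022-07-08"))
print(is_valid_xsd_date("2022-07-08T06:48:22.159262"))
```

Under this check the plain date is accepted and the dateTime string from the example above is rejected, which is the behaviour the question expects from a validator.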

tobiasschweizer commented 2 years ago

I tried the above with https://github.com/TopQuadrant/shacl (CLI) version 1.4.2:

./shaclvalidate.sh -datafile datetime.ttl -shapesfile creativework.ttl
14:49:39 WARN riot :: [line: 2, col: 68] Lexical form '2022-07-08T06:48:22.159262' not valid for datatype XSD date
@prefix dash: <http://datashapes.org/dash#> .
@prefix graphql: <http://datashapes.org/graphql#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema1: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix swa: <http://topbraid.org/swa#> .
@prefix tosh: <http://topbraid.org/tosh#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

[ rdf:type sh:ValidationReport ;
  sh:conforms false ;
  sh:result [
    rdf:type sh:ValidationResult ;
    sh:focusNode <https://openalex.org/W2738724892> ;
    sh:resultMessage "Value must be a valid literal of type date e.g. ('YYYY-MM-DD')" ;
    sh:resultPath schema1:dateCreated ;
    sh:resultSeverity sh:Violation ;
    sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
    sh:sourceShape [] ;
    sh:value "2022-07-08T06:48:22.159262"^^xsd:date
  ]
] .

datetime.ttl

<https://openalex.org/W2738724892> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CreativeWork>.
<https://openalex.org/W2738724892> <http://schema.org/dateCreated> "2022-07-08T06:48:22.159262"^^<http://www.w3.org/2001/XMLSchema#date>.

creativework.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema1: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://rescs.org/dash/creativework/CreativeWorkShape> a sh:NodeShape ;
    rdfs:label "CreativeWork"^^xsd:string ;
    rdfs:comment "The most generic kind of creative work, including books, movies, photographs, software programs, etc."^^xsd:string ;
    sh:property [ sh:datatype xsd:date ;
            sh:description "The date on which the CreativeWork was created or the item was added to a DataFeed." ;
            sh:maxCount 1 ;
            sh:name "dateCreated" ;
            sh:path schema1:dateCreated ] ;
    sh:targetClass schema1:CreativeWork .

There are two things I can see in the output:

  1. a warning about an invalid xsd:date (parsing)
  2. a SHACL error for an invalid xsd:date

Shouldn't pyshacl also report an error for this case? Or is this related to rdflib, which should throw a warning for an invalid xsd:date?

Please let me know if I should provide more information about my use case. Thanks!

ashleysommer commented 2 years ago

Hi @tobiasschweizer

Sorry for the delayed response on this one.

This problem is coming from RDFLib. PySHACL uses the RDFLib library to check whether the Literal's lexical text matches its given datatype.

Note, there was some work done in this area in the lead up to the RDFLib v6.2.0 release, so the new version may have some changes that help with this issue.

Additionally, RDFLib v6.2.0 allows a Literal to be flagged as "ill-typed", that is, its lexical text does not match its given datatype. PySHACL can now use this flag to help complete the validation checks in the sh:datatype constraint.

There will be a new version of PySHACL out later today (pyshacl v0.20.0) that uses RDFLib v6.2.0 by default and takes advantage of this new "ill-typed" Literals feature, so please try that and let me know if it solves your issue.
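
To make the "ill-typed" idea concrete, here is a minimal standalone sketch. It does not use RDFLib or PySHACL: SketchLiteral, _parse_date and datatype_constraint_ok are hypothetical names invented for this illustration, and the sketch only knows about xsd:date. It shows how a datatype check can combine "the datatype IRI matches" with "the literal is not ill-typed".

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def _parse_date(lexical: str) -> date:
    """Strict YYYY-MM-DD parser; raises ValueError on anything else."""
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", lexical):
        raise ValueError(f"not a YYYY-MM-DD date: {lexical!r}")
    year, month, day = (int(part) for part in lexical.split("-"))
    return date(year, month, day)  # raises ValueError for impossible dates


@dataclass(frozen=True)
class SketchLiteral:
    """Hypothetical stand-in for an RDF literal (not RDFLib's Literal class)."""

    lexical: str
    datatype: str

    @property
    def ill_typed(self) -> Optional[bool]:
        """True if the lexical form does not fit the datatype, None if unknown."""
        if self.datatype != XSD_DATE:
            return None  # this sketch only knows how to check xsd:date
        try:
            _parse_date(self.lexical)
        except ValueError:
            return True
        return False


def datatype_constraint_ok(literal: SketchLiteral, expected: str) -> bool:
    """Datatype IRI must match and the literal must not be flagged ill-typed."""
    return literal.datatype == expected and literal.ill_typed is not True


good = SketchLiteral("2022-07-08", XSD_DATE)
bad = SketchLiteral("2022-07-08T06:48:22.159262", XSD_DATE)
print(datatype_constraint_ok(good, XSD_DATE), datatype_constraint_ok(bad, XSD_DATE))
```

The design point is that the flag has three states: a literal whose datatype the checker does not understand is "unknown" rather than "ill-typed", so it is not rejected on that basis alone.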

tobiasschweizer commented 2 years ago

Hi @ashleysommer

No worries, I was on a long holiday in August and did not do anything with RDF for a while ;-)

Thanks for the heads-up. I will gladly try the new pyshacl version and let you know about the outcome.

ashleysommer commented 2 years ago

Sorry, didn't mean to automatically close this

tobiasschweizer commented 2 years ago

I've just installed pyshacl 0.20.0, and pip automatically updated rdflib to 6.2.0. However, "2022-07-08T06:48:22.159262" is still regarded as a valid xsd:date.

ashleysommer commented 2 years ago

Thanks. I'll forward that up to the RDFLib team, the fix will lie with them now.

tobiasschweizer commented 1 year ago

Hi @ashleysommer ,

I've recently updated rdflib to 6.3.1 and I am now getting

in parse_date
    raise ISO8601Error('Unrecognised ISO 8601 date format: %r' % datestring)
isodate.isoerror.ISO8601Error: Unrecognised ISO 8601 date format: ...

So it seems that rdflib performs some actual checking of dates now which is great :-).

tobiasschweizer commented 1 year ago

I figured that rdflib delegates the date literal parsing to isodate's parse_date: https://github.com/gweis/isodate/blob/8856fdf0e46c7bca00229faa1aae6b7e8ad6e76c/src/isodate/isodates.py#L118

What I found a bit surprising is that rdflib automatically adds day precision to dates with year and month precision. This behaviour comes from isodate:

> For incomplete dates, this method chooses the first day for it. For instance if only a century is given, this method returns the 1st of January in year 1 of this century.

https://github.com/gweis/isodate/blob/8856fdf0e46c7bca00229faa1aae6b7e8ad6e76c/src/isodate/isodates.py#L126C1-L128C39

So this means that "2016"^^xsd:date in the original data is going to be a "2016-01-01"^^xsd:date when being validated.
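
The padding effect can be reproduced with a small standalone sketch. This is not isodate's code: lenient_parse and round_trips are hypothetical helpers that only mimic the "fill missing parts with the first day" behaviour described above, for the YYYY, YYYY-MM and YYYY-MM-DD forms. Comparing the re-serialised value with the original lexical form is one way to detect that the input was padded. Note that in XML Schema a bare year such as "2016" is the lexical form of xsd:gYear, not of xsd:date.

```python
import re
from datetime import date


def lenient_parse(lexical: str) -> date:
    """Parse YYYY, YYYY-MM or YYYY-MM-DD, defaulting missing parts to 1."""
    match = re.fullmatch(r"([0-9]{4})(?:-([0-9]{2})(?:-([0-9]{2}))?)?", lexical)
    if match is None:
        raise ValueError(f"unrecognised date format: {lexical!r}")
    # Missing month or day groups are None and are replaced with 1.
    year, month, day = (int(part) if part else 1 for part in match.groups())
    return date(year, month, day)


def round_trips(lexical: str) -> bool:
    """True only if parsing and re-serialising gives back the same text."""
    return lenient_parse(lexical).isoformat() == lexical


print(lenient_parse("2016"), round_trips("2016"))
print(lenient_parse("2016-05-17"), round_trips("2016-05-17"))
```

A lenient parser accepts "2016" and silently turns it into the first of January, whereas the round-trip comparison reveals that the original text was not a complete date.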

ashleysommer commented 1 year ago

> This behaviour comes from isodate
>
> So this means that "2016"^^xsd:date in the original data is going to be a "2016-01-01"^^xsd:date when being validated.

Yeah, I've seen this issue come up before (in Python, outside of the RDF world). I think we would see the same issue with whichever datetime library RDFLib uses. This level of detail in the RDF spec seems to be very implementation-specific.