SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

SHACL :DateOrDateTimeDataType_Shape does not allow for xsd:dateTimeStamp #238

Closed volkerjaenisch closed 3 months ago

volkerjaenisch commented 1 year ago

Dear SEMICeu!

We are currently integrating the ISAITB SHACL validator into our portal. While testing, we stumbled over the following validation result:

sh:result    [ rdf:type                      sh:ValidationResult ;
                 sh:focusNode                  <https://geobasis-bb.de#dcat_Dataset_568978c5-fa73-48d1-a6f9-487aabdc1aef> ;
                 sh:resultMessage              "dcat:Dataset: dct:modified MUSS ein als xsd:date, xsd:dateTime, xsd:gYear oder xsd:gYearMonth getyptes Literal sein. Es DARF maximal einmal vorhanden sein."@de ;
                 sh:resultPath                 dcterms:modified ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:NodeConstraintComponent ;
                 sh:sourceShape                dcatap:Dataset_Property_dct_modified ;
                 sh:value                      "2022-11-17T09:37:25.626789"^^xsd:dateTimeStamp
               ] ;

The error message is in German since we use some additions from GovData. So I checked the online version https://www.itb.ec.europa.eu/shacl/dcat-ap/upload showing the same problem:

  sh:result    [ rdf:type                      sh:ValidationResult ;
                 sh:focusNode                  <https://geobasis-bb.de#dcat_Dataset_6de7c97b-152a-4fef-8f3e-1dbbf560adf4> ;
                 sh:resultMessage              "Value does not have shape :DateOrDateTimeDataType_Shape" ;
                 sh:resultPath                 dcterms:modified ;
                 sh:resultSeverity             sh:Violation ;
                 sh:sourceConstraintComponent  sh:NodeConstraintComponent ;
                 sh:sourceShape                _:b3 ;
                 sh:value                      "2022-11-17T09:40:11.539595"^^xsd:dateTimeStamp
               ] ;

The processing shape is from dcat-ap_2.1.1_shacl_shapes.ttl :

:DateOrDateTimeDataType_Shape
    a sh:NodeShape ;
    rdfs:comment "Date time date disjunction shape checks that a datatype property receives a temporal value: date, dateTime, gYear or gYearMonth literal" ;
    rdfs:label "Date time date disjunction" ;
    sh:message "The values must be data typed as either xsd:date, xsd:dateTime, xsd:gYear or xsd:gYearMonth" ;
    sh:or ([
            sh:datatype xsd:date
        ]
        [
            sh:datatype xsd:dateTime
        ]
        [
            sh:datatype xsd:gYear
        ]
        [
            sh:datatype xsd:gYearMonth
        ]
    ) .

xsd:dateTimeStamp is derived from xsd:dateTime (http://www.datypic.com/sc/xsd11/t-xsd_dateTimeStamp.html), so the shape looks correct to me and I would expect it to accept xsd:dateTimeStamp values.

I guess there is an OWL file missing which defines the inheritance via rdfs:subClassOf.

We have roughly 10,000 DCAT-AP.de files with xsd:dateTimeStamp from our ISO19115 harvest, so any help is appreciated.

Cheers, Volker

For reference, the dataset:

@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcatde: <http://dcat-ap.de/def/dcatde/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix adms: <http://www.w3.org/ns/adms#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .
@prefix spdx: <http://spdx.org/rdf/terms#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix vcard: <http://www.w3.org/2006/vcard/ns#> .
@prefix rdf4j: <http://rdf4j.org/schema/rdf4j#> .
@prefix sesame: <http://www.openrdf.org/schema/sesame#> .
@prefix fn: <http://www.w3.org/2005/xpath-functions#> .

<https://geobasis-bb.de#dcat_Dataset_568978c5-fa73-48d1-a6f9-487aabdc1aef> a dcat:Dataset;
  dct:description "Für die Digitalen Topographischen Karten werden Vektordaten aus dem Basis-DLM generalisiert und nach dem ATKIS-Signaturenkatalog bearbeitet. Die digitalen Daten können per Download oder auf anderen Medienträgern abgegeben werden. Sie liegen in max. 22 Inhaltsebenen (nach dem techn. Regelwerk der AdV) in drei Ausprägungen (Einzelebenen, Graukombination und Farbkombination) vor. Es gilt zu beachten, dass ein UTM-Gitter nur in den Einzelebenen ausgegeben wird. Die Standardauflösung beträgt 200L/cm = 508dpi. Eine Kartenausgabe gleichen Inhalts stellt die TK (ATKIS) als gedruckte Karte dar. Die Daten werden über automatisierte Verfahren oder durch Selbstentnahme kostenfrei bereitgestellt. Bei Nutzung der Daten sind die Lizenzbedingungen zu beachten."@de;
  dct:identifier "568978c5-fa73-48d1-a6f9-487aabdc1aef";
  adms:identifier "568978c5-fa73-48d1-a6f9-487aabdc1aef";
  dct:modified "2022-11-17T09:37:25.626789"^^xsd:dateTimeStamp;
  dct:publisher <https://geobasis-bb.de#foaf_Agent_568978c5-fa73-48d1-a6f9-487aabdc1aef>;
  dct:title "Digitale Topographische Karte 1 : 10 000 - 3846-SO Zossen - Neuhof"@de;
  dcat:contactPoint <https://geobasis-bb.de#vcard_Kind_568978c5-fa73-48d1-a6f9-487aabdc1aef>;
  dcat:theme <http://publications.europa.eu/resource/authority/data-theme/TECH>, <http://publications.europa.eu/resource/authority/data-theme/GOVE>,
    <http://publications.europa.eu/resource/authority/data-theme/REGI>, <http://publications.europa.eu/resource/authority/data-theme/ENVI>,
    <http://publications.europa.eu/resource/authority/data-theme/AGRI>, <http://inspire.ec.europa.eu/theme/lc>;
  dcat:distribution <https://geobasis-bb.de/lgb/de/geodaten/topographische-karten/top-karten-1-10000/>,
    <https://data.geobasis-bb.de/geobasis/information/legenden/legende_dtk10.pdf>, <https://geobroker.geobasis-bb.de/gbss.php?MODE=GetProductInformation&PRODUCTID=84579219-6849-4c89-90d0-aa7db3f26fa8>,
    <https://data.geobasis-bb.de/geobasis/daten/dtk/dtk10/ebenen/dtk10_ebenen_3846-so.zip>,
    <https://data.geobasis-bb.de/geobasis/daten/dtk/dtk10/kombination/dtk10_3846-so.zip>;
  dcatde:contributorID <http://dcat-ap.de/def/contributors/landBrandenburg>;
  dct:issued "2022-11-17T09:37:25.626872"^^xsd:dateTimeStamp;
  dcat:keyword "opendata"@de, "Vermessung"@de, "Karte"@de, "Verkehr"@de, "1:10.000"@de,
    "Bodenbedeckung"@de, "Rasterdaten"@de, "DTK10"@de, "DTK10FAR"@de, "DTK10GRA"@de, "3846-SO"@de;
  foaf:page <https://geobasis-bb.de/lgb/de/geodaten/topographische-karten/top-karten-1-10000/>,
    <https://data.geobasis-bb.de/geobasis/information/legenden/legende_dtk10.pdf>;
  <http://inqbus.de/nspriority> 30;
  dct:spatial <https://geobasis-bb.de#dct_Location_568978c5-fa73-48d1-a6f9-487aabdc1aef>;
  dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/CONT>;
  dct:isPartOf <https://geobasis-bb.de#dcat_Dataset_84579219-6849-4c89-90d0-aa7db3f26fa8> .

volkerjaenisch commented 1 year ago

Dear SEMICeu!

I tried the DCAT-AP shapes in two other SHACL processors:

pySHACL imports the shapes without problems and produces a validation result comparable to that of the ISAITB SHACL validator.

Validation Report
Conforms: False
Results (20):
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
    Severity: sh:Violation
    Source Shape: :Dataset_Property_dct_issued
    Focus Node: <https://geobasis-bb.de#dcat_Dataset_568978c5-fa73-48d1-a6f9-487aabdc1aef>
    Value Node: Literal("2022-11-17T09:37:25.626872" = None, datatype=xsd:dateTimeStamp)
    Result Path: dct:issued
    Message: Value does not conform to Shape :DateOrDateTimeDataType_Shape

Process finished with exit code 1

RDF4J throws an exception, and GraphDB does the same. I have no clue what this means. I will open a bug report at RDF4J.

javax.servlet.ServletException: org.eclipse.rdf4j.repository.RepositoryException: Shape with multiple types: <http://www.w3.org/ns/shacl#PropertyShape>, <http://www.w3.org/ns/shacl#NodeShape>
    org.eclipse.rdf4j.workbench.proxy.WorkbenchServlet.handleRequest(WorkbenchServlet.java:160)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchServlet.service(WorkbenchServlet.java:112)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchGateway.service(WorkbenchGateway.java:117)
    org.eclipse.rdf4j.workbench.base.AbstractServlet.service(AbstractServlet.java:129)
    org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    org.eclipse.rdf4j.workbench.proxy.CacheFilter.doFilter(CacheFilter.java:64)
    org.eclipse.rdf4j.workbench.proxy.CookieCacheControlFilter.doFilter(CookieCacheControlFilter.java:56)

Root Cause

org.eclipse.rdf4j.repository.RepositoryException: Shape with multiple types: <http://www.w3.org/ns/shacl#PropertyShape>, <http://www.w3.org/ns/shacl#NodeShape>
    org.eclipse.rdf4j.http.client.SPARQLProtocolSession.execute(SPARQLProtocolSession.java:1095)
    org.eclipse.rdf4j.http.client.SPARQLProtocolSession.executeNoContent(SPARQLProtocolSession.java:1049)
    org.eclipse.rdf4j.http.client.RDF4JProtocolSession.upload(RDF4JProtocolSession.java:1103)
    org.eclipse.rdf4j.http.client.RDF4JProtocolSession.upload(RDF4JProtocolSession.java:928)
    org.eclipse.rdf4j.http.client.RDF4JProtocolSession.upload(RDF4JProtocolSession.java:919)
    org.eclipse.rdf4j.repository.http.HTTPRepositoryConnection.add(HTTPRepositoryConnection.java:447)
    org.eclipse.rdf4j.workbench.commands.AddServlet.add(AddServlet.java:94)
    org.eclipse.rdf4j.workbench.commands.AddServlet.doPost(AddServlet.java:53)
    org.eclipse.rdf4j.workbench.base.TransformationServlet.service(TransformationServlet.java:98)
    org.eclipse.rdf4j.workbench.base.AbstractServlet.service(AbstractServlet.java:129)
    org.eclipse.rdf4j.workbench.proxy.ProxyRepositoryServlet.service(ProxyRepositoryServlet.java:100)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchServlet.service(WorkbenchServlet.java:215)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchServlet.handleRequest(WorkbenchServlet.java:137)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchServlet.service(WorkbenchServlet.java:112)
    org.eclipse.rdf4j.workbench.proxy.WorkbenchGateway.service(WorkbenchGateway.java:117)
    org.eclipse.rdf4j.workbench.base.AbstractServlet.service(AbstractServlet.java:129)
    org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53)
    org.eclipse.rdf4j.workbench.proxy.CacheFilter.doFilter(CacheFilter.java:64)
    org.eclipse.rdf4j.workbench.proxy.CookieCacheControlFilter.doFilter(CookieCacheControlFilter.java:56)

This outcome is both good and bad. Good, since the problem is not bound to a particular SHACL processor.
Bad, because we do not know why the inheritance is ignored.

Cheers, Volker

bertvannuffelen commented 1 year ago

@volkerjaenisch, your example is another variant of the topic of "inference"-based validation. The issue is that somewhere outside the rules there is information from which one could infer facts that would satisfy the constraint.

A classic example of this is Agents. The DCAT-AP rules state that a publisher must be an Agent. An Organisation is a subclass of Agent. Therefore a publisher p1 that is typed only as an Organisation is sufficient to derive that p1 is an Agent, and the validation rule is satisfied if inference takes place that derives "p1 is an Agent" from "p1 is an Organisation".
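The Agent case can be illustrated with a small data graph. This is only a sketch: the `ex:` namespace and the resources `ex:d1` and `ex:p1` are hypothetical, and it assumes the publisher constraint is expressed with `sh:class foaf:Agent`. As far as I read the SHACL specification, `sh:class` follows `rdfs:subClassOf` triples that are present in the data graph, so the last triple is the piece of "outside" knowledge under discussion.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# A dataset whose publisher is typed only as an Organisation.
ex:d1 a dcat:Dataset ;
    dct:publisher ex:p1 .

ex:p1 a foaf:Organization .

# The extra knowledge: without this triple (or an inference step that
# adds "ex:p1 a foaf:Agent"), a check for foaf:Agent would not match ex:p1.
foaf:Organization rdfs:subClassOf foaf:Agent .
```

Whether this triple should come from the data provider, from the validator, or from the profile itself is exactly the open question.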

But then the question arises: should we include this knowledge or not, and who should supply it? For example, suppose I have a German classification of Agents in the form of subclasses. Why would that classification not be acceptable to anyone who knows about it?

The same applies here, in the variant with XSD datatypes. The validation rules indeed list only xsd:dateTime and not xsd:dateTimeStamp. According to the definition in https://www.w3.org/TR/xmlschema11-2/#dateTimeStamp, xsd:dateTimeStamp is derived from xsd:dateTime, i.e. it is a subtype of it.

SHACL provides the means to take rdfs:subClassOf into account, but does not provide a means to include subclass relationships between the datatypes of literals.

So the only approach is to encode the "full hierarchy" in the SHACL expression. While there is technically no real harm in adding another case, it is not a future-proof solution.
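Enumerating the datatype explicitly could look like the following sketch. It is not the official DCAT-AP shape: the default namespace is a placeholder, and the only change relative to the shape quoted in the issue is the additional xsd:dateTimeStamp alternative and the adapted message.

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix :    <http://example.org/shapes#> .   # placeholder namespace (assumption)

:DateOrDateTimeDataType_Shape
    a sh:NodeShape ;
    sh:message "The values must be data typed as either xsd:date, xsd:dateTime, xsd:dateTimeStamp, xsd:gYear or xsd:gYearMonth" ;
    sh:or (
        [ sh:datatype xsd:date ]
        [ sh:datatype xsd:dateTime ]
        [ sh:datatype xsd:dateTimeStamp ]   # added alternative
        [ sh:datatype xsd:gYear ]
        [ sh:datatype xsd:gYearMonth ]
    ) .
```

One caveat: XML Schema 1.1 requires an explicit timezone on xsd:dateTimeStamp values. A processor that also checks lexical forms might therefore still reject a value such as "2022-11-17T09:37:25.626789", which has no timezone, even with this alternative in place.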

To the community, the following questions:

volkerjaenisch commented 1 year ago

Dear @bertvannuffelen !

Thank you for the detailed analysis. I agree with you, completely. I am quite interested in the answers from the community.

Nearly all our DCAT-AP.de datasets (harvested from ISO19115 data) use xsd:dateTimeStamp as the datatype of dct:modified and dct:created. I see several possible ways to deal with that:

1) We change the type. This only treats the symptom, since we have several other data providers whose RDF we simply harvest. Parsing all this RDF and changing the type (which is IMHO not an error) is a lot of 'money for nothing'.
2) SHACL learns (implements) inheritance (inference), see also 4).
3) xsd:dateTimeStamp is included in the DCAT-AP shape.
4) Some RDF/OWL snippet is added to the data that SHACL can use to accept xsd:dateTimeStamp.

IMHO 2) is the best solution. 3) is a straightforward fix that solves the xsd:dateTimeStamp issue and can be removed again if 2) ever becomes available.

Solution 4) may be the most flexible way since it enables the data provider to inform the validating instance of additional knowledge for the validation.

I tried 4) to no avail. I added to the data

xsd:dateTimeStamp rdfs:subClassOf xsd:dateTime

but neither SHACL processor (pySHACL, ITB) uses this information, even with explicitly forced inference (pySHACL). I assume this is what you meant by

SHACL provides the means to take rdfs:subClassOf into account, but does not provide a means to include subclass relationships between the datatypes of literals.

This is presumably because xsd:dateTimeStamp is an XML Schema datatype and not an RDF class. Maybe someone wiser than I can shed some light on this or even propose a working solution.

On the other hand, 4) may water down the strictness and consistency of a centralized SHACL validation. This could lead to hacks that make crappy data pass the validation.

Cheers, Volker