SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

:Category_Property_skos_prefLabel finds issues in additional OWL data #248

Closed volkerjaenisch closed 2 weeks ago

volkerjaenisch commented 1 year ago

Dear SEMICeu!

I am using the ITB-Validator with

dcat-ap_2.1.1_shacl_shapes.ttl and dcat-ap-de-imports.ttl

and also for comparison pySHACL with

shapes: dcat-ap_2.1.1_shacl_shapes.ttl and owl_graph: dcat-ap-de-imports.ttl

In both cases the validation reports violations in the OWL data.

Shape

:Category_Property_skos_prefLabel
    sh:minCount 1 ;
    sh:nodeKind sh:Literal ;
    sh:path skos:prefLabel ;
    sh:severity sh:Violation .

leads to

 [First of 5 occurrences] Property needs to have at least 1 value
Location:[Focus node] - [http://publications.europa.eu/resource/authority/dataset-type/NAL] - 
[Result path] - [http://www.w3.org/2004/02/skos/core#prefLabel]
Test:[Shape] - [http://data.europa.eu/r5r#Category_Property_skos_prefLabel]

Yes, the violations in the OWL data are surely there. But how does this help my software to validate the payload DCAT-AP data? I cannot change the OWL files - but at the end of the day my software has to deliver correct DCAT-AP data.

This problem stems from the fact that SHACL validators enrich the DCAT-AP payload graph with the OWL data and then validate the complete graph. But this is a technical detail and no real excuse.

My algorithm to make the data DCAT-AP compatible is to delete from the graph the properties/nodes that the validator marks as violations. The graph is then validated again, and the process iterates until no violations are found.
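The validate, delete, repeat loop described above can be sketched roughly as follows. This is only one possible reading of the algorithm, not the actual implementation: triples are modelled as plain string tuples instead of an RDF library graph, and `prune_until_valid`, `toy_validate` and the two toy rules (a concept needs `skos:prefLabel`, a dataset needs `dct:type`) are hypothetical names and assumptions made for illustration.

```python
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]
# A violation is (focus_node, result_path), mirroring the report fields above.
Violation = Tuple[str, str]


def prune_until_valid(
    graph: Set[Triple],
    validate: Callable[[Set[Triple]], Iterable[Violation]],
    max_rounds: int = 100,
) -> Set[Triple]:
    """Drop every triple touching a reported focus node, then re-validate."""
    current = set(graph)
    for _ in range(max_rounds):
        violations = list(validate(current))
        if not violations:
            return current
        bad_nodes = {focus for focus, _path in violations}
        # Remove the focus node both as subject and as object of other triples.
        current = {
            (s, p, o) for (s, p, o) in current
            if s not in bad_nodes and o not in bad_nodes
        }
    raise RuntimeError("graph did not converge within max_rounds")


def toy_validate(graph: Set[Triple]) -> Iterable[Violation]:
    """Hypothetical stand-in for a SHACL engine with two minCount rules."""
    def typed(cls: str) -> Set[str]:
        return {s for (s, p, o) in graph if p == "rdf:type" and o == cls}

    def having(prop: str) -> Set[str]:
        return {s for (s, p, _o) in graph if p == prop}

    for concept in sorted(typed("skos:Concept") - having("skos:prefLabel")):
        yield (concept, "skos:prefLabel")
    for dataset in sorted(typed("dcat:Dataset") - having("dct:type")):
        yield (dataset, "dct:type")


graph = {
    ("ex:dataset1", "rdf:type", "dcat:Dataset"),
    ("ex:dataset1", "dct:type", "type:NAL"),
    # Comes from the imported vocabulary data and has no prefLabel.
    ("type:NAL", "rdf:type", "skos:Concept"),
}
result = prune_until_valid(graph, toy_validate)
```

In this toy run the first round removes `type:NAL` together with the `dct:type` statement that points at it; the second round then finds the dataset without `dct:type` and removes it too, so `result` ends up empty. That is the cascade from vocabulary data into payload data.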

This leads to a suboptimal result if violations occur in the OWL data. For instance,

http://publications.europa.eu/resource/authority/dataset-type/NAL

lacks the required prefLabel and is thus deleted. The DCAT dataset that references this artifact may then become invalid, since the reference may be the value of a required property.

And this behavior is IMHO wrong. The data provider has correctly chosen the artifact http://publications.europa.eu/resource/authority/dataset-type/NAL. He may even be forced to use this artifact by another SHACL rule. Therefore he should not be punished by having his dataset discarded because of a faulty OWL definition that lies outside the scope of his DCAT-AP data.
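One way to avoid punishing the data provider is to act only on violations whose focus node is described in the payload itself and to ignore those that stem from the imported OWL/vocabulary data. A minimal sketch, assuming the report has already been turned into dictionaries with a `focus_node` key and the payload is available as a list of triples; `restrict_to_payload` and the `http://example.org/dataset/1` IRI are hypothetical:

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

NAL = "http://publications.europa.eu/resource/authority/dataset-type/NAL"


def restrict_to_payload(
    violations: Iterable[Dict[str, str]],
    payload_triples: Iterable[Triple],
) -> List[Dict[str, str]]:
    """Keep only violations whose focus node is a subject in the payload."""
    payload_subjects = {s for (s, _p, _o) in payload_triples}
    return [v for v in violations if v["focus_node"] in payload_subjects]


# Payload as delivered by the data provider, before any enrichment.
payload = [
    ("http://example.org/dataset/1", "http://purl.org/dc/terms/type", NAL),
]

# Violations reported on the enriched graph (payload plus imported data).
violations = [
    {"focus_node": NAL,
     "path": "http://www.w3.org/2004/02/skos/core#prefLabel"},
    {"focus_node": "http://example.org/dataset/1",
     "path": "http://purl.org/dc/terms/title"},
]

kept = restrict_to_payload(violations, payload)
```

Here only the violation on the dataset itself is kept. Note that the payload merely mentions the NAL concept as an object, so it does not count as a payload subject; if a payload legitimately describes its own concepts, they would still be checked.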

Here are other examples of violations from other OWL data.

[First of 16 occurrences] Property needs to have at least 1 value
Location:[Focus node] - [http://purl.org/adms/interoperabilitylevel/1.0] - 
[Result path] - [http://purl.org/dc/terms/title]
Test:[Shape] - [http://data.europa.eu/r5r#CategoryScheme_Property_dct_title]

The former are from the ITB testbench, the following are from pySHACL.

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://dcat-ap.de/def/politicalGeocoding/districtKey>
    Result Path: dct:title
    Message: Less than 1 values on <http://dcat-ap.de/def/politicalGeocoding/districtKey>->dct:title

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://purl.org/adms/licencetype/1.0>
    Result Path: dct:title
    Message: Less than 1 values on <http://purl.org/adms/licencetype/1.0>->dct:title

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://purl.org/adms/assettype/1.0>
    Result Path: dct:title
    Message: Less than 1 values on <http://purl.org/adms/assettype/1.0>->dct:title

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://dcat-ap.de/def/politicalGeocoding/municipalAssociationKey>
    Result Path: dct:title
    Message: Less than 1 values on <http://dcat-ap.de/def/politicalGeocoding/municipalAssociationKey>->dct:title

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://dcat-ap.de/def/politicalGeocoding/Level>
    Result Path: dct:title
    Message: Less than 1 values on <http://dcat-ap.de/def/politicalGeocoding/Level>->dct:title
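To see at a glance which vocabularies the failing nodes belong to, the textual report can be grouped by source shape. The sketch below is an assumption-laden helper, not part of pySHACL: the regular expression only matches the layout shown in the excerpt above (a `Source Shape:` line directly followed by a `Focus Node:` line with the IRI in angle brackets), and `focus_nodes_by_shape` is a hypothetical name.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the two consecutive report lines as they appear in the excerpt above.
BLOCK = re.compile(
    r"Source Shape:\s*(?P<shape>\S+)\s+Focus Node:\s*<(?P<focus>[^>]+)>"
)


def focus_nodes_by_shape(report_text: str) -> Dict[str, List[str]]:
    """Group focus node IRIs by the shape that reported them."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for match in BLOCK.finditer(report_text):
        grouped[match.group("shape")].append(match.group("focus"))
    return dict(grouped)


sample = """\
Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://dcat-ap.de/def/politicalGeocoding/districtKey>
    Result Path: dct:title

Constraint Violation in MinCountConstraintComponent (http://www.w3.org/ns/shacl#MinCountConstraintComponent):
    Severity: sh:Violation
    Source Shape: :CategoryScheme_Property_dct_title
    Focus Node: <http://purl.org/adms/licencetype/1.0>
    Result Path: dct:title
"""

grouped = focus_nodes_by_shape(sample)
```

For anything beyond a quick look, reading the validation report as an RDF graph is more robust than parsing the text form.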

I am new to SHACL and maybe I see things from the wrong direction or do not understand them at all.

Any help appreciated

Volker

bertvannuffelen commented 1 year ago

@volkerjaenisch my apologies for the late answer.

First, I would like to state that "validation is complex". Validation is always (a) a check against (b) a collection of constraints in (c) a data exchange context. All three elements (a), (b) and (c) are subject to an agreement, which in practice differs per case.

(a) depends on the engine, the data provided for checking, and the format. E.g., I have seen XML-based DCAT-AP compliance checkers that cannot differentiate between an empty string and an absent value.

(b) if you share data in a JSON(-LD) way, then it is common for coded values to exchange only the code `licence:cc-by-40`, while the semantics expect that this code is a LicenceDocument with additional properties. As this is background knowledge, sender and receiver might have a parallel agreement in the exchange that the code is correctly modeled. These constraints are then not included in the validation collection.

(c) the exchange context may have a large impact: suppose two parties PA and PB are sharing data with a third party PC; then one assumption might be that the data provided by PA is disjoint from the data provided by PB. But if that is not the case, then other validation expectations might apply.

In your case, you sketched the case for codelists. DCAT-AP takes into account the full scope of data exchange. In that scope, exchanging codes such as '2132312' does not contribute to the understanding of the data. To stimulate code publishers to take that aspect into account, DCAT-AP sets a requirement for a human-readable title (concept schemes) and prefLabel (concepts). For the harvesting process or pure data analysis this quality constraint is not relevant; however, if you as receiver want to build a human-readable view, it becomes crucial.

The SHACL constraints of DCAT-AP cannot make a combination of (a), (b) and (c) that matches everyone's situation. That is up to you: to build the right fit-for-purpose combination. DCAT-AP only provides the menu, not what you have selected.

Back to your concrete case: to resolve the validation errors, either (1) reach out to the codelist owners to support your constraints, (2) reduce the collection of validation rules, or do both.
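For option (2), one way to reduce the rule collection without editing the published shapes file is to switch off individual shapes with the standard SHACL property `sh:deactivated`. The sketch below only generates a small Turtle fragment for the two shapes named in this thread; `deactivation_turtle` is a hypothetical helper, and it is an assumption of this example that your validator setup lets you merge such a fragment into the shapes graph.

```python
from typing import Iterable

R5R = "http://data.europa.eu/r5r#"


def deactivation_turtle(shape_iris: Iterable[str]) -> str:
    """Build a Turtle fragment that marks the given shapes as deactivated."""
    lines = ["@prefix sh: <http://www.w3.org/ns/shacl#> .", ""]
    for iri in sorted(set(shape_iris)):
        lines.append(f"<{iri}> sh:deactivated true .")
    return "\n".join(lines) + "\n"


# Shape IRIs taken from the validation reports in this thread.
fragment = deactivation_turtle([
    R5R + "Category_Property_skos_prefLabel",
    R5R + "CategoryScheme_Property_dct_title",
])
print(fragment)
```

Deactivating a shape switches it off for payload data as well, so this fits best when labels and titles of codelists are agreed to be out of scope for the exchange.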