GovDataOfficial / DCAT-AP.de-SHACL-Validation

SHACL shapes for DCAT-AP.de
https://www.itb.ec.europa.eu/shacl/dcat-ap.de/upload
GNU Affero General Public License v3.0

Towards a common SHACL base #10

Closed: volkerjaenisch closed this issue 2 years ago

volkerjaenisch commented 3 years ago

Hi GovData!

You have taken a tremendous step forward here! At least a silver lining for Germany-wide uniform validation of DCAT-AP.de is appearing on the horizon. We have maintained a Python-based DCAT-AP.de parser (2000 LOC) for 4 years, and it sucks.

We at BBG are now taking on the SHACL challenge. We will use Python (pySHACL) to validate the BBG open data with the help of the SHACL files in this repository. Others are evidently using Java code for validation. Using this repo from two different languages will make it more compatible for everyone.

We contribute:

0) I fixed a minor bug - missing dot - in dcat-ap-de-shapes-specification.ttl, which your parser obviously just ignored.

I also enhanced the checks for dct:license and dcat:theme.

1) The former code defined implicit NodeShapes for rules 30 and 32, which may be OK with certain parsers:

    sh:property [
        sh:path dct:license ;
        sh:node [
            sh:hasValue <http://dcat-ap.de/def/licenses> ;
            # sh:minCount 1 ;
            sh:nodeKind sh:IRI ;
            sh:path skos:inScheme ;
        ] ;

        sh:severity sh:Violation ;
        sh:message "Pflicht (K32): Distributionen von Datensätzen MÜSSEN mit einer Lizenz ausgewiesen werden. Für die Kennzeichnung der im Abschnitt 2.1 aufgeführten Lizenzen MÜSSEN die dort genannten URIs verwendet werden. - Seite 23" ;
    ] ;
.

pySHACL errs on the side of safety and refused to work with these implicit, not well-defined shapes.

But IMHO the implicit NodeShapes were broken anyway: an RDF node may be an IRI, but then it may not at the same time have properties like skos:inScheme.

To clarify, I made these implicit NodeShapes explicit. Please have a look at the diff.

2) The logic of the shape for dct:license is IMHO too strict: at least one license, and EVERY license must conform to DCAT-AP.de? Dual licensing is quite common. I relaxed this to: at least one license that conforms to DCAT-AP.de.

3) I would also like to mention that the current SHACL code does not validate by Lenin's maxim: "Trust is good, control is better".

The following dcat:distribution:

<https://opendata.potsdam.de/api/v2/catalog/datasets/3d-gebaudemodell-lod2-citygml-csv> a dcat:Distribution ;
   <dct:license <https://inqbus.de/evil_license>;
   <skos:inScheme dct:license <https://inqbus.de/evil_license>;

will pass the current validation although it is obviously rubbish, since

dct:license <https://inqbus.de/evil_license>

is not in http://dcat-ap.de/def/licenses

3.1) The "DCAT-AP.de convention handbook" does not even mention a required skos:inScheme property for dct:license. So basing the check for dct:license compliance on this property is fruitless.

3.2) A better validation would be to check the URI of the license against the URIs from DCAT-AP.de. This could be done with some SPARQL in the SHACL code.
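The idea in 3.2 (compare the license URI itself against the published list, instead of trusting a skos:inScheme statement that the data provider controls) can be sketched outside SHACL in a few lines of plain Python. This is only an illustration of the logic, not the SPARQL-based shape proposed above. The two allow-list entries are hypothetical placeholders; a real check would load the URIs from the DCAT-AP.de license code list.

```python
# Hypothetical allow-list: placeholder URIs, NOT the real DCAT-AP.de code list.
# A real validator would load the published license URIs instead.
ALLOWED_LICENSES = frozenset({
    "http://dcat-ap.de/def/licenses/example-license-a",
    "http://dcat-ap.de/def/licenses/example-license-b",
})


def license_is_allowed(license_uri, allowed=ALLOWED_LICENSES):
    """Return True only if the URI is literally one of the listed URIs."""
    return license_uri in allowed


def rejected_licenses(license_uris, allowed=ALLOWED_LICENSES):
    """Return the license URIs of a distribution that are not in the list."""
    return [uri for uri in license_uris if uri not in allowed]


if __name__ == "__main__":
    # The license from the example above is rejected no matter what the
    # data says about its skos:inScheme membership.
    print(license_is_allowed("https://inqbus.de/evil_license"))
    print(rejected_licenses([
        "http://dcat-ap.de/def/licenses/example-license-a",
        "https://inqbus.de/evil_license",
    ]))
```

The trade-off is that the allow-list has to be regenerated whenever the code list changes.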

Cheers, Volker

volkerjaenisch commented 3 years ago

Just noticed that my explicit shapes had the wrong class type. This is fixed now.

:DCATAPde_license
    a sh:NodeShape ;
    sh:targetClass dct:LicenseDocument ;
    sh:property [
        sh:path skos:inScheme ;
        sh:hasValue <http://dcat-ap.de/def/licenses> ;
    ];
.
:DCATAPde_themes
    a sh:NodeShape ;
    sh:targetClass skos:Concept ;
    sh:property [
        sh:path skos:inScheme ;
        sh:hasValue <http://publications.europa.eu/resource/authority/data-theme>;
    ];
.

Cheers, Volker

init-dcat-ap-de commented 3 years ago

Hello @volkerjaenisch, thank you for your interest. I will try to answer your questions soon, after talking to @GovDataOfficial.

I skimmed over your text and have one question: Is it possible that your example distribution is passing because of typos? Shouldn't

<https://opendata.potsdam.de/api/v2/catalog/datasets/3d-gebaudemodell-lod2-citygml-csv> a dcat:Distribution ;
   <dct:license <https://inqbus.de/evil_license>;
   <skos:inScheme dct:license <https://inqbus.de/evil_license>;

be

<https://opendata.potsdam.de/api/v2/catalog/datasets/3d-gebaudemodell-lod2-citygml-csv> a dcat:Distribution ;
   dct:license <https://inqbus.de/evil_license> .

I am unsure what this is supposed to do:

   <skos:inScheme dct:license <https://inqbus.de/evil_license>;

But it is currently not a valid triple.

volkerjaenisch commented 3 years ago

@init-dcat-ap-de Yep. You are right. It should read:

<https://opendata.potsdam.de/api/v2/catalog/datasets/3d-gebaudemodell-lod2-citygml-csv> a dcat:Distribution ;
   <dct:license <https://inqbus.de/evil_license>;
   <skos:inScheme> <http://dcat-ap.de/def/licenses>;

init-dcat-ap-de commented 3 years ago

@volkerjaenisch As far as I can see, the example distribution is still not valid Turtle.

Line 2 has more < than >; the first < should not be there. Line 3 should not wrap skos:inScheme in <..>.

Also, line 3 says that the distribution itself is skos:inScheme <http://dcat-ap.de/def/licenses>. But you probably want to say that the license at <https://inqbus.de/evil_license> is a DCAT-AP.de license:

<https://opendata.potsdam.de/api/v2/catalog/datasets/3d-gebaudemodell-lod2-citygml-csv> a dcat:Distribution ;
   dct:license <https://inqbus.de/evil_license> .

<https://inqbus.de/evil_license> skos:inScheme <http://dcat-ap.de/def/licenses> .

Yes, checking the use of a correct license by testing for skos:inScheme <http://dcat-ap.de/def/licenses> is "vulnerable". But this approach has a practical advantage: no additional manual labor is needed to convert a new code list into a testable format.

We could harden this approach by adding the following to the rule:

sh:pattern "^http://dcat-ap.de/def/licenses/" ;  
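
What such a prefix pattern accepts can be tried out quickly with Python's `re` module. For a pattern this simple, it behaves like the regex matching that SHACL delegates to SPARQL's REGEX function, which is an assumption worth re-checking for more complex patterns. The sketch shows two limits of the hardening: the unescaped dots match any character, and any URI below the prefix passes whether or not it is in the code list. All test URIs except the one from this thread are made up.

```python
import re

# The pattern exactly as proposed above: the dots are unescaped.
AS_PROPOSED = re.compile(r"^http://dcat-ap.de/def/licenses/")
# Variant with escaped dots, so that "." only matches a literal dot.
ESCAPED = re.compile(r"^http://dcat-ap\.de/def/licenses/")


def matches(pattern, uri):
    """Unanchored search, like REGEX; the '^' in the pattern pins the start."""
    return pattern.search(uri) is not None


if __name__ == "__main__":
    candidates = [
        "https://inqbus.de/evil_license",              # from this thread
        "http://dcat-apXde/def/licenses/made-up",      # made-up host
        "http://dcat-ap.de/def/licenses/not-in-list",  # made-up license
    ]
    for uri in candidates:
        print(uri, matches(AS_PROPOSED, uri), matches(ESCAPED, uri))
```

The pattern therefore blocks foreign URIs such as the evil license, but it cannot tell a real list entry from an invented one under the same prefix.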

init-dcat-ap-de commented 3 years ago

Regarding the rule being too strict: DCAT-AP only allows the use of 1 dct:license. This is checked by the official DCAT-AP rules in https://raw.githubusercontent.com/SEMICeu/DCAT-AP/2.1.0-draft/releases/2.1.0/dcat-ap_2.1.0_shacl_shapes.ttl

So a distribution with more than one license is not valid, according to the specification. If there is only one license, "all of them" have to be from the dcat-ap.de-list.

volkerjaenisch commented 3 years ago

As you may have seen from the diffs, my comments refer to the GovData 1.0.2 DCAT SHACL code. And more than one license is allowed in that SHACL rule:

    sh:property [
        sh:path dct:license ;
        sh:minCount 1 ;

        sh:severity sh:Violation ;
        sh:message "Pflicht (K32): Distributionen von Datensätzen MÜSSEN mit einer Lizenz ausgewiesen werden. Für die Kennzeichnung der im Abschnitt 2.1 aufgeführten Lizenzen MÜSSEN die dort genannten URIs verwendet werden. - Seite 23" ;
    ] ;

Would you please be so kind as to admit that this formulation is not correct (neither according to the standard nor logically)? Instead, you are sidestepping by citing the DCAT-AP 2.1 draft. This is not really fruitful, since I have not yet seen a specification of DCAT-AP.de 2.1.

Regarding the current specification 1.1, you are correct that only 0..1 licenses are allowed. I was misled by your SHACL code 1.0.2. I apologize for my misunderstanding, my unjustified complaint, and for not looking at the standard first.

We should drop this bashing and have a fruitful discussion that gives us a good SHACL validator for Germany without driving away all our data providers.

What I think we have to discuss is the following. We have data providers which have their own licenses and taxonomy schemes.

They are not willing to drop their licenses, but it may be possible to convince them to also add a DCAT-AP.de compliant license; the same holds for the taxonomy. With the current 1.0.2 SHACL code, we will not be able to get a single distribution from these data providers into GovData. One prominent example is the capital of our federal state (Bundesland), which has a lot of datasets.

Concerning the categories (dcat:theme): DCAT-AP.de 1.1 (and also the DCAT-AP 2.1 draft) allows a catalog to specify more than one dcat:themeTaxonomy from which the dcat:theme values of the datasets may be chosen. Again, the GovData formulation in SHACL 1.0.2 effectively prohibits the use of a secondary taxonomy besides the MDR taxonomy recommended by DCAT-AP.de.

I admit that these are political rather than technical issues. We have to deal with the clash between an open-world data structure, RDF, and the need for setting standards. My understanding of RDF is that you can define your data in a way that fulfills your needs, and that you additionally have to fulfill some standards. In other words: I would like to have a SHACL validation for DCAT-AP.de that lets all data pass which fulfills DCAT-AP.de, without caring about e.g. additional taxonomies, additional licenses etc.
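
The difference between the two policies discussed here ("every license must be a DCAT-AP.de license" versus "at least one must be, extra ones are ignored") comes down to `all` versus `any`. The following sketch only illustrates that logic; it is not the SHACL code of this repository, and it deliberately leaves aside the 0..1 cardinality that the specification sets for dct:license. Membership is approximated by a URI prefix, and the license URIs in the example are hypothetical.

```python
# Assumption: membership is approximated by the URI prefix of the code list.
DCATDE_LICENSE_PREFIX = "http://dcat-ap.de/def/licenses/"


def is_dcatde_license(uri):
    """Rough stand-in for 'this URI is a DCAT-AP.de license'."""
    return uri.startswith(DCATDE_LICENSE_PREFIX)


def strict_ok(license_uris):
    """Strict policy: at least one license, and every one must conform."""
    return len(license_uris) >= 1 and all(
        is_dcatde_license(uri) for uri in license_uris
    )


def lenient_ok(license_uris):
    """Lenient policy: one conforming license is enough; others are ignored."""
    return any(is_dcatde_license(uri) for uri in license_uris)


if __name__ == "__main__":
    # Hypothetical dual-licensed distribution.
    dual = [
        "http://dcat-ap.de/def/licenses/example-license",
        "https://example.org/licenses/provider-own",
    ]
    print(strict_ok(dual), lenient_ok(dual))
```

Under the lenient policy a distribution without any conforming license still fails, so the requirement of DCAT-AP.de itself is not weakened; only the additional license is tolerated.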

The main problem with DCAT is that it was built without looking at it from the data provider's perspective first. Most data providers have their own systems loaded with data that is tied to their internal standards and structural needs. Let's imagine they have to build an interface to make their data available in a DCAT-AP.de conformant way. The interface has to drop their license and their taxonomy and instead supply a DCAT-AP.de license and an MDR classification (which is usually far worse than their own taxonomy).

To get such data providers on board with DCAT-AP.de, you have to make it easy for them to build such an interface instead of punishing them for having their own standards.

Easy means to simply

Our data providers do not make any money or gain any advantage by implementing an interface for DCAT-AP.de, so their motivation is really low to do anything.

Cheers, Volker

volkerjaenisch commented 3 years ago

@init-dcat-ap-de We have a meeting with GovData concerning SHACL on Thursday, 1 July 2021, 14:00 – 16:00. I would like to have you in the meeting, too. Please contact GovData (Mr. Horn) for the details.

Cheers, Volker

volkerjaenisch commented 3 years ago

Hello Mr. Rinsche. Could you please send me your contact details? Best regards, Volker Jaenisch

init-dcat-ap-de commented 2 years ago

The syntax errors will be fixed, thanks for pointing them out!

The shapes for the conventions :DCATAPde_themes and :DCATAPde_license cannot be adopted as they are, because they target skos:Concept and dct:LicenseDocument, respectively, via sh:targetClass. However, we also want to treat the class as implicitly given when, for example, dct:license points to a URI.

We try to use sh:targetClass only for the main classes under examination.
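
The objection can be illustrated with a small, simplified model in plain Python. Triples are tuples of prefixed names, all `ex:` resources are made up, and class targeting is reduced to an exact rdf:type match (SHACL additionally follows rdfs:subClassOf). A class-based target only reaches license nodes that carry an explicit type, whereas following dct:license from the distributions reaches every referenced license.

```python
RDF_TYPE = "rdf:type"

# Made-up example data: ex:licenseB is referenced but never typed.
TRIPLES = [
    ("ex:dist1", RDF_TYPE, "dcat:Distribution"),
    ("ex:dist1", "dct:license", "ex:licenseA"),
    ("ex:dist2", RDF_TYPE, "dcat:Distribution"),
    ("ex:dist2", "dct:license", "ex:licenseB"),
    ("ex:licenseA", RDF_TYPE, "dct:LicenseDocument"),
]


def focus_nodes_by_class(triples, cls):
    """Nodes selected by a class target (simplified: exact rdf:type only)."""
    return {s for (s, p, o) in triples if p == RDF_TYPE and o == cls}


def focus_nodes_by_reference(triples, predicate):
    """Nodes reached as values of a property, whatever their type."""
    return {o for (s, p, o) in triples if p == predicate}


if __name__ == "__main__":
    print(sorted(focus_nodes_by_class(TRIPLES, "dct:LicenseDocument")))
    print(sorted(focus_nodes_by_reference(TRIPLES, "dct:license")))
```

In this model the untyped `ex:licenseB` is never checked by the class-targeted shape, which is the gap described above.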