SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

SHACL: Constraint on dct:format #176

Closed andrea-perego closed 2 years ago

andrea-perego commented 3 years ago

While validating some records, a sh:Violation (instead of a sh:Warning) is raised when the distribution format is not specified as a URI reference.

Is this intentional?

jakubklimek commented 3 years ago

DCAT 2 specifies that the range of dcterms:format is dct:MediaTypeOrExtent, therefore, the value must be a URI reference. dcterms:format is, at the same time, Recommended, not Mandatory for a Distribution.

If by

distribution format is not specified as a URI reference

you mean that the format is specified as an RDF Literal instead, then it makes sense to me to treat this as sh:Violation: the spec says that the format does not have to be there, but when it is, it must be a URI reference. A value that is present but is not a URI reference therefore violates the spec, and applications that count on the value being a URI may break.
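To make the two cases concrete, here is a minimal sketch in Turtle. The `ex:` namespace, the distribution names, and the access URL are assumptions made for illustration only; the file-type IRI follows the pattern of the EU file type authority table.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .   # hypothetical namespace

# Format given as an RDF literal: the value is present, but it is not a URI reference.
ex:distribution-literal a dcat:Distribution ;
    dcat:accessURL <http://example.org/data.csv> ;
    dct:format "CSV" .

# Format given as a URI reference.
ex:distribution-iri a dcat:Distribution ;
    dcat:accessURL <http://example.org/data.csv> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> .
```

A property shape with sh:nodeKind sh:IRI on dct:format would report the first distribution and accept the second, at least as far as the node kind is concerned.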

andrea-perego commented 3 years ago

@jakubklimek said:

DCAT 2 specifies that the range of dcterms:format is dct:MediaTypeOrExtent, therefore, the value must be a URI reference.

Not necessarily. It can be a blank node. The constraint I am referring to is the following one, from dcat-ap_2.0.1_shacl_mdr-vocabularies.shape.ttl:

:Distribution_ShapeCV
    a sh:NodeShape ;
    sh:property [
        sh:node :FileTypeRestriction ;
        sh:nodeKind sh:IRI ;
        sh:path dct:format ;
        sh:description "A non EU managed concept is used to indicate the format of the distribution. If no corresponding can be found inform the maintainer of the fileformat NAL." ;
        sh:severity sh:Violation
    ] , ...

The use of URI references for controlled vocabularies is recommended in DCAT-AP, not mandatory. So, the sh:severity here should be sh:Warning. Otherwise, the DCAT-AP spec may need to be revised accordingly.
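As a sketch of what the relaxed constraint could look like, the fragment below restates the property shape quoted above with sh:Warning, followed by a record that uses a blank node as the format. Everything in the `ex:` namespace is hypothetical: the shape name, the sh:targetClass (the original shape is elided at that point), and `ex:FileTypeRestriction`, which only stands in for the real :FileTypeRestriction.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/shapes#> .   # hypothetical namespace

# Same constraints as the quoted property shape, reported as a warning.
ex:Distribution_ShapeCV_Relaxed
    a sh:NodeShape ;
    sh:targetClass dcat:Distribution ;          # assumption: not shown in the quoted excerpt
    sh:property [
        sh:path dct:format ;
        sh:nodeKind sh:IRI ;
        sh:node ex:FileTypeRestriction ;
        sh:severity sh:Warning
    ] .

# Placeholder without constraints; the real restriction is defined in the DCAT-AP shapes.
ex:FileTypeRestriction a sh:NodeShape .

# A distribution whose format is a blank node rather than a URI reference.
ex:distribution-bnode a dcat:Distribution ;
    dct:format [ a dct:MediaTypeOrExtent ; rdfs:label "CSV" ] .
```

Because sh:severity is declared on the property shape as a whole, both the sh:nodeKind check and the sh:node check are reported at the same level; the blank-node format above would be flagged as a sh:Warning instead of a sh:Violation.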

jakubklimek commented 3 years ago

The use of URI references for controlled vocabularies is recommended in DCAT-AP, not mandatory.

@andrea-perego Actually, section 5.2 of DCAT-AP 2.0.1 states:

In the table below, a number of properties are listed with controlled vocabularies that MUST be used for the listed properties.

I would say that makes them mandatory, not recommended.

andrea-perego commented 3 years ago

@jakubklimek said:

The use of URI references for controlled vocabularies is recommended in DCAT-AP, not mandatory.

@andrea-perego Actually, section 5.2 of DCAT-AP 2.0.1 states:

In the table below, a number of properties are listed with controlled vocabularies that MUST be used for the listed properties.

I would say that makes them mandatory, not recommended.

Strictly speaking, this is a constraint on skos:inScheme. The recommendation about using URI references is in Section 5.1:

Controlled vocabularies SHOULD:

  • ...
  • Have terms that are identified by URIs with each URI resolving to documentation about the term.

However, if the use of URI references is mandatory for all controlled vocabularies, this requirement is inconsistently implemented in the SHACL shapes. E.g., in the following SHACL shape, not using a URI reference results sometimes in a sh:Violation, sometimes in a sh:Warning:

:Dataset_ShapeCV
    a sh:NodeShape ;
    sh:property [
        sh:node :FrequencyRestriction ;
        sh:nodeKind sh:IRI ;
        sh:path dct:accrualPeriodicity ;
        sh:description "A non EU managed concept is used to indicate the accrualPeriodicity frequency. If no corresponding can be found inform the maintainer of the EU frequency NAL" ;
        sh:severity sh:Violation
    ], [
        sh:node :LanguageRestriction ;
        sh:nodeKind sh:IRI ;
        sh:path dct:language ;
        sh:description "A non EU managed concept is used to indicate a language. If no corresponding can be found inform the maintainer of the EU language NAL" ;
        sh:severity sh:Violation
    ], [
        sh:node :CorporateBodyRestriction ;
        sh:node :Publisher_ShapeCV ;
        sh:nodeKind sh:IRI ;
        sh:path dct:publisher ;
        sh:description "A non EU managed concept is used to indicate the publisher, check if a corresponding exists in the EU corporates bodies NAL" ;
        sh:severity sh:Warning
    ], [
        sh:node [
            a sh:NodeShape ;
            sh:or (:CountryRestriction
                :PlaceRestriction
                :ContinentRestriction
                :GeoNamesRestriction
            )
        ] ;
        sh:nodeKind sh:IRI ;
        sh:path dct:spatial ;
        sh:description "A non managed concept is used to indicate a spatial description, check if a corresponding exists" ;
        sh:severity sh:Warning
    ], [
        sh:node :DataThemeRestriction ;
        sh:nodeKind sh:IRI ;
        sh:path dcat:theme ;
        sh:description "Multiple themes can be used but at least one concept of <http://publications.europa.eu/resource/authority/data-theme> should be present" ;
        sh:severity sh:Warning
    ] ;
    sh:targetClass dcat:Dataset.

A possible explanation can be guessed from the sh:description of each property shape:

  1. you get a sh:Violation where only the vocabularies in section 5.2 can be used; otherwise,
  2. you get a sh:Warning.

If this interpretation is correct, the fact that some properties MUST be used ONLY with some vocabularies should be clearly spelt out in the DCAT-AP specification, which states instead (in section 5.3) that you can use additional vocabularies, provided that those in section 5.2 are used.
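For comparison, the section 5.3 reading ("additional vocabularies are allowed, provided that those in section 5.2 are used") could be expressed with a qualified value shape, which requires at least one conforming value without rejecting the others. This is only a sketch under stated assumptions: the `ex:` names are hypothetical, and `ex:DataThemeRestriction` is a simplified stand-in for the real restriction, based on the skos:inScheme constraint mentioned above.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/shapes#> .   # hypothetical namespace

# A value conforms if it is declared to be in the EU data-theme scheme.
ex:DataThemeRestriction
    a sh:NodeShape ;
    sh:property [
        sh:path skos:inScheme ;
        sh:hasValue <http://publications.europa.eu/resource/authority/data-theme>
    ] .

# At least one dcat:theme value must conform; further values are not constrained.
ex:Dataset_ThemeAtLeastOne
    a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [
        sh:path dcat:theme ;
        sh:qualifiedValueShape ex:DataThemeRestriction ;
        sh:qualifiedMinCount 1 ;
        sh:severity sh:Warning
    ] .
```

Two caveats apply to this formulation: it also reports datasets that have no dcat:theme at all, and the skos:inScheme check can only succeed if the vocabulary triples are available in the data graph being validated.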

bertvannuffelen commented 3 years ago


A possible explanation can be guessed from the sh:description of each property shape:

  1. you get a sh:Violation where only the vocabularies in section 5.2 can be used; otherwise,
  2. you get a sh:Warning.

If this interpretation is correct, the fact that some properties MUST be used ONLY with some vocabularies should be clearly spelt out in the DCAT-AP specification, which states instead (in section 5.3) that you can use additional vocabularies, provided that those in section 5.2 are used.

Reading all the feedback, this seems to be the point that triggered the discussion.

Indeed, the SHACL shapes treat the use of the listed mandatory controlled vocabularies as mandatory (sh:Violation) wherever the specification does not explicitly mention that another codelist is accepted.

The goal of this strictness is, of course, to create an EU-wide common perspective on the data when aggregating catalogues.

Nevertheless, I think the rules for the usage of controlled vocabularies are in order:

  1. when the usage qualification for a codelist is MANDATORY, it MUST be used
  2. when the usage qualification for a codelist is CONDITIONAL MANDATORY, the codelist must be used when it is possible to meet the condition (*)
  3. when the usage qualification for a codelist is RECOMMENDED/OPTIONAL, it is an aid for implementations to follow it.

(*) An example for 2 is the EU corporate bodies NAL. In practice, no data catalogue, except a catalogue targeting the EC, can use it without serious implementation costs. Although the SHACL encodes it, it will probably never be triggered by any Member State (MS) catalogue.

The list in section 5.2 covers all 3 cases under a title "Controlled vocabularies to be used".

bertvannuffelen commented 3 years ago

There is also another line of thought in this issue, namely the recommendation to use proper codelists, that is, real controlled vocabularies.

That is the other message in the specification: where possible, use a managed controlled vocabulary and not a flat list of values. The target is set by the controlled vocabularies of the Publications Office (PO): persistent URIs, SKOS-based modelling, concept lifecycle management, etc.

So instead of

_:mydataset dcat:theme "education". 

we prefer

_:mydataset dcat:theme <http://publications.europa.eu/resource/authority/data-theme/EDUC>. 

But, as we know, the latter can be encoded in many ways, depending on the quality of publication of the codelist.

The following cases are intermediate situations:

  1. technical alignment using an anonymous node
    _:mydataset dcat:theme [ 
        a skos:Concept ; 
        skos:prefLabel "education" ]. 
  2. technical alignment using a skolem node
    _:mydataset dcat:theme <https://mydomain/.wellknown/edu>.
    <https://mydomain/.wellknown/edu> 
        a skos:Concept ; 
        skos:prefLabel "education".
  3. use a local codelist not published as SKOS
    _:mydataset dcat:theme <http://api.catalogue.ms.com/theme/education>.
    <http://api.catalogue.ms.com/theme/education>
        a skos:Concept ;
        skos:prefLabel "education".
  4. use a local codelist published as SKOS
    _:mydataset dcat:theme <http://api.catalogue.ms.com/theme/education>.

Observe that case 4 is closest to the preferred case; however, the difference between 4 and 3 is hard to establish. In 3 the codelist is supplied with the catalogue, in 4 it isn't, and that makes a big difference for the SHACL shapes.
So it is hard to make generic SHACL shapes that cover all 4 cases, because the validation needs to know the background knowledge.
Moreover, the difference between 4 and 1 is very shallow: if the additional information needed to pass the SHACL constraint amounts to creating situation 3, then a publisher can simply claim that it satisfies the constraint, and that will not contribute to the quality of the publication.

The SHACL shapes check one pattern, maybe in a very strict formulation, but it will be the implementation context that needs to assess if this formulation is adequate or not. 
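The dependence on background knowledge can be illustrated with a small shape. This is a sketch and not part of the DCAT-AP shapes: the `ex:` namespace and the shape name are hypothetical, and it assumes the dataset node is typed as dcat:Dataset so that the target applies.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/shapes#> .   # hypothetical namespace

# Each theme must be an IRI that is typed as skos:Concept in the data graph.
ex:Dataset_ThemeIsConcept
    a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [
        sh:path dcat:theme ;
        sh:nodeKind sh:IRI ;
        sh:class skos:Concept ;
        sh:severity sh:Warning
    ] .
```

With the data from case 3, the theme is an IRI and its skos:Concept typing is supplied with the catalogue, so the shape is satisfied. With the data from case 4 alone, the same IRI fails sh:class unless the codelist is loaded into the data graph before validation. Case 1 fails for a different reason: the blank node is typed as skos:Concept but does not satisfy sh:nodeKind sh:IRI.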

I am open to discussion about the shapes for the controlled vocabularies, but be aware that there are many formulations of the same constraint, depending on the objective you have with the validation and on the usage context (namely which background knowledge you take into account, what you expect to receive from the catalogue owner, the quality of codelist publication, and the inference engines you use).