SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

SHACL-Webinar: Validation of Codelists #165

Closed init-dcat-ap-de closed 2 years ago

init-dcat-ap-de commented 3 years ago

Thank you again for doing the SHACL-Webinar and releasing the screencast. We now had a chance to revisit the discussed problems and solutions. (See: #162) We want to move forward with the validation of DCAT-AP.de for GovData.de and we would like to do so by re-using the shapes from DCAT-AP.

Around the 90-minute mark, we discussed how codelists should be used within the SHACL rules.

It was our understanding that using skos:inScheme is acceptable as long as the codelist is of reasonable size. Since even http://publications.europa.eu/resource/authority/corporate-body with over 1000 entries can be handled by the tools, we think we can use this approach for all codelists. But as with the integrated background knowledge (#163), we have to decide how we announce/show which lists are checked. We would, again, prefer a way that is visible within the SHACL shapes, not only in the tool configuration.

(There is one codelist where this approach doesn’t work: http://data.europa.eu/r5r/availability/ <-- DCAT-AP’s own codelist.)
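For readers who want to see what a skos:inScheme check visible in the shapes could look like, here is a minimal sketch. It is an assumption for illustration, not the shapes published by DCAT-AP: the `ex:` namespace and both shape names are hypothetical, and the data-theme scheme is used only as an example. The check can only succeed if the codelist triples (each concept with its skos:inScheme statement) are available to the validator together with the data.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/shapes#> .   # hypothetical namespace

# Every value of dcat:theme on a dataset must conform to the restriction below.
ex:DatasetThemeShape
    a sh:NodeShape ;
    sh:targetClass dcat:Dataset ;
    sh:property [
        sh:path dcat:theme ;
        sh:node ex:DataThemeRestriction ;
    ] .

# A theme conforms if it is stated to be in the expected concept scheme.
# The scheme IRI is therefore visible in the shapes themselves.
ex:DataThemeRestriction
    a sh:NodeShape ;
    sh:property [
        sh:path skos:inScheme ;
        sh:hasValue <http://publications.europa.eu/resource/authority/data-theme> ;
    ] .
```

With this style, the list of checked codelists can be read off the sh:hasValue entries rather than from the tool configuration.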

init-dcat-ap-de commented 3 years ago

Related issues: #159, #127, #119

bertvannuffelen commented 3 years ago

In the branch https://github.com/SEMICeu/DCAT-AP/tree/2.1.0-draft/releases/2.1.0 a proposal for importing the used controlled vocabularies is made.

init-dcat-ap-de commented 3 years ago

Thank you for the new branch.

I tried to re-use the imported controlled vocabularies for our validator on https://www.itb.ec.europa.eu/shacl/dcat-ap.de/upload, because if I use the any-validator (https://www.itb.ec.europa.eu/shacl/any/upload) to validate

with the shapes

I get a lot of errors, as seen in the picture. That's especially strange because example1.nt does work as expected, and the only difference is that in example 2 the publisher is a foaf:Organization, not a foaf:Agent. I would have thought that including

would fix the problem, but unfortunately it does not.

Does anyone have an idea why this problem exists?

[Image]

bertvannuffelen commented 3 years ago

I retried the suggested case with example2.nt and the 2 SHACL template files without the imports and I get the following output.

[Image: Screenshot_2020-11-30 SHACL Validator]

That is an expected output for the example when including the mdr constraints.

bertvannuffelen commented 3 years ago

The reported errors shown in the issue are all the same: a concept scheme should have a title. That is a mandatory property for a Category Scheme according to DCAT-AP. The errors appear when the controlled vocabularies are imported into the SHACL validator. These imports are defined in https://raw.githubusercontent.com/SEMICeu/DCAT-AP/2.1.0-draft/releases/2.1.0/dcat-ap_2.1.0_shacl_mdr_imports.ttl.

As it turns out, most of these controlled vocabularies do not have a title for the concept scheme. Missing that property does not mean they are not viable controlled vocabularies. This is a typical situation when reusing information: sometimes the provided data does not satisfy all constraints out of the box. One solution is to create a dct:title value for each of the controlled vocabularies.

But note that this does not change the validity of a dataset provided by a publisher, as the violated rule is mostly a concern for catalogue/portal owners.
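To make the violation concrete, here is a rough sketch of the kind of constraint involved and of the supplementary triple that would satisfy it. This is an illustration under assumptions, not the exact shape from the DCAT-AP SHACL files: the `ex:` namespace and shape name are hypothetical, and the title value is only an example.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/shapes#> .   # hypothetical namespace

# Sketch of the constraint: every skos:ConceptScheme in the validated
# graph needs at least one dct:title.
ex:CategorySchemeShape
    a sh:NodeShape ;
    sh:targetClass skos:ConceptScheme ;
    sh:property [
        sh:path dct:title ;
        sh:minCount 1 ;
    ] .

# Supplementary data a portal owner could load next to the imported
# vocabulary so that the concept scheme passes the check.
<http://publications.europa.eu/resource/authority/data-theme>
    a skos:ConceptScheme ;
    dct:title "Data theme"@en .
```

Because the shape targets the class skos:ConceptScheme, it fires on every imported vocabulary that declares its scheme, which is why the errors only show up once the imports are added.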

giorgialodi commented 3 years ago

I am not able to follow as I wish the work for DCAT-AP; however, my attention was caught by this specific issue. Why does DCAT-AP impose as mandatory property dct:title for a class that is skos:ConceptScheme (in case this latter is being used)? Probably it is a silly question since I've not followed the works so far, as I wished but it sounds strange to me. I mean, typically controlled vocabularies are offered by EU vocabularies and they are considered external entities with respect to any dataset publisher and catalogue/portal owner. These vocabularies are dealt with using skos. So it is likely that they will have, for the ConceptScheme class, the skos:prefLabel property (as it is done for the skos:Concept) and not dct:title.

And if I am a catalogue provider I will use something like this:

myCatalogue a dcat:Catalog ; dcat:themeTaxonomy <http://publications.europa.eu/resource/authority/data-theme> .

and that's it. From my point of view it is not wise to oblige them to write something like this:

<http://publications.europa.eu/resource/authority/data-theme> a skos:ConceptScheme ; dct:title "Data theme"@en .

From my point of view the wiser thing to do is to re-consider the dct:title property for the ConceptScheme class in the DCAT-AP specifications and probably replace it with skos:prefLabel.

bertvannuffelen commented 3 years ago

@giorgialodi this constraint has multiple dimensions, and they are inherently connected with the (re)use of data.

As an example, take the theming of a dataset:

Perspective dataset owner: As dataset owner, I want to categorize my dataset with a theme. For that I want to refer to the identifier of the theme.

Perspective portal owner: As portal owner, I want to show a faceted view on my catalogue. The widget showing the categories has as name the title of the conceptscheme and as structure the tree view of the conceptscheme.

Perspective harvester: As harvester, I want to ensure that the categories associated with the datasets in the harvested catalogue belong to the concept scheme I have configured. In case there is a violation error, then I want to report to the harvested catalogue publisher the name of the conceptscheme in which context the error has happened.

Perspective codelist maintainer: As codelist maintainer, I want to maintain a codelist for categorizing datasets in Open Data Portals.

These 4 perspectives give an overview of the parties involved in this ecosystem. The parties are not in direct contact with each other, and their objective might not be creating a harmonized ecosystem for Open Data Portals, but something distinct. E.g. many of the codelists do not have open data portals as their sole target audience but serve a much broader one. Often codelist maintainers are not active participants in keeping the usage context coherent, but play a more supportive role. Some information is more usage specific and thus not part of the generic codelist maintenance.

The DCAT-AP specification states that a category/theme for a dataset is a code in a codelist. It imposes that it is modeled using SKOS. It does not, however, impose that the codelist is published and maintained according to all best practices of the semantic web. That means it is fine for a data portal owner and its publishers to agree on the labels in a CSV table and then share the theme as follows:

my:dataset1 dcat:theme [ 
   a skos:Concept ;
   skos:prefLabel "theme1"
] .

If the codelist is maintained and published with persistent identifiers, there are 2 options:

my:dataset1 dcat:theme codelist:theme1.
codelist:theme1  a skos:Concept ;
   skos:prefLabel "theme1".

or, more condensed:

my:dataset1 dcat:theme codelist:theme1.

The choice is not made by the specification, but by the implementation agreement between data portal owner and dataset owner. The specification only encourages the use of codelists that are published according to the best practices. In the first place the specification is about agreeing on the same semantic foundation. Implementations are free in how they implement these agreements.

But if they introduce into the data information that is covered by the DCAT-AP specification, they have to do so in conformance with the specification. Accordingly, it is not allowed to share theme information as plain literals.

Returning to the validation error that is being reported: I believe in the past the WG decided to enforce a title on a conceptscheme to ensure a minimal amount of information about the conceptscheme is available. Other choices could probably be made as well.

As the example also shows, it is for the implementer to choose which validation constraints are needed for conformance. The SHACL templates in this repository cover the broad setting.

stigbd commented 3 years ago

I am not able to follow as I wish the work for DCAT-AP; however, my attention was caught by this specific issue. Why does DCAT-AP impose as mandatory property dct:title for a class that is skos:ConceptScheme (in case this latter is being used)? ... From my point of view the wiser thing to do is to re-consider the dct:title property for the ConceptScheme class in the DCAT-AP specifications and probably replace it with skos:prefLabel.

In the Norwegian Data Portal we are also running into this issue. As our own dcat-ap-no is mandating use of various EU-vocabularies, our validator throws a violation on the lack of dct:title in the EU-vocabularies. E.g. http://publications.europa.eu/resource/authority/licence which uses skos:prefLabel, and not dct:title.

We also find it more intuitively correct, and in accordance with the practice in many EU-vocabularies, that skos:prefLabel is being used instead of dct:title.

bertvannuffelen commented 3 years ago

Created a specific request on this in #192

bertvannuffelen commented 3 years ago

@stigbd do you check all controlled vocabularies, or do you only check the category (dcat:theme) NAL?

stigbd commented 3 years ago

Yes, we do check all controlled vocabularies, as far as they are mandated by the DCAT-AP. Other examples are: