SEMICeu / MLDCAT-AP

Property Shapes duplicates in SHACL Shapes in 2.0 release #19

Open amivanoff opened 1 month ago

amivanoff commented 1 month ago

Property shapes with the same shacl:path but different generated IRIs are repeated twice, and sometimes even 3-4 times, in the Turtle spec. The JSON-LD is affected too.

For example, several property shapes are repeated just for the CatalogShape class shape.

Just one concrete example for the foaf:homepage property shape (lang tag stripped):

<#CatalogShape/8d36a62f83db7f94097e27edb51306e17e0d40f3> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.homepage";
  shacl:description "A web page that acts as the main page for the Catalogue.";
  shacl:name "homepage";
  shacl:nodeKind shacl:BlankNodeOrIRI;
  shacl:path foaf:homepage .

<#CatalogShape/da28472666d298998330cb159b2c1e90b4446250> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.homepage";
  shacl:description "A web page that acts as the main page for the Catalogue.";
  shacl:maxCount 1;
  shacl:name "homepage";
  shacl:path foaf:homepage .

<#CatalogShape/fb8568b313de736f9184db23272b6317700e9e7e> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.homepage";
  shacl:class foaf:Document;
  shacl:description "A web page that acts as the main page for the Catalogue.";
  shacl:name "homepage";
  shacl:path foaf:homepage .

Sometimes a property shape variant is missing the cardinality restriction. Sometimes the variants differ in having shacl:nodeKind shacl:BlankNodeOrIRI or shacl:class.

If variability in value restrictions is needed (BlankNodeOrIRI or a concrete class), the correct way is to use sh:or, I think.
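A minimal sketch of the sh:or idea for the foaf:homepage shapes above, assuming the intent really is "either a foaf:Document or any IRI/blank node". The `ex:` namespace and the shape IRI are hypothetical; the `shacl:` prefix follows the spelling used in the MLDCAT-AP file.

```turtle
@prefix shacl: <http://www.w3.org/ns/shacl#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix ex:    <http://example.org/shapes#> .   # hypothetical namespace

# One property shape per path, with a stable, readable IRI (assumed naming).
ex:CatalogShape-homepage
  a shacl:PropertyShape ;
  shacl:path foaf:homepage ;
  shacl:name "homepage"@en ;
  shacl:maxCount 1 ;
  # Alternative value restrictions are expressed as a disjunction
  # instead of being spread over separate property shapes.
  shacl:or (
    [ shacl:class foaf:Document ]
    [ shacl:nodeKind shacl:BlankNodeOrIRI ]
  ) .
```

Note that the three generated shapes apply conjunctively when attached to the same class shape, so this sketch changes the semantics if both restrictions were meant to hold at once.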

If the property shapes' IRIs weren't random, this would be a minor problem 😊 But as it is, it looks like an error. And an "adoption blocker" one.

amivanoff commented 1 month ago

Well, the validator accepts the spec and validates something. And DCAT-AP 3.0 has the same repetitions in https://github.com/SEMICeu/DCAT-AP/blob/master/releases/3.0.0/shacl/dcat-ap-SHACL.ttl and no one cares. But there are some disadvantages:

  1. It is still very human-unfriendly. The dcterms:publisher property requires 4 property shapes, one for each constraint, in different parts of the spec file. Each property shape requires a unique IRI and 5 lines of text, mostly repetition:
    • 5 lines for shacl:minCount
    • 5 lines for shacl:maxCount
    • another 5 for shacl:BlankNodeOrIRI
    • and another 5 for foaf:Agent
  2. From the validator's point of view, it looks like shacl:class foaf:Agent subsumes/overrides shacl:nodeKind shacl:BlankNodeOrIRI.
  3. It also hurts any attempt to build UIs based on shapes (sh:order and other metadata).

EmidioStani commented 1 month ago

The shapes are generated automatically by a publication system that SEMIC uses. This needs to be checked overall, but I will keep an eye on it to see if it improves. Thanks for reporting.

The fact that a property is split across multiple property shapes is by design, to be modular.

Thanks again, any feedback on the model?

amivanoff commented 1 month ago

What kind of feedback are you interested in, and what type of issues are you willing to address at this lifecycle stage of the spec? The minor ones (e.g. collisions in the spec like the ones reported previously by @VladimirAlexiev)? Or the bigger ones (missing properties/classes)?

At the first stage of the spec consumption we are dealing with a bunch of minor issues (i.e. the first type of issues).

For example, another issue is "Different prefixes for the same dc/terms namespace in DCAT 3 and in MLDCAT-AP profile, also conflicting with the established common practice".

# in DCAT 3 ontology
@prefix dcterms: <http://purl.org/dc/terms/> .
# in MLDCAT-AP SHACL
@prefix dc: <http://purl.org/dc/terms/> .

The issue looks similar to this issue https://github.com/SEMICeu/MLDCAT-AP/issues/7 by @VladimirAlexiev.

According to the prefix lookup service https://prefix.cc

To summarize:

The "dc" prefix should be changed to "dcterms" in MLDCAT-AP for the following reasons:

The better (but it seems way more disruptive) way could be to switch the overall DCAT stack (DCAT ontology, all AP profiles) from "dc" and "dcterms" to "dct" prefix.

Should we report this kind of issue here? Or is it better to raise only the bigger ones here (missing properties/classes)?

amivanoff commented 1 month ago

We "un-modularized" the property shapes, made the SHACL shapes more "human-friendly" (at least we hope so :) ) and fixed some of our issues/struggles in our fork repository https://github.com/agentlab/MLDCAT-AP, while trying to stay as compatible as possible with the spec. Maybe it could be helpful to someone with similar goals.

bertvannuffelen commented 1 month ago

@amivanoff your comment and your resolution show one of the main challenges for SHACL artefacts. Your objective is to have a human-manageable, and somewhat human-readable, formulation of the constraints expressed in that file.

The formulation you created has a number of considerations:

  1. It does not allow multilingual texts and advice to be associated with the constraints (due to a limitation of SHACL, see: https://github.com/w3c/data-shapes/issues/158).
  2. There are many cases, even in your suggested approach, where splitting is required, for instance if the level of severity differs (e.g. the minimum cardinality is mandatory, the maximum cardinality is recommended).
  3. Human readability is to a certain extent a matter of taste and of how one likes to use the content. (It is like a writing style, for instance your prefix choice: technically any choice is fine; prefixes are just a local abbreviation table.)
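Point 2 can be illustrated with a small sketch. In SHACL, sh:severity applies to a shape as a whole, so two constraints on the same path that need different severities end up in two property shapes. The `ex:` namespace and shape names below are hypothetical, and the choice of which constraint is a warning is only an example.

```turtle
@prefix shacl:   <http://www.w3.org/ns/shacl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/shapes#> .   # hypothetical namespace

# Missing publisher: reported as a violation (the SHACL default severity).
ex:CatalogShape-publisher-min
  a shacl:PropertyShape ;
  shacl:path dcterms:publisher ;
  shacl:minCount 1 ;
  shacl:severity shacl:Violation .

# More than one publisher: reported only as a warning.
ex:CatalogShape-publisher-max
  a shacl:PropertyShape ;
  shacl:path dcterms:publisher ;
  shacl:maxCount 1 ;
  shacl:severity shacl:Warning .
```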

But the most challenging aspect is maintenance and completeness. Our generators can generate a variant of your suggestion, but because of 1 we switched. For validation (using the file as-is in a SHACL engine), the choice between the condensed and the split version has no impact.

DCAT-AP devotes a whole section to validation (https://semiceu.github.io/DCAT-AP/releases/3.0.0/#validation-of-dcat-ap). You will see that a human-managed collection of shapes is added there. Those are manually maintained because they target, as explained there, various validation situations. In principle, a large part of those could be done by referring to the generated ones (a first approach to that can be found at https://semiceu.github.io/DCAT-AP/releases/3.0.0-hvd/#validation).

This brings us to the main advantage for the DCAT-AP ecosystem: a collection of named individual constraints allows requirements (and in this case the SHACL formulation of each requirement) to be interlinked. The DCAT-AP profile of the Swedish Geocatalogue can refer directly to a requirement in the SHACL. That makes comparisons easier and decisions more transparent. Towards interlinked specifications.

Of course improvements can be made; some of your and others' comments on the SHACL indicate issues, and those will be resolved over time. But in this ecosystem, where the same data is used against different, overlapping requirements, we can benefit from the power of linked data to offer additional services.

As a last note: the SHACL of the SEMIC specifications is a consequence of what is written in the HTML. Not vice versa. It only reflects the constraints that easily can be written in SHACL.

I hope this answer provides you some insights on the why of the taken approach.

amivanoff commented 1 month ago

Well, today different shape representations with different goals in mind are technically allowed, and even advised by some people. Sometime in the future the "brave new world" of RDF-Star (or RDF 1.2) will unite us all (with the possibility to add anything to a triple)... If we live long enough ))) In the meantime, because we rely heavily on property shape completeness in our dynamic Web UI generation and dynamic shapes-based SPARQL query generation, we will start with non-modular shapes and test the "normalized" version later on.

VladimirAlexiev commented 1 month ago

@bertvannuffelen your reason 1 is false: the examples given in the description do NOT describe the individual error. They all describe the field, and are all the same.

Two more defects:

  • seeAlso should be a URL
  • the shapes better have rdf:type (though the spec says it's optional)

amivanoff commented 1 month ago

@bertvannuffelen your reason 1 is false: the examples given in the description do NOT describe the individual error. They all describe the field, and are all the same.

@VladimirAlexiev, if you divide a "one-propertyshape-for-one-property" into several "smaller-propertyshapes" with only one constraint in each, then the shacl:message in each of them will be the "individual error message", I think.

I am seeing such an approach to SHACL validation for the first time in years. But it's working 😊

Example:

1. One "one-propertyshape-for-one-property" shape for the dc:publisher property of the dcat:Catalog class. The error message here is for the whole property (field).

<#CatalogShape/publisher> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher";
  shacl:name "publisher"@en;
  shacl:description "An entity (organisation) responsible for making the Catalogue available."@en;
  shacl:path dc:publisher;
  shacl:nodeKind shacl:BlankNodeOrIRI; #not working here
  shacl:class foaf:Agent;
  shacl:minCount 1;
  shacl:maxCount 1;
  shacl:message "All publisher's constraints are wrong"@en .

2. Several "smaller-propertyshapes" for the dc:publisher property of the dcat:Catalog class. The error messages here are for the individual constraints of the property (field).

<#CatalogShape/93f73e69bb03d2928fcf758a253ef316becdf9b9> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher";
  shacl:name "publisher"@en;
  shacl:description "An entity (organisation) responsible for making the Catalogue available."@en;
  shacl:path dc:publisher;
  shacl:nodeKind shacl:BlankNodeOrIRI; #will work only if you disable the shacl:class rule b3ec0655204c62a2531244aaeab12f1a2c5e5b5d
  shacl:message "Only publisher's nodeKind constraint is wrong"@en .

<#CatalogShape/b3ec0655204c62a2531244aaeab12f1a2c5e5b5d> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher";
  shacl:name "publisher";
  shacl:description "An entity (organisation) responsible for making the Catalogue available."@en;
  shacl:path dc:publisher;
  shacl:class foaf:Agent;
  shacl:message "Only publisher's class constraint is wrong"@en .

<https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#CatalogShape/a0ccdf3bd7f5d161d07f375a26e68c18ca91dc19> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher";
  shacl:name "publisher"@en;
  shacl:description "An entity (organisation) responsible for making the Catalogue available."@en;
  shacl:path dc:publisher;
  shacl:minCount 1;
  shacl:message "Only publisher's minCount constraint is wrong"@en .

<#CatalogShape/67dcdb36167ca7969c0532898e11a98e9c2a80f5> rdfs:seeAlso "https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher";
  shacl:name "publisher"@en;
  shacl:description "An entity (organisation) responsible for making the Catalogue available."@en;
  shacl:path dc:publisher;
  shacl:maxCount 1;
  shacl:message "Only publisher's maxCount constraint is wrong"@en .

Two more defects:

  • seeAlso should be a URL
  • the shapes better have rdf:type (though the spec says it's optional)

Yes, seeAlso as a string is definitely a bug. rdf:type shacl:PropertyShape is desirable as "good style".
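A sketch of what both fixes could look like on the minCount shape from the example above (only the shape IRI, the seeAlso target, the path and the constraint are taken from the thread; the layout is an illustration, not generator output):

```turtle
@prefix shacl: <http://www.w3.org/ns/shacl#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:    <http://purl.org/dc/terms/> .

<https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#CatalogShape/a0ccdf3bd7f5d161d07f375a26e68c18ca91dc19>
  # Explicit type: optional in SHACL, but it makes the shape easy to find by query.
  a shacl:PropertyShape ;
  # An IRI in angle brackets instead of a string literal in quotes.
  rdfs:seeAlso <https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#Catalogue.publisher> ;
  shacl:path dc:publisher ;
  shacl:minCount 1 .
```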

amivanoff commented 1 month ago

I am seeing such an approach to SHACL validation for the first time in years. But it's working 😊

To me it looks like "technology abusing" but yeah, "it's working"...

And I cannot see any other way to provide granular, internationalizable error messages for each constraint of a property shape in every EU language... Besides deep internationalization of the internal mechanics of the Jena SHACL validator (or RDF4J, or any other open source SHACL validator).

But I presume Jena's maintainers would not be happy to make Jena speak another 23 languages besides English.

I do not know what @HolgerKnublauch thinks about this "language attack" on shapes...

amivanoff commented 1 month ago

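It seems the SEMIC guys take localization/internationalization of validator error reports VERY seriously. They want to provide as much of the validator report as possible to the user (i.e. an integration specialist?) in a local language. But they also don't want any compromise on error details. So they want both error messages:

  • the detailed validator error message on a particular constraint in English (because the validator will never speak Portuguese), authored by the validator engine developers; and
  • the localized Portuguese (and other languages) error message on a particular constraint from this "one constraint property shape".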

amivanoff commented 1 month ago

Maybe it is better to call it "a bunch of property constraints", not a "property shape". Because in this case the "property shape" is not specified explicitly in the spec in its complete form. It is not reifiable/addressable (no IRI). The property shape is constructed by the validator at runtime as a conjunction of a class shape and a shacl:path.

HolgerKnublauch commented 1 month ago

In the current SHACL version it is indeed required to define a separate shape whenever you want to specify a different message. With the upcoming 1.2 I hope we can generalize this so that reification can be used to attach message (and severity and possibly more) to each constraint triple. That should help here.

VladimirAlexiev commented 1 month ago

@amivanoff

I am seeing such an approach to SHACL validation for the first time in years. To me it looks like "technology abusing". But it's working 😊

This is not abuse. This is a way to make individual checks more atomic, thus easier to generate.

the detailed validator error message on a particular constraint in English (because the validator will never speak Portuguese), authored by the validator engine developers; and the localized Portuguese (and other languages) error message on a particular constraint from this "one constraint property shape"

The spec says that:

@amivanoff Do you see any spec change required for multilingual translations?

required to define a separate shape whenever you want to specify a different message

Yes, but only if they say different things. Multiple shapes are not needed to accommodate multiple translations.
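As a minimal sketch of that point: SHACL allows several sh:message values on one shape as long as their language tags differ, so translations can sit side by side. The `ex:` namespace, the shape name and the English wording are hypothetical; the Dutch string is the one quoted later in this thread.

```turtle
@prefix shacl: <http://www.w3.org/ns/shacl#> .
@prefix dc:    <http://purl.org/dc/terms/> .
@prefix ex:    <http://example.org/shapes#> .   # hypothetical namespace

# One constraint, one shape, one message per language.
ex:CatalogShape-publisher-max
  a shacl:PropertyShape ;
  shacl:path dc:publisher ;
  shacl:maxCount 1 ;
  shacl:message "At most 1 value"@en ,
                "Maximaal 1 waarde"@nl .
```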

amivanoff commented 1 month ago

@VladimirAlexiev, yes, one shape per constraint, and this shape contains multiple translations for the message property.

This issue enables a capability to preserve property shape reification (if authors are willing to).

But one aspect still remains unhandled in this issue.

It is related to the "message substitution" semantics. In the "message substitution/redefinition" example from the issue above, only one of two cases is possible:

  1. "Property may only have 1 value, but found 2" -- from the validator, if there is no message substitution
  2. "Maximaal 1 waarde"@nl -- with message substitution.

In case 2, we lose the detailed information from the validator, "but found 2" (i.e., how many constraint violations for this object/property shape have been found). So with "error message substitution" we can translate only general messages, which do not take the specific data situation into account.
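The two cases can be sketched as validation results. Per the SHACL specification, when a shape has sh:message values, those become the sh:resultMessage of its results; otherwise the engine supplies its own text. The `ex:` names (focus node and both shapes) are hypothetical, and the two results are alternatives shown side by side rather than one real report.

```turtle
@prefix shacl: <http://www.w3.org/ns/shacl#> .
@prefix dc:    <http://purl.org/dc/terms/> .
@prefix ex:    <http://example.org/> .   # hypothetical namespace

# Case 1: the source shape has no shacl:message, so the engine writes its own text.
[] a shacl:ValidationResult ;
  shacl:resultSeverity shacl:Violation ;
  shacl:focusNode ex:catalog1 ;
  shacl:resultPath dc:publisher ;
  shacl:sourceConstraintComponent shacl:MaxCountConstraintComponent ;
  shacl:sourceShape ex:publisherMaxWithoutMessage ;
  shacl:resultMessage "Property may only have 1 value, but found 2" .

# Case 2: the source shape declares shacl:message "Maximaal 1 waarde"@nl,
# which replaces the engine's text, including the "but found 2" detail.
[] a shacl:ValidationResult ;
  shacl:resultSeverity shacl:Violation ;
  shacl:focusNode ex:catalog1 ;
  shacl:resultPath dc:publisher ;
  shacl:sourceConstraintComponent shacl:MaxCountConstraintComponent ;
  shacl:sourceShape ex:publisherMaxWithMessage ;
  shacl:resultMessage "Maximaal 1 waarde"@nl .
```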

In the DCAT-AP SHACL profile colleagues tried to dump sh:message altogether and use a custom unresolvable https://purl.eu/ns/shacl#message to save both: the message from a validator and the translated message from shape.

In the released DCAT 3.0 version they did not use any messages at all.

All of this raises the question of the usefulness of sh:message in general. With sh:message we can translate "general advice" only, at the cost of the more detailed error message from the validator. We cannot have both messages (the detailed message from the validator in English and the "general advice" message translated into another language). And we cannot have the detailed validator message translated.

I could not grasp whether this issue could help with all of the above.

bertvannuffelen commented 1 month ago

@amivanoff

I am seeing such an approach to SHACL validation for the first time in years. To me it looks like "technology abusing". But it's working 😊

This is not abuse. This is a way to make individual checks more atomic, thus easier to generate.

exactly, but also

  • multiple profile management becomes simpler: one can point to one individual constraint rather than to a collection of constraints.
  • cross-referencing can be made more precise: e.g. the seeAlso can be for each constraint pointing to the appropriate location.
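A rough sketch of the first bullet, assuming a derived profile that wants only the "publisher is mandatory" requirement. The `ex:` profile namespace and node shape are hypothetical; the referenced property shape IRI is the minCount shape quoted earlier in this thread.

```turtle
@prefix shacl: <http://www.w3.org/ns/shacl#> .
@prefix dcat:  <http://www.w3.org/ns/dcat#> .
@prefix ex:    <http://example.org/profile#> .   # hypothetical profile namespace

ex:CatalogShape
  a shacl:NodeShape ;
  shacl:targetClass dcat:Catalog ;
  # Reuse a single named constraint by its IRI, without taking over
  # the other publisher constraints (maxCount, class, nodeKind).
  shacl:property <https://semiceu.github.io/MLDCAT-AP/releases/2.0.0#CatalogShape/a0ccdf3bd7f5d161d07f375a26e68c18ca91dc19> .
```

For this to validate anything, the triples of the referenced shape still have to be present in the shapes graph, for example by loading the MLDCAT-AP SHACL file alongside the profile.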

All these have to do with the use case of designing a business UI for a validation service, where the result guides the user to the most important issues to resolve. Today validators like https://www.itb.ec.europa.eu/shacl/dcat-ap/upload produce a technical table: Error - message - relatedValue. And then the hunt is on. One has to be an RDF expert to find the source (which is in most cases trivial for an RDF expert), but the resolution is harder.

To illustrate the above, the following 3 values are licences found in an open data portal (values of dct:license). The first is acceptable, but the second, and probably also the third, is not.

http://dcat-ap.de/def/licenses/cc-by/4.0 
"N06abcab9a78347dca72ba692979c3cdc" 
http://dcat-ap.de/def/licenses/CC%20BY%204.0 

Being able to cross-reference to https://semiceu.github.io/DCAT-AP/releases/3.0.0-hvd/#c3 in the case of HVD compliance is a valuable motivation for publishers to get rid of at least the second, and likely also the third. That is different from "the validator does not like it". With such a cross-reference the RDF expert can more easily motivate the dataset owner (some publisher in some agency) to adapt its source metadata.

In our context, SHACL is also a means of providing a service to staff without RDF expertise.