SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

Identifier cardinality #187

Closed pebran closed 4 weeks ago

pebran commented 3 years ago

On the class dcat:Dataset, the property dct:identifier has cardinality 0-* and the usage note: “This property contains the main identifier for the Dataset, e.g. the URI or other unique identifier in the context of the Catalogue”. The phrase ‘the main identifier’ seems to indicate that there can be only one such identifier. I would recommend that either the text is changed to reflect the given cardinality, or the cardinality is changed to reflect the given text.

bertvannuffelen commented 3 years ago

@pebran

Is this a concern about stricter coherency between the usage note text and the formal representation, or is this based on needs you experience in practice?

I recently had an interesting discussion, in the context of harvesting, on the need to enforce a single identifier assigned by the real owner of the dataset/dataservice. To avoid duplicates in the aggregated catalogue that results from a harvesting process, identifier management needs to be enforced. In that context, the conclusion was to enforce a single identifier, supplied by the agent that is closest to the actual management of the described entity. All other identifiers would be registered additionally in the other identifier property.

Observe that in an RDF context the discussion on the identifier property is even more complicated: should the URI used in the RDF structure be the main identifier, or can it be a local identifier in the RDF graph? In a perfect world both coincide.

pebran commented 3 years ago

@pebran is this a concern about stricter coherency between the usage note text and the formal representation [...]?

Yes, it is just that. If we describe dct:identifier as "the main identifier", some users interpret this as saying that the cardinality should not allow more than one. And if we do allow more than one instance of dct:identifier for each Dataset, then the text should not state that the property contains the main identifier.

jakubklimek commented 2 years ago

I do not see the advantage in enforcing a single identifier here. For disambiguation purposes, sets of identifiers can be used just as well. One dataset is typically registered in multiple data catalogs, where in each catalog, it gets a different IRI assigned in the RDF representation, and it is not so easy to say which one is the main one and which one is the other one.

Example from Czechia - publishers from public administration publish datasets, and register them in the National Open Data Catalog, which is DCAT-AP compatible and assigns its own IRI to each registered dataset. When this dataset comes from a local, perhaps CKAN-based data catalog, the original dataset record has a pure string (non-IRI) identifier, local to the CKAN instance. Now, which one should be the main one? The original one is not globally unique, as it is not an IRI. This could cause incorrect disambiguation when this identifier clashes with another one. Therefore, I would say that the IRI given by the National catalog should be the main identifier. But the publisher might see it differently.

By saying that we have a set of known identifiers, this could be resolved quite nicely.
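A minimal sketch of how such set-based disambiguation could work in a harvester, assuming each harvested record is reduced to the set of identifiers known for it. The function name `group_by_shared_identifier` and the example identifiers are hypothetical and not part of DCAT-AP.

```python
def group_by_shared_identifier(records):
    """Group records whose identifier sets overlap, directly or transitively.

    `records` is an iterable of identifier sets, one set per harvested record.
    Returns a list of merged identifier sets, one per distinct dataset.
    """
    groups = []
    for identifiers in records:
        merged = set(identifiers)
        remaining = []
        for group in groups:
            if group.isdisjoint(merged):
                # No identifier in common: keep this group as it is.
                remaining.append(group)
            else:
                # At least one shared identifier: same dataset, so merge.
                merged |= group
        remaining.append(merged)
        groups = remaining
    return groups


# Hypothetical records: the first two share the local identifier "abc-123".
records = [
    {"https://national.example/dataset/42", "abc-123"},
    {"abc-123"},
    {"https://national.example/dataset/43"},
]
groups = group_by_shared_identifier(records)
```

Note that this only works as well as the identifiers themselves: a non-IRI local identifier that clashes between two publishers would cause two different datasets to be merged incorrectly.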

bertvannuffelen commented 2 years ago

@jakubklimek in Belgium this is also being discussed. There is no problem if publishers always use strong identifiers that are environment-agnostic (the same id in a city portal, a national portal and a thematic portal). The problem is that local names are often used as identifiers ("dataset-1"); in those cases the likelihood of clashes between two publishers when harvesting is real. To disambiguate those cases, the context of the catalog must be taken into account. One solution is then to replace the identifier with a prefixed version of the local name that includes the namespace of the catalog. Often this disambiguation is already done when creating an RDF representation, because a URI needs to be created for the dataset as a technical identifier. In the Semantic Web/Linked Data we go even one step further, namely that this technical identifier and the prime identifier coincide, and thus there is no need for dct:identifier.

So unless we impose that identifiers are URIs, considerations of this kind will continue to exist.
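The prefixing approach could be sketched as follows. This is only an illustration: the function name `qualify_identifier`, the scheme check and the example namespaces are assumptions, not something DCAT-AP prescribes.

```python
from urllib.parse import quote


def qualify_identifier(identifier, catalog_namespace):
    """Return a catalog-scoped identifier for a local name.

    Identifiers that already look like IRIs are returned unchanged;
    local names are percent-encoded and appended to the catalog namespace.
    """
    if identifier.startswith(("http://", "https://", "urn:")):
        return identifier
    return catalog_namespace.rstrip("/") + "/" + quote(identifier, safe="")


# Two publishers using the same local name no longer clash after prefixing.
a = qualify_identifier("dataset-1", "https://city-a.example/dataset/")
b = qualify_identifier("dataset-1", "https://city-b.example/dataset/")
```

Here `a` and `b` differ even though both publishers used "dataset-1", while an identifier that is already an IRI passes through untouched.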

A third challenge is that UIs of Portals tend to introduce another identifier: the id that is used to identify the view in the portal. To illustrate the issue, these 4 portal identifiers point to the same dataset:

The question is how portals should highlight the publisher-assigned identifier (5c52b299-8f05-4d35-9839-a42934f1e619) so that visitors to all these portals see that this is about the same dataset.

So I think that with this request @pebran also had the above case in mind: there should be only one publisher-assigned identifier for a dataset, and it is carried in dct:identifier.

init-dcat-ap-de commented 2 years ago

We think the cardinality should be max 1. Other/secondary/new identifiers should be added as adms:identifier.
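As an illustration of what a max-1 rule would mean for validation, here is a rough sketch that checks a dataset record held as a plain dictionary from property IRI to a list of values. The record layout and the function name `check_identifier_cardinality` are assumptions made for this example; in real data adms:identifier points to an adms:Identifier resource rather than a plain string.

```python
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"
ADMS_IDENTIFIER = "http://www.w3.org/ns/adms#identifier"


def check_identifier_cardinality(dataset):
    """Return a list of violation messages for one dataset record."""
    values = dataset.get(DCT_IDENTIFIER, [])
    if len(values) > 1:
        return [f"expected at most one dct:identifier, found {len(values)}"]
    return []


# Hypothetical record that follows the proposed rule.
ok = {
    DCT_IDENTIFIER: ["5c52b299-8f05-4d35-9839-a42934f1e619"],
    ADMS_IDENTIFIER: ["dataset-1", "https://portal.example/view/123"],
}
# Hypothetical record that violates it.
bad = {DCT_IDENTIFIER: ["dataset-1", "https://portal.example/view/123"]}

ok_violations = check_identifier_cardinality(ok)
bad_violations = check_identifier_cardinality(bad)
```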

bertvannuffelen commented 2 years ago

During the WG meeting of 21 Oct 2021, the WG decided to address this issue in a broader discussion on expectations for identifiers. So for now, no changes will result from this issue.

bertvannuffelen commented 4 weeks ago

We propose to close this issue, as in release 3.0.0 the section on Identifiers refers to the outcome of a community discussion on identifiers.

If new use cases, expectations or harmonisation proposals arise, we propose to open a new issue and start the discussion from that section.