identifiers - Githubissues

bertvannuffelen commented 2 years ago

This is a broad issue to capture questions and opinions on identifiers. During the webinar of 10 March 2022, the WG discussed the role of dct:identifier and adms:identifier in identifying datasets throughout the harvesting of catalogues.

To streamline the discussion, the WG agreed that dct:identifier is the identifier assigned by the "owner/first publisher" of the dataset. This removes an ambiguity in the definition of dct:identifier, which could also be interpreted as the identifier assigned by the catalogue the dataset is currently part of.

This issue collects community feedback on this topic. We will also provide a coherent proposal based on the WG discussion that has taken place.

bertvannuffelen commented 2 years ago

Dear community,

a proposal for the guidelines, open for comments, can be found at: https://github.com/SEMICeu/DCAT-AP/blob/2.x.y-draft/releases/2.x.y/usageguide-identifiers.md

As no agreement on the status of this proposal was reached during the last webinar, it has been shifted to a future release. This is also a renewed invitation to comment on the proposal.

jakubklimek commented 1 year ago

The Czech data catalog implements what the guidelines say to avoid: it mints an IRI for a harvested dataset regardless of its original IRI. If there was an original IRI, it is preserved in dct:identifier.

This is not to argue that the approach is correct, but I would like to take this opportunity to mention the arguments that led us to this implementation and that I did not find in the guidelines.

Guaranteed dereferenceability of the IRIs - The source catalog assigns IRIs to datasets, but does not implement their dereferenceability, or the dereferenceability of other IRIs such as distributions, data services, etc. The national catalog does implement it, but only for IRIs in its own domain.
Security (Trustworthiness of the registered catalogs) - By assigning new (publisher-scoped) IRIs and processing the metadata instead of taking it unaltered when harvesting the datasets, we can avoid one publisher stating (intentionally, or by mistake) something about a dataset of another publisher without their knowledge, which could affect query results on the single National Open Data Catalog SPARQL endpoint. Admittedly, this goes against the open-world assumption, but in the context of a public administration system, this is something we want to avoid rather than encourage.
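
To make the approach above concrete, here is a minimal sketch of minting a publisher-scoped IRI at harvest time while preserving the original IRI in dct:identifier. It is only an illustration, not the Czech catalog's actual code: the namespace `BASE`, the function `mint_dataset`, the plain-dict record layout, and the hash-based local key are all assumptions made for the example.

```python
import hashlib

# Hypothetical namespace of the harvesting (national) catalogue.
BASE = "https://catalog.example/dataset/"


def mint_dataset(publisher_id: str, source: dict) -> dict:
    """Re-mint a harvested dataset under a publisher-scoped IRI.

    `source` is an assumed, simplified record with a "title" and an
    optional "iri" assigned by the source catalogue.
    """
    original_iri = source.get("iri")
    # Derive a stable local key so repeated harvests yield the same IRI.
    seed = original_iri or source["title"]
    key = hashlib.sha256(f"{publisher_id}|{seed}".encode("utf-8")).hexdigest()[:16]
    record = {
        # The publisher id is part of the IRI, so one publisher cannot
        # produce statements about another publisher's dataset IRI.
        "iri": f"{BASE}{publisher_id}/{key}",
        "dct:title": source["title"],
    }
    if original_iri is not None:
        # Preserve the original IRI, if there was one.
        record["dct:identifier"] = original_iri
    return record
```

Because the key is derived from the publisher and the original IRI, two publishers that claim the same source IRI end up with distinct IRIs in the harvesting catalogue, and only the minted IRIs need to be dereferenceable there.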

bertvannuffelen commented 1 year ago

@jakubklimek, I understand the arguments.

And exactly because of these experiences, the guidelines propose that harvesters and portal owners ensure that all identifiers are included in adms:identifier. If every portal did that, a list of equivalent identifiers would be built up dynamically. This in turn offers the potential to implement deduplication algorithms, trusted cross-references throughout the harvesting network, and so on.

It does not affect the user experience of any portal, nor any publisher (it only provides technical support to the harvesting community), but the potential is high.
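
As an illustration of the deduplication potential, the following sketch groups catalogue records that share at least one identifier, using a small union-find structure. It rests on simplifying assumptions: records are plain dicts, adms:identifier is flattened to a list of strings (in DCAT-AP it points to adms:Identifier resources), and the function name `group_equivalent` and all example IRIs are hypothetical.

```python
from collections import defaultdict


def group_equivalent(records):
    """Group record IRIs whose identifiers overlap, directly or transitively."""
    parent = {}

    def find(x):
        # Find the representative of x, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for rec in records:
        # All identifiers carried by one record are treated as equivalent.
        ids = [rec["dct:identifier"], *rec.get("adms:identifier", [])]
        find(ids[0])
        for other in ids[1:]:
            union(ids[0], other)

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec["dct:identifier"])].append(rec["iri"])
    return sorted(sorted(g) for g in groups.values())


records = [
    {"iri": "https://portal-a.example/d/1", "dct:identifier": "urn:x:1"},
    {
        "iri": "https://portal-b.example/d/9",
        "dct:identifier": "urn:x:1",
        "adms:identifier": ["https://portal-a.example/d/1"],
    },
    {"iri": "https://portal-c.example/d/5", "dct:identifier": "urn:x:2"},
]
duplicates = group_equivalent(records)
```

Here the records from portal A and portal B land in the same group because they carry the same dct:identifier, while the portal C record stays on its own. The more identifiers each portal records in adms:identifier, the more such links a harvester can find.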

bertvannuffelen commented 5 months ago

This issue will be closed, as a reference to the assessment/proposal is now in the specification. The assessment/proposal has not been included in full, but this way readers of the specification can find it more easily and take the considerations into account in their implementations.

SEMICeu / DCAT-AP

identifiers #223