Informatievlaanderen / OSLOthema-DCATAPVlaanderen


Dataset requires dct:identifier #15

Open sandervd opened 2 years ago

sandervd commented 2 years ago

The rule in the SHACL shape https://data.vlaanderen.be/shacl/DCAT-AP-VL#DatasetShape/972d73e7a13100b66c0c2f44466edac47aa1ab28 mandates that a dataset specifies a dct:identifier. This, however, encourages bad habits, since in RDF the subject should already be the persistent identifier of the object.

This should be an optional property.

bertvannuffelen commented 2 years ago

@sandervd imposing it is a conscious choice. Not all DCAT data catalogue descriptions have their master data in RDF. And even when the master data is RDF, valid RDF can use blank nodes for the datasets.
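To make the blank-node case concrete, here is a minimal Python sketch of what a "dataset must have a dct:identifier" check amounts to. It is a simplified approximation of a minimum-count rule, not the actual SHACL shape; the triples, the blank node labels and the `example.org` identifier are all hypothetical.

```python
# Simplified stand-in for a "min count 1 on dct:identifier" rule.
# Triples are plain (subject, predicate, object) tuples; all data is made up.
DCT_IDENTIFIER = "http://purl.org/dc/terms/identifier"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"


def datasets_missing_identifier(triples):
    """Return the dataset subjects that carry no dct:identifier."""
    datasets = {s for s, p, o in triples if p == RDF_TYPE and o == DCAT_DATASET}
    identified = {s for s, p, o in triples if p == DCT_IDENTIFIER}
    return sorted(datasets - identified)


# Two datasets described with blank nodes: valid RDF, but the subject
# labels (_:b0, _:b1) are not stable across serializations.
triples = [
    ("_:b0", RDF_TYPE, DCAT_DATASET),
    ("_:b0", DCT_IDENTIFIER, "https://example.org/id/dataset/1"),
    ("_:b1", RDF_TYPE, DCAT_DATASET),
]

missing = datasets_missing_identifier(triples)
print(missing)
```

In this sketch only `_:b0` can be recognised again after a round trip through another catalogue, because its identifier travels with the description while the blank node label does not.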

By imposing an identifier, preferably a dereferenceable URI, harvesting can do its work safely. The challenge around identifiers is also described at greater length in https://github.com/SEMICeu/DCAT-AP/issues/223.

Since the portals have not agreed to use the URI of a DCAT dataset as the identifier of the dataset, the network of catalogues actually inserts copies into the network. E.g. data.europa.eu creates a new URI for each dataset it harvests. So if you query the union of data.europa.eu and datavindplaats, you get the same dataset twice in RDF terms, under two different subjects.

This is the reason for introducing identifier management: it lets one detect duplicates by comparing an identifier rather than by comparing titles.
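The duplicate detection described above could look roughly like the following Python sketch. It assumes each harvested record has already been reduced to a (subject URI, dct:identifier) pair; the portal URIs and identifiers are hypothetical and do not reflect how data.europa.eu or datavindplaats actually mint URIs.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group subject URIs by dct:identifier; keep identifiers seen more than once."""
    by_identifier = defaultdict(list)
    for subject, identifier in records:
        by_identifier[identifier].append(subject)
    return {
        identifier: sorted(subjects)
        for identifier, subjects in by_identifier.items()
        if len(subjects) > 1
    }


# Hypothetical union of two catalogues: each portal minted its own subject
# URI, but the dct:identifier was carried over unchanged.
records = [
    ("https://portal-a.example/dataset/abc", "https://example.org/id/dataset/1"),
    ("https://portal-b.example/dataset/42", "https://example.org/id/dataset/1"),
    ("https://portal-a.example/dataset/xyz", "https://example.org/id/dataset/2"),
]

duplicates = find_duplicates(records)
print(duplicates)
```

Comparing subjects alone would report three distinct datasets here; grouping on the identifier shows that two of them describe the same dataset.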