HVD C8. Indicate in reporting the MS responsibility

SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP

https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe


HVD C8. Indicate in reporting the MS responsibility #258

Closed bertvannuffelen closed 6 months ago

bertvannuffelen commented 1 year ago

Each MS has to perform the necessary actions to comply with the HVD regulation.

DCAT-AP describes rules for any catalogue (data portal); the full collection of a MS could be the result of aggregating multiple catalogues. DCAT-AP does not in any way require a single catalogue per jurisdiction; it is agnostic on that aspect.

proposal

option A - create a MS catalogue for all HVD datasets

Each MS provides a separate DCAT-AP HVD catalogue containing only the metadata that is relevant in the context of the HVD. Note that, if the HVD rules on persistent identifiers are followed, this will never lead to duplicates in the aggregated data.europa.eu. The same data can be supplied in different ways.

(+) No new property introduced
(+) Clear scope of reporting
(+) Use of persistent identifiers is mandatory
(+) MS can include information from other catalogues
(-) Portal system implementers may have to provide support for such a separate catalogue.
(-) Visitors to a portal who would like to see the MS perspective need access to this catalogue. This must be provided by portal implementers.

option B - add a property to indicate HVD responsibility

(+) Clear indication by a publisher
(+) Probably a trivial extension for portal implementers
(-) High editorial effort
(-) In case of multiple catalogues, all portals should participate
(-) In case of multiple catalogues, no global overview exists unless someone maintains it. The reporting may thus still become complex and may lead to implementing option A anyway.

jakubklimek commented 1 year ago

I vote for B.

The way I see it, if every relevant catalogue implements B, creating A from that is trivial, if necessary, given that the APIs of the catalogues are known and standardized. However, this should be enforced by the individual jurisdictions responsible for the reporting. This is the case for Czechia with DCAT-AP-CZ and also Slovakia with their DCAT-AP-SK 2.0.

Maintaining a separate catalogue despite all HVDs also being open data, and therefore typically being required to be present in some open data catalogue, seems redundant.

> Note that, if the HVD rules on persistent identifiers are followed, this will never lead to duplicates in the aggregated data.europa.eu. The same data can be supplied in different ways.

The persistent identifiers are not necessarily the only identifiers of the dataset. Those can include:

* IRI in a local catalog
* IRI in a national catalog
* IRI in data.europa.eu
* identifier other than IRI assigned by the data publisher
* ...

AFAIK there is currently no way of saying which of the identifiers is the persistent one (unless we devise a way, e.g. using adms:Identifier). I am not even sure that we can say that the persistent identifier needs to be present in all the places where a single HVD can be registered.

Therefore, I am not so sure about the duplicates. At the same time, I do not see the duplicates as a big issue - for me, it is better to have duplicates than to have nothing at all.

matthiaspalmer commented 1 year ago

I vote for B. I also second @jakubklimek's point that a special HVD catalog can be generated by filtering on a property (or via more complex mechanisms, like checking quality measurements for datasets according to dqv).
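As a minimal sketch of the "filter on a property" idea, assuming triples are modelled as plain Python tuples and that the HVD flag is a property spelled `r5r:hvdCategory` (the `ex:` IRIs and both function names are made up for illustration, not part of DCAT-AP):

```python
# Triples are modelled as (subject, predicate, object) tuples.
# All ex: IRIs are hypothetical examples.
RDF_TYPE = "rdf:type"
DCAT_DATASET = "dcat:Dataset"
HVD_CATEGORY = "r5r:hvdCategory"  # assumed spelling of the HVD category property

catalogue = {
    ("ex:d1", RDF_TYPE, DCAT_DATASET),
    ("ex:d1", HVD_CATEGORY, "ex:geospatial"),
    ("ex:d2", RDF_TYPE, DCAT_DATASET),
    ("ex:d3", RDF_TYPE, DCAT_DATASET),
    ("ex:d3", HVD_CATEGORY, "ex:statistics"),
}


def hvd_datasets(triples):
    """Return the datasets that carry at least one HVD category."""
    datasets = {s for (s, p, o) in triples if p == RDF_TYPE and o == DCAT_DATASET}
    flagged = {s for (s, p, o) in triples if p == HVD_CATEGORY}
    return sorted(datasets & flagged)


def hvd_catalogue(catalogue_iri, triples):
    """Build a reference-only catalogue (as in option A) from the flagged datasets."""
    result = {(catalogue_iri, RDF_TYPE, "dcat:Catalog")}
    result |= {(catalogue_iri, "dcat:dataset", d) for d in hvd_datasets(triples)}
    return result


print(hvd_datasets(catalogue))
```

The derived catalogue only holds references, so it stays in sync with the annotations as long as it is regenerated after each change.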

bertvannuffelen commented 1 year ago

There is also the use case that a HVD for a MS could be published by a supranational organisation, e.g. Eurostat or OECD. How does a MS then include that dataset in the reporting?

bertvannuffelen commented 1 year ago

> > Note that, if the HVD rules on persistent identifiers are followed, this will never lead to duplicates in the aggregated data.europa.eu. The same data can be supplied in different ways.
>
> The persistent identifiers are not necessarily the only identifiers of the dataset. Those can include:
>
> * IRI in a local catalog
> * IRI in a national catalog
> * IRI in data.europa.eu
> * identifier other than IRI assigned by the data publisher
> * ...
>
> AFAIK there is currently no way of saying which of the identifiers is the persistent one (unless we devise a way, e.g. using adms:Identifier). I am not even sure that we can say that the persistent identifier needs to be present in all the places where a single HVD can be registered.

The HVD regulation states that each dataset should be provided with a persistent link. As proposed in https://github.com/SEMICeu/DCAT-AP/issues/252, we must assume that this is the case.

If we state that the DCAT-AP metadata exchanged for datasets in the scope of the regulation does not provide persistent links, the rules fail to implement the regulation.

That a dataset may have several persistent links is fine. But we have to make a clear statement about which one is offered to the EC in the reporting. In the end, that report must be made, and in that report each dataset has a single identifier, which must be persistent.

My proposal is that the following query

  SELECT ?hvd WHERE {
    ?hvd a dcat:Dataset .
    ?hvd r5r:hvdCategory ?cat .
  }

provides the list of datasets that are in the report.

For a MS with a single MS portal, I would hope the result is the same on the SPARQL endpoint of the MS portal as on data.europa.eu (scoped to the MS).

In this proposed query I made the assumption that the URI of the dataset description is the persistent link of the dataset. That is the proposal that most strengthens the DCAT-AP ecosystem. (It is already implicit, but unfortunately not the practice; see our discussions on identifiers.)

bertvannuffelen commented 1 year ago

> Therefore, I am not so sure about the duplicates. At the same time, I do not see the duplicates as a big issue - for me, it is better to have duplicates than to have nothing at all.

My note on duplicates means that creating another catalogue does not introduce duplicates in the DCAT-AP network if PURIs are used.

Consider two catalogues that describe each dataset, data service and distribution in detail.

geo.gov:cat a dcat:Catalog ;
   dcat:dataset geo.gov:d1.

geo.gov:d1 a dcat:Dataset;
   dct:title "Buildings in Gov";
   ...
   dcat:distribution geo.gov:d1-bulk.

and

data.gov:cat a dcat:Catalog ;
   dcat:dataset data.gov:dbus1.

data.gov:dbus1 a dcat:Dataset;
   dct:title "Business register of Gov" ;
   ...
   dcat:distribution data.gov:dbus1-bulk.

then the next catalogue, that only contains references

data.gov:hvd a dcat:Catalog;
    dcat:dataset geo.gov:d1;
    ...
    dcat:dataset data.gov:dbus1.

can safely be harvested. RDF based harvesters should not create duplicates, as this is just an RDF merge operation. And I think this should be the case for any kind of harvester.
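To illustrate why the merge does not duplicate anything, here is a rough sketch that models each catalogue as a Python set of triples, so that harvesting becomes a set union. It assumes the reference-only catalogue points at the same IRIs as the detailed catalogues (`geo.gov:d1` and `data.gov:dbus1`); the `merge` and `datasets` helpers are invented for this example and do not correspond to any harvester API.

```python
# Each catalogue is a set of (subject, predicate, object) triples.
geo = {
    ("geo.gov:cat", "rdf:type", "dcat:Catalog"),
    ("geo.gov:cat", "dcat:dataset", "geo.gov:d1"),
    ("geo.gov:d1", "rdf:type", "dcat:Dataset"),
    ("geo.gov:d1", "dct:title", "Buildings in Gov"),
}
data = {
    ("data.gov:cat", "rdf:type", "dcat:Catalog"),
    ("data.gov:cat", "dcat:dataset", "data.gov:dbus1"),
    ("data.gov:dbus1", "rdf:type", "dcat:Dataset"),
    ("data.gov:dbus1", "dct:title", "Business register of Gov"),
}
# The HVD catalogue only references datasets by their persistent IRIs.
hvd = {
    ("data.gov:hvd", "rdf:type", "dcat:Catalog"),
    ("data.gov:hvd", "dcat:dataset", "geo.gov:d1"),
    ("data.gov:hvd", "dcat:dataset", "data.gov:dbus1"),
}


def merge(*graphs):
    """RDF merge of ground triples: a plain set union."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def datasets(triples):
    """List every resource typed as dcat:Dataset."""
    return sorted({s for (s, p, o) in triples if p == "rdf:type" and o == "dcat:Dataset"})


merged = merge(geo, data, hvd)
print(datasets(merged))
```

The merged graph still contains exactly two datasets; the third catalogue only adds membership triples. This holds only as long as every catalogue uses the same IRI for the same dataset.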

Thus the reporting effort for a policy officer could be to maintain a list of persistent URIs in a catalogue that is harvested by data.europa.eu. The assessment of the reporting can then be done by querying the SPARQL endpoint of data.europa.eu. (Thus, instead of managing an Excel list, they manage a DCAT-AP catalogue.)

jakubklimek commented 1 year ago

@bertvannuffelen

> For a MS with a single MS portal, I would hope the result is the same on the SPARQL endpoint of the MS portal as on data.europa.eu (scoped to the MS).

Well, I think this is where the problem is, connected to the various IRIs of a dataset. From an IRI management point of view, each server responsible for an IRI namespace should be in control of that namespace, i.e. assign IRIs to things that should be served under that namespace.

A more concrete DCAT example: a single dataset has the following IRIs:

* https://monitor.statnipokladna.cz/api/opendata/monitor/ciselnik-aktivni-organizace in the catalog of the Ministry of Finance, because it runs on the domain monitor.statnipokladna.cz
* https://data.gov.cz/zdroj/datové-sady/00006947/0f1dceabc234e73000d944f2466fdb51 in the National Open Data Catalog, because it runs on the data.gov.cz domain, and we want the IRIs to be dereferenceable. Here, the original https://monitor.statnipokladna.cz/api/opendata/monitor/ciselnik-aktivni-organizace is preserved in dcterms:identifier
* http://data.europa.eu/88u/dataset/https-monitor-statnipokladna-cz-api-opendata-monitor-ciselnik-aktivni-organizace~~1 in data.europa.eu for the same reasons; however, the National Open Data Catalog IRI is nowhere to be found here.

So, the query

  SELECT ?hvd WHERE {
    ?hvd a dcat:Dataset .
    ?hvd r5r:hvdCategory ?cat .
  }

(should it be r5r or m8g?) run on https://data.gov.cz/sparql and https://data.europa.eu/sparql will definitely yield different results. But I think this is OK, as it is in line with the above-mentioned issue of control over an IRI namespace and with the current state of things.

But for us it means that we need to denote somehow the identifier that should be the one.
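A rough sketch of what "the one" identifier could buy us, assuming (and this is only an assumption, not current practice everywhere) that every catalogue in the hierarchy keeps the publisher's original IRI in dcterms:identifier. The records and the `group_by_identifier` helper are hypothetical; the example.* IRIs stand in for real catalogue IRIs.

```python
from collections import defaultdict

# Each record: (IRI minted by a catalogue, value kept in dcterms:identifier).
# All values are hypothetical.
records = [
    ("https://local.example/dataset/orgs", "https://local.example/dataset/orgs"),
    ("https://national.example/datasets/0f1d", "https://local.example/dataset/orgs"),
    ("https://eu.example/dataset/orgs~~1", "https://local.example/dataset/orgs"),
    ("https://national.example/datasets/77aa", "https://other.example/dataset/roads"),
]


def group_by_identifier(records):
    """Group catalogue-specific IRIs by the identifier they all preserve."""
    groups = defaultdict(set)
    for iri, identifier in records:
        groups[identifier].add(iri)
    return dict(groups)


groups = group_by_identifier(records)
for identifier, iris in sorted(groups.items()):
    print(identifier, len(iris))
```

With such a shared identifier, the same query on two endpoints still returns different IRIs, but the results can be reconciled into the same set of datasets.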

jakubklimek commented 1 year ago

> data.gov:hvd a dcat:Catalog;
>     dcat:dataset geo.gov:d1;
>     ...
>     dcat:dataset data.gov:dbus1.
>
> can safely be harvested. RDF based harvesters should not create duplicates, as this is just an RDF merge operation. And I think this should be the case for any kind of harvester.

Sure, but this assumes option A - maintaining a separate catalog.

If we reuse existing catalogs in the catalog hierarchy, each mints its own IRIs for the harvested datasets and copies (some of) the triples from the harvested catalog to enable IRI dereferencing on its own domain. This is why I think it may be more feasible to explicitly denote a single "persistent" link, which would survive the harvesting processes in the catalog hierarchy. Then it is up to the MS to decide who assigns and maintains this link, whether the original publisher, the national data catalog provider, etc. A catalog just like the one in option A can then be generated from that, e.g. for reporting purposes.

bertvannuffelen commented 1 year ago

> If we reuse existing catalogs in the catalog hierarchy, each mints its own IRIs for the harvested datasets and copies (some of) the triples from the harvested catalog to enable IRI dereferencing on its own domain. This is why I think it may be more feasible to explicitly denote a single "persistent" link, which would survive the harvesting processes in the catalog hierarchy. Then it is up to the MS to decide who assigns and maintains this link, whether the original publisher, the national data catalog provider, etc. A catalog just like the one in option A can then be generated from that, e.g. for reporting purposes.

You are right, this assumes that the harvester maintains the catalogue. It is not guaranteed that this information is transferred throughout the network.

But I think we are here at the fine level of MS-EC reporting. In each MS there is a single contact point to provide the reporting to the EC (I ignore the exceptional case of multiple contact points). That contact point must provide a list to the EC saying "this is my MS state of affairs". Only that list is considered to be HVD for the MS.

Any other dataset that might be annotated as HVD and would show up in data.europa.eu is, according to that list, de facto in violation of the MS view. As this list is a policy/political decision, the annotation is thus an activity that this contact point undertakes with each individual dataset owner.

My objective is that the policy officers of a MS are doing activities that are in line with the metadata that is provided on the data portals. It is very easy to hand in a list with 20 HVD and at the same time have 100 HVD in a portal, or vice versa. Personally I would find that a failure of the efforts we do for improving the metadata descriptions.
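The kind of check I have in mind could look roughly like the sketch below: compare the list handed in to the EC with the datasets annotated as HVD in the portal. The function name, the input lists and the IRIs are all made up for illustration; in practice the annotated list would come from a query on the portal.

```python
def compare_report_with_portal(reported, annotated):
    """Compare the reported HVD list with the HVD-annotated datasets in a portal."""
    reported, annotated = set(reported), set(annotated)
    return {
        "consistent": sorted(reported & annotated),
        "reported_not_annotated": sorted(reported - annotated),
        "annotated_not_reported": sorted(annotated - reported),
    }


# Hypothetical persistent IRIs.
reported = ["ex:buildings", "ex:business-register"]
annotated = ["ex:buildings", "ex:roads"]

result = compare_report_with_portal(reported, annotated)
for key, iris in result.items():
    print(key, iris)
```

Both difference lists should be empty when the reporting and the portal metadata are in line.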

In general, as a visitor of an open data portal, I am not so interested in whether or not a dataset is a HVD. I am interested in whether or not the data is maintained and has sufficient quality to build a trustworthy decision/system upon. For me the HVD regulation should not lead to a race between MS for the most HVD datasets (that number is actually meaningless).

So for me, this topic is about how a policy officer of a MS could benefit from the metadata catalogue expressed in DCAT-AP to provide this list. In this way the reporting to the EC and the annotations on each individual dataset are in line. As scoping is a critical discussion for the reporting (so no moving targets), the notion of a catalogue is exactly the notion you want to use. I believe that this is exactly the reason why the term catalogue exists.

bertvannuffelen commented 1 year ago

> But for us it means that we need to denote somehow the identifier that should be the one.

If harvesters implemented the proposed identifier guidelines (see https://github.com/SEMICeu/DCAT-AP/blob/2.x.y-draft/releases/2.1.1/usageguide-identifiers.md), this would be easy to address.

To a certain extent, the HVD regulation requires implementing these guidelines.

bertvannuffelen commented 9 months ago

The section Reporting documents the need for reporting. Concrete reporting requirements that MS must comply with (format and process) are beyond the scope of the DCAT-AP HVD.