HVD C1. Being part of HVD scope

bertvannuffelen commented 1 year ago

The HVD defines 6 thematic data categories: geospatial; earth observation and environment; meteorological; statistics; companies and company ownership; mobility

Proposal: Create a new property m8g:hvdCategory defining the HVD category to which this resource belongs. The codelist will be created and maintained by the Publications Office. A resource may belong to more than one data category.

bertvannuffelen commented 1 year ago

In https://github.com/SEMICeu/DCAT-AP/issues/231#issuecomment-1317163139 a single, data category agnostic, indicator was proposed.

Proposal C1 above addresses both the proposal from that comment and the need to know which HVD data category the dataset is associated with.

anarosa-es commented 1 year ago

What about creating a sub-class "High Value Dataset"? It could carry any property and relationship specific to HVD. It could also have the six specific sub-classes currently provided by the regulation, if they have some specificities to model, such as INSPIRE XML distribution.

jakubklimek commented 1 year ago

@anarosa-es please see https://github.com/SEMICeu/DCAT-AP/issues/231#issuecomment-1327339254 for the discussion on this topic

bertvannuffelen commented 1 year ago

I want to draw attention to the fact that all rules discussed are to be read as: the rule is applicable when the dataset is subject to the HVD regulation.

It means that if you have a dataset, a distribution or an API that is not subject to the regulation, you should not indicate that it is subject to the HVD regulation. Conversely, if the dataset is in scope of the HVD regulation and the MS wants to report about this in DCAT-AP, then these rules should be followed.

According to the HVD regulation, MS are not obliged to use DCAT-AP as the only metadata format for HVD. But in that case the reporting should still be provided in the end, and all metadata requirements expressed in the regulation should be fulfilled.
An MS could benefit from having all metadata represented in a single metadata format to ease the follow-up and reporting, but that is not a requirement of the HVD regulation. As DCAT-AP is cross-domain, the likelihood that the reporting requirements can be realized with this approach is high.

sirex commented 1 year ago

In order to ensure that a dataset and all distributions of that dataset are HVD compliant, I would use the existing ways to define relationships between a dataset and other resources:

https://www.w3.org/TR/vocab-dcat-3/#qualified-relationship

That means, if I have a dataset "Addresses", then I could create a derived version of that dataset, fully compatible with HVD requirements, while leaving the original "Addresses" dataset in place, containing distributions that are not HVD compliant.

Basically there would be a hierarchy of datasets:

`- Official national register for addresses (not open data)
    |- Addresses (random, poor quality open CSV files, published somewhere)
    |- Addresses (HVD compliant data)
    `- Addresses by region (Data Series, another publication of same data, but in a different data organization)
      |- Addresses of Region 1
      `- Addresses of Region 2

This way, we will know the origin of all the datasets, and each dataset could be tuned for different use cases or regulations.

So this way, if a dataset is tagged as HVD, all distributions of that dataset MUST also be HVD compliant without explicit tagging.

If a HVD dataset contains non HVD distributions, then a new dataset should be created to split distributions into HVD and non HVD.

bertvannuffelen commented 1 year ago

@sirex The Qualified Relationship is an option for encoding a related network of entities.

You also suggest applying implicit assumptions on compliance. Personally, I am reluctant to do this, as I have the feeling it is not the general case. Secondly, my reluctance comes from the fact that the legislation might impose just a small requirement (e.g. a bulk download with an open licence). If the implicit rule put all other distributions in scope of the legislation as well, this might be an unintended side effect of the interpretation. Leaving room for too wide an interpretation is a risk: it might give users a false impression.

sirex commented 1 year ago

My suggestion is to use explicit HVD tagging only on datasets, and if a dataset is tagged as HVD, then all distributions of that dataset MUST also be HVD compliant.

If a HVD dataset has some distributions, that are not HVD compliant, then such distribution, should be split into two distributions, where one contains only HVD compliant distributions and another, all other non HVD compliant distributions.

So I think this is still pretty explicit and can be validated automatically. Also, this would help to separate concerns: the dataset defines the scope, and distributions MUST fit into the scope defined by the dataset. So if a dataset declares that it is an HVD dataset, then any non HVD distributions can't be part of this dataset and must be published as part of another, non HVD dataset.
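As an illustration of how such a rule could be validated automatically, here is a minimal sketch. It assumes a simplified in-memory model: the `Dataset` and `Distribution` classes, the `is_hvd` flag and the `licence_is_open` field are hypothetical and not DCAT-AP terms, and what "HVD compliant" means for a distribution is left to a check function supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Distribution:
    title: str
    licence_is_open: bool  # hypothetical stand-in for a real compliance criterion


@dataclass
class Dataset:
    title: str
    is_hvd: bool  # hypothetical flag: the dataset is explicitly tagged as HVD
    distributions: List[Distribution] = field(default_factory=list)


def non_compliant_distributions(
    dataset: Dataset,
    is_hvd_compliant: Callable[[Distribution], bool],
) -> List[Distribution]:
    """Return the distributions that violate the proposed rule.

    A dataset that is not tagged as HVD imposes no constraint,
    so it never yields violations.
    """
    if not dataset.is_hvd:
        return []
    return [d for d in dataset.distributions if not is_hvd_compliant(d)]


addresses = Dataset(
    title="Addresses (HVD compliant data)",
    is_hvd=True,
    distributions=[
        Distribution("Bulk download", licence_is_open=True),
        Distribution("Legacy CSV", licence_is_open=False),
    ],
)

# The compliance check here is deliberately trivial and only illustrative.
violations = non_compliant_distributions(addresses, lambda d: d.licence_is_open)
print([d.title for d in violations])
```

Under this rule, any distribution reported by the check would have to move to a separate, non HVD dataset.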

bertvannuffelen commented 1 year ago

If a HVD dataset has some distributions, that are not HVD compliant, then such distribution, should be split into two distributions, where one contains only HVD compliant distributions and another, all other non HVD compliant distributions.

Do you mean 'split in two datasets' instead of 'split into two distributions' ?

Personally I tend to avoid such rules, because the more complex metadata structuring rules there are the more difficult it is to maintain them, and to have them created.

From an end-user perspective, having 3 datasets in your portal called 'Addresses', 'Addresses HVD' and 'Addresses INSPIRE' is confusing. Is there a difference between them? Are they connected? Which one to use? Let's leave this choice open for the implementers; as each portal has its own "deduplication" rules, it is hard to impose a universal one. This avoids introducing specific DCAT-AP HVD deduplication rules.

While I understand your rule for distributions (file dumps), there is a similar case to be made for APIs: in many geospatial domains a single data service (API) is used to serve many datasets (each layer is a different dataset). How do you deal with that? Should an API only serve HVD datasets? Or can it be mixed? If we follow the splitting rule, you impose a change on the INSPIRE community, which would have to be accepted by them. We cannot impose a rule in DCAT-AP HVD that is violated by an accepted practice in a specific domain, unless the HVD forces a specific interpretation (e.g. URIs for licence information).

(Ps. Note: if this splitting eases your portal management and HVD reporting management, then you can still create a MS specific rule.)

sirex commented 1 year ago

Since splitting datasets by time or place, like Budget 2010, Budget 2011, ..., Budget 2022, is considered a good practice, I see consistency in 'Addresses', 'Addresses HVD' and 'Addresses INSPIRE'.

Splitting datasets not only by time or places, but also by type?

Is there a difference between them?

Yes, these datasets would differ by dct:type (assuming dct:type would be used to tag HVD datasets).

Are they connected?

Yes, see Relations between datasets.

Should a API only serve HVD datasets? Or can it be mixed?

An API should serve mixed datasets. But specific API endpoints, specified via distributions (dcat:accessService), should be HVD compliant.

If we follow the splitting rule, you impose a change to the INSPIRE community

As I understand it, INSPIRE and DCAT are two alternatives for the same thing? At least, we import INSPIRE datasets and convert them to DCAT, but imported INSPIRE datasets are read-only, because the primary data source of INSPIRE datasets is another metadata portal (geoportal).

So if INSPIRE does not support HVD tagging, the only thing I see that we can do is to extend INSPIRE datasets by creating an editable copy, which replaces the INSPIRE dataset and where we can add HVD tagging. At least that is the plan in our case.

bertvannuffelen commented 1 year ago

@Sirex, I start from the premise that the metadata is provided by the publishers, and thus that each metadata record has its counterpart in a real entity that is managed by that publisher. I also assume that this publisher has a single identifier for each entity it registers metadata for.
(I thus assume the publisher does not have to maintain the coherency of the metadata manually across representations: the internal organisation representation, the INSPIRE representation, the open data portal representation, the mobility portal representation, the bibliography representation, etc. If you enter the same data manually in different catalogues, you know you create duplicates, and thus there is a conscious act of misleading the public or of imposing extra effort to keep them in sync.)

Given this context, I am very hesitant to ask a dataset publisher to duplicate its metadata just in order to make the catalogue maintainer's life easier. In particular in this case, as most datasets in scope of HVD are already in a DCAT-AP catalogue.

Secondly, what if other legislation, like the DGA, comes into play? Then this pattern would create difficulties: it might lead to Addresses DGA not HVD, Addresses DGA and HVD, Addresses HVD not DGA. Personally, I believe that in the end portal visitors are not too concerned with this distinction.
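To make the growth concrete, here is a small sketch that enumerates the combinations a splitting rule would have to cover. The function name `split_variants` is hypothetical, and the third legislation in the second call is only there to show the trend.

```python
from itertools import combinations
from typing import List, Tuple


def split_variants(legislations: List[str]) -> List[Tuple[str, ...]]:
    """List every non-empty combination of legislations a split dataset could fall under."""
    variants: List[Tuple[str, ...]] = []
    for size in range(1, len(legislations) + 1):
        variants.extend(combinations(legislations, size))
    return variants


# Two legislations already give three variants of the same dataset.
print(split_variants(["HVD", "DGA"]))

# Each additional legislation roughly doubles the count (2**n - 1 in total).
print(len(split_variants(["HVD", "DGA", "INSPIRE"])))
```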

That is the reason why I leave the structuring to the publisher. That structure reflects its effort to serve the data community. So when the publisher likes to split the budget into separate datasets (grouped in a series), that is fine for me. But the motivation for that structure should come from the publisher. It is actually an interesting case: with APIs one sees a reduced need for these snapshots, and where snapshots are used, after a while publishers want to get rid of the old distributions and ship them off to a digital archive. When that happens, an overarching metadata catalogue could create this view very easily with a Dataset Series. Typically, when data is shipped to the archive, the reference to it is removed by the original publisher. As a metadata catalogue spans both organisations, the information needed to find the historic data can still be served to the user as if nothing has changed.

sirex commented 1 year ago

Create new property m8g:hvdCategory defining the HVD category to which this resource belongs.

What is m8g?

Why is a new property needed instead of using dcat:theme?

jakubklimek commented 1 year ago

m8g is the namespace used by the Core Vocabularies.

dcat:theme could be used as well, but there the HVD categories would be mixed with other themes and you would need additional data (the IRI of the ConceptScheme (skos:inScheme of the concept)) to know that it is an HVD category. A new property would make this distinction more obvious and without the need to download/duplicate external data - the HVD category codelist.
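The difference can be sketched as follows. This is a minimal illustration over plain dictionaries, not an actual DCAT-AP implementation: all IRIs are hypothetical placeholders, and the `concept_in_scheme` mapping stands in for the external codelist data (the skos:inScheme statements) that a consumer would have to obtain first.

```python
from typing import Dict, List

# Hypothetical IRIs, for illustration only.
HVD_SCHEME = "https://example.org/scheme/hvd-categories"
THEME_SCHEME = "https://example.org/scheme/data-themes"
GEOSPATIAL = "https://example.org/hvd/geospatial"
REGIONS = "https://example.org/theme/regions"

# External data: concept IRI -> IRI of its concept scheme (skos:inScheme).
concept_in_scheme: Dict[str, str] = {
    GEOSPATIAL: HVD_SCHEME,
    REGIONS: THEME_SCHEME,
}

# One dataset record, described both ways for comparison.
dataset: Dict[str, List[str]] = {
    "dcat:theme": [REGIONS, GEOSPATIAL],
    "hvdCategory": [GEOSPATIAL],
}


def hvd_categories_from_theme(
    record: Dict[str, List[str]], in_scheme: Dict[str, str]
) -> List[str]:
    """Filter the mixed themes; this only works if the codelist data is available."""
    return [c for c in record.get("dcat:theme", []) if in_scheme.get(c) == HVD_SCHEME]


def hvd_categories_from_property(record: Dict[str, List[str]]) -> List[str]:
    """Read the dedicated property directly; no external data is needed."""
    return list(record.get("hvdCategory", []))


print(hvd_categories_from_theme(dataset, concept_in_scheme))
print(hvd_categories_from_property(dataset))
```

Without the scheme mapping, the theme-based lookup cannot tell the HVD category apart from the other themes, while the dedicated property still answers the question.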

bertvannuffelen commented 1 year ago

To summarize the DCAT-AP HVD approach:

use r5r:applicableLegislation to denote if a resource (e.g. dataset) is within scope of the HVD IR by providing a reference to the HVD IR ELI.
use r5r:hvdCategory to denote the (top-level) category as named in the HVD IR.

It is an additive approach to existing metadata.
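A minimal sketch of what "additive" means here, using plain tuples as triples rather than an RDF library. The expansion of the `r5r` prefix is an assumption, and the ELI and category IRIs are placeholders, not the real identifiers; the helper `add_hvd_metadata` is hypothetical.

```python
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]

# Assumption: the r5r prefix expands to this namespace.
R5R = "http://data.europa.eu/r5r/"
DCT = "http://purl.org/dc/terms/"

# Placeholders, not the real identifiers.
HVD_IR_ELI = "https://example.org/eli/hvd-implementing-regulation"
CATEGORY_GEOSPATIAL = "https://example.org/hvd-category/geospatial"
CATEGORY_MOBILITY = "https://example.org/hvd-category/mobility"


def add_hvd_metadata(
    triples: Sequence[Triple], resource: str, categories: Sequence[str]
) -> List[Triple]:
    """Return a copy of `triples` extended with the HVD statements for `resource`.

    Existing statements are kept untouched; only new ones are appended.
    """
    added: List[Triple] = [(resource, R5R + "applicableLegislation", HVD_IR_ELI)]
    added += [(resource, R5R + "hvdCategory", category) for category in categories]
    return list(triples) + added


dataset_iri = "https://example.org/dataset/addresses"
existing: List[Triple] = [(dataset_iri, DCT + "title", "Addresses")]

# A resource may belong to more than one category.
extended = add_hvd_metadata(
    existing, dataset_iri, [CATEGORY_GEOSPATIAL, CATEGORY_MOBILITY]
)
for triple in extended:
    print(triple)
```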

jakubklimek commented 1 year ago

@bertvannuffelen I did not find anywhere the reason why the controlled vocabulary used with r5r:hvdCategory only contains the 6 toplevel categories and not the lower level consisting of "datasets in scope" such as "Weather alerts" or "Radar data".

Those are identified in the HVD IR Annex and therefore could be used to improve search for e.g. "Weather alerts" from the EU. Or is there any other way to easily facilitate such search?

bertvannuffelen commented 1 year ago

@jakubklimek today the HVD data categories are more to provide a link with the specific HVD IR requirements rather than an enumeration of specific datasets.

There is no complete enumeration of specific datasets: the dataset names in the IR are indications of the scope rather than a limitative and precise naming.

Your example works for meteorological domain, while for the Earth Observation and Environment domain this is just a list of themes and articles in legislation.

The HVD category is not intended to harmonise dataset names or to group them for higher findability. It is more a policy means to ensure that datasets are maintained at a higher quality according to the best practices in each domain. That is the reason why there is such diversity at the detailed level.

Nevertheless, it might be an outcome of the implementation of HVD that "similar datasets" across Europe get similar tagging/naming. For instance, if there is agreement within the meteo domain that the data collection method (radar, weather station, public sources) is expressed as a keyword (or better, as a subject), then portal software could easily lead users to other data that is HVD and uses the same data collection method. Instead of creating yet another subcategory, my suggestion would be that publishers of HVD datasets invest in publishing the most precise and high-quality metadata. That focus will have more impact, I believe.

jakubklimek commented 1 year ago

@bertvannuffelen Thanks for the explanation. So if I understand it correctly, you suggest to leave to the individual domains to establish their own controlled vocabularies of themes in their domains? Do you know if in any of those domains, such activities are on-going? Or is the intention to leave this open-ended for, let's say a couple of years, and then re-visit the topic once we actually have some HVDs published?

HVD C1. Being part of HVD scope #251