SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

Need for a common approach to modeling dataset series in DCAT-AP #155

Closed aidig closed 2 months ago

aidig commented 4 years ago

The need for a common approach to modeling dataset series has already been identified as a significant outstanding issue in DCAT (https://github.com/w3c/dxwg/issues/868), and it will hopefully be addressed in DCAT 3.

However, in the meantime, various national and domain-specific profiles of DCAT-AP 2.0 already propose structures to handle dataset series, even though neither DCAT nor DCAT-AP offers the necessary properties or classes, or specific guidelines, directly in the specification documents. Several approaches seem to favour dct:hasPart/dct:isPartOf, although other proposals have also emerged.

It would be beneficial if this issue could be prioritised in DCAT-AP future work.

jakubklimek commented 4 years ago

I can confirm that the Czech National Open Data Catalog (https://data.gov.cz) will soon be implementing dataset series through dcterms:hasPart/dcterms:isPartOf.

aidig commented 4 years ago

Also, the article "DCAT-AP: How to model Dataset series?" was published previously, but the document is 4 years old and its status is unclear. Link: https://joinup.ec.europa.eu/release/dcat-ap-how-model-dataset-series The page states "switch to the latest release" and redirects to https://joinup.ec.europa.eu/release/which-processes-and-tools-could-be-used-manage-quality-metadata/10

aidig commented 4 years ago

Related references:

bertvannuffelen commented 3 years ago

@aidig thanks for the good overview. Let's work towards a clearer proposal.

aidig commented 3 years ago

There are several examples of an approach not mentioned in the above list, namely specifying the annual 'versions' as dcat:Distributions.

For instance: https://data.europa.eu/euodp/da/data/dataset/DAT-105-enta/dataset/transparency-register https://www.europeandataportal.eu/data/datasets/e6d7b3ac-1ef1-476f-aed2-15645ba60248?locale=en

This approach does not seem to take into consideration DCAT's note on the use of dcat:Distribution, namely that "all distributions of one dataset should broadly contain the same data." DCAT does also state that "the distributions might have different levels of fidelity to the underlying data" and that the interpretation is 'application specific', but such use seems problematic, and advice and recommendations from the DCAT Application Profile are still required. Ref: https://www.w3.org/TR/vocab-dcat-2/#Class:Distribution
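To make the concern concrete, here is a minimal sketch of a heuristic that separates "same data, different formats" from "different years packed into one dataset". It is not part of DCAT or DCAT-AP: the dictionaries, the "period" field, the ex: identifiers and both function names are assumptions made up for illustration.

```python
from collections import defaultdict


def distributions_by_period(distributions):
    """Group distribution IDs by the period their data covers."""
    groups = defaultdict(list)
    for dist in distributions:
        groups[dist["period"]].append(dist["id"])
    return dict(groups)


def looks_like_series(distributions):
    """Heuristic: more than one distinct period suggests a series,
    not alternative representations of the same data."""
    return len(distributions_by_period(distributions)) > 1


# Hypothetical records: annual files published as distributions of one dataset.
annual_files = [
    {"id": "ex:register-2019", "period": "2019", "format": "CSV"},
    {"id": "ex:register-2020", "period": "2020", "format": "CSV"},
]

# Hypothetical records: the same data offered in two formats.
same_data_two_formats = [
    {"id": "ex:register-csv", "period": "2020", "format": "CSV"},
    {"id": "ex:register-json", "period": "2020", "format": "JSON"},
]

print(looks_like_series(annual_files))           # True
print(looks_like_series(same_data_two_formats))  # False
```

Only the second case matches the "broadly contain the same data" reading of the note; whether the first is acceptable is exactly the application-specific question raised here.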

It would be great if the DCAT-AP's proposal for guidelines on this topic could address this too.

aidig commented 3 years ago

In addition, note that JoinUp uses dct:isVersionOf (in an ADMS-AP) to link solutions (e.g. a vocabulary modelled as a dcat:Dataset) to a release (e.g. a versioned vocabulary modelled as a dcat:Dataset). This is a different scenario from time series, but relevant for scoping the properties needed.

See related issue: Modelling three-level structures with DCAT/ADMS #149

aidig commented 3 years ago

The DCAT Application Profile for Base Registries (bregDCAT-AP) has, as noted above, already decided to model relationships in which datasets are contained in other datasets (that is, a dataset is a subset of another) using dct:hasPart/dct:isPartOf. It also states that any similar mechanism adopted in the future should be based on these Dublin Core terms.

To ensure interoperability, please ensure close coordination and collaboration between DCAT-AP and bregDCAT-AP.

Generally, there is a need for modelling a dataset that is part of another dataset, and one can only hope that the various profiles of DCAT take the same approach in modelling this relationship.

andrea-perego commented 3 years ago

To complement your survey, @aidig, DCAT-AP_IT (the Italian profile of DCAT-AP) provides guidelines on the use of dct:hasPart, dct:isPartOf, dct:hasVersion, dct:isVersionOf:

https://docs.italia.it/italia/daf/lg-patrimonio-pubblico/it/stabile/modellometadati.html#come-gestire-le-relazioni-tra-dataset

I include below the (automatic) English translation:

How to manage relationships between datasets

The European vocabulary DCAT treats the main conceptual dataset entity as independent, seen only in relation to the catalog and its distributions. However, in practice, more complex relationships emerge between datasets, as in the case of series of datasets (e.g. time series), versionings, portions of a larger dataset, or collections (i.e. datasets that belong to a general topic but are based on different dimensions, also based on specific use cases; an example is the case of the election results datasets). This current gap in the DCAT vocabulary also affects the European DCAT-AP profile, which in any case provides recommendations for possible implementations in the presence of these complex relationships.

 NOTE

In the context of these guidelines, the relevant European recommendations are adopted.

In particular, although administrations are encouraged, where possible, to limit the proliferation of datasets, in order to model their inter-relationships, some representation methods are listed below:

  • in the case of versioning, the current Italian profile DCAT-AP_IT already provides for the use of the Dublin Core dct:isVersionOf property; administrations can also use the reverse property dct:hasVersion in addition to create a relationship between two different versions of the data. However, it is not recommended to create new datasets for small data changes. Instead, it is recommended to define new datasets only in the presence of significant changes compared to previous versions (e.g. new elements included, significant adaptations of some elements, etc.);

  • in the case of data series, views on datasets and collections it is recommended to adopt the following solution:

    • Emphasize the series, view or collection itself, creating a single dataset for it whose members are different distributions of the created dataset.
    • However, where such a solution is difficult to apply, it is also possible to emphasize the individual elements of the series, views or collections. In this case, however, it is advisable to proceed as follows:
      • create a series-type dataset, using the Dublin Core dct:type property with the value <http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series>;
      • create for this dataset many members, in turn datasets, specified by the Dublin Core property dct:hasPart;
      • individual datasets that are members of the series will have a Dublin Core dct:isPartOf property that binds them to the initial series dataset.
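The second option in the translated guidelines (a series-type dataset linked to its members in both directions) can be sketched as follows. This is only an illustration under stated assumptions: plain tuples stand in for RDF triples, the example.org URIs are invented, and build_series and to_ntriples are hypothetical helpers, not part of any profile or library.

```python
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"
SERIES_TYPE = "http://inspire.ec.europa.eu/metadata-codelist/ResourceType/series"


def build_series(series_uri, member_uris):
    """Return (subject, predicate, object) triples for a series dataset
    and its member datasets, linked with dct:hasPart and dct:isPartOf."""
    triples = [
        (series_uri, RDF_TYPE, DCAT_DATASET),
        (series_uri, DCT + "type", SERIES_TYPE),
    ]
    for member in member_uris:
        triples.append((member, RDF_TYPE, DCAT_DATASET))
        triples.append((series_uri, DCT + "hasPart", member))
        triples.append((member, DCT + "isPartOf", series_uri))
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Invented example URIs: one series with two annual member datasets.
series = "https://example.org/dataset/budget-series"
members = [
    "https://example.org/dataset/budget-2019",
    "https://example.org/dataset/budget-2020",
]
series_triples = build_series(series, members)
print(to_ntriples(series_triples))
```

Emitting both directions keeps the series navigable from either end; a catalogue that stores only dct:isPartOf, as in the Czech case, could derive the inverse when needed.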

aidig commented 3 years ago

Many thanks for the info @andrea-perego! Much appreciated :-)

How does the solution below align with the semantics of dcat:Distribution and the corresponding W3C note: "all distributions of one dataset should broadly contain the same data"? DCAT does also state, though, that "the question of whether different representations can be understood to be distributions of the same dataset, or distributions of different datasets, is application specific."

  • in the case of data series, views on datasets and collections it is recommended to adopt the following solution:

    • Emphasize the series, view or collection itself, creating a single dataset for it whose members are different distributions of the created dataset.

andrea-perego commented 3 years ago

I think it complies with it. The dcat:Distribution NOTE in DCAT2 was included following requests for guidance - and it is not prescriptive. It gives an indication about the default approach to be used, but it recognises (as stated in the sentence you cite) that alternative solutions are applicable as well, depending on the requirements of the application scenario.

pebran commented 3 years ago

I can confirm that the Czech National Open Data Catalog (https://data.gov.cz) will soon be implementing dataset series through dcterms:hasPart/dcterms:isPartOf.

@jakubklimek will that be with direct use of the DC properties or by creation of more specific subproperties?

jakubklimek commented 3 years ago

@pebran It will be the direct use of dcterms:isPartOf.

init-dcat-ap-de commented 1 year ago

So we now have two types of Datasets, those inSeries and normal Datasets.

The Dataset member of a Dataset Series has only 6 properties. Of these, title, description and frequency are also part of the "normal" Dataset. One could think that these are the only 6 properties we want to see in a Dataset member of a Dataset Series. But I think that's wrong. As far as I can see, these three properties have different usage texts than their "normal Dataset" counterparts.

I think this should be better explained.

bertvannuffelen commented 1 year ago

Indeed, there are 2 types.

Observe that the type InSeries Dataset is a subclass of a normal Dataset. That means that all properties of a normal Dataset apply to those of an InSeries Dataset. That is the nature of a subclass.

Only the properties that require special attention for an InSeries Dataset are included for that class. These are the mandatory normal Dataset ones, those with their updated usage guidelines and constraints, and those that are unique for this scope. That allows readers to have a focused view.

So we rely on users understanding the notion of a subclass as: "all rules and constraints of the superclass apply to me".
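A minimal sketch of that reading, assuming a toy model in which each profile class carries a mapping from property label to usage note. ProfileClass, the usage-note strings and the particular properties listed are illustrative assumptions, not the actual DCAT-AP 3.0 tables.

```python
class ProfileClass:
    """Toy model of a class in an application profile."""

    def __init__(self, name, properties, parent=None):
        self.name = name
        self.own_properties = dict(properties)  # property label -> usage note
        self.parent = parent

    def effective_properties(self):
        """Properties of the superclass, overridden or extended by this class."""
        inherited = self.parent.effective_properties() if self.parent else {}
        return {**inherited, **self.own_properties}


# Illustrative subset only; usage notes are placeholders.
dataset = ProfileClass("Dataset", {
    "title": "general usage note",
    "description": "general usage note",
    "frequency": "general usage note",
    "keyword": "general usage note",
})

# The subclass lists only what needs special attention.
in_series = ProfileClass("InSeries Dataset", {
    "title": "series-specific usage note",
    "in series": "series-specific usage note",
}, parent=dataset)

effective = in_series.effective_properties()
print(sorted(effective))
# ['description', 'frequency', 'in series', 'keyword', 'title']
print(effective["title"])    # series-specific usage note
print(effective["keyword"])  # general usage note
```

Properties that are not repeated on the subclass still apply, and a repeated property only refines its usage note.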

We could add to the class usage note an additional sentence such as "This class is a subclass of Dataset and therefore all properties with their constraints apply to this. For readability purposes these are not copied to this class."

Note that a similar general statement w.r.t. DCAT is mentioned in the last paragraph of https://semiceu.github.io/DCAT-AP/releases/3.0.0/#specoverview.

init-dcat-ap-de commented 1 year ago

I think the subclass relationship is difficult because it's technically not a subclass. It uses the same URI as the "normal" Dataset.

My suggestion would be to remove the subclass relationship and adjust the usage note to something like this:

If a Dataset is used as part of a DatasetSeries, the properties listed here can be used additionally, or slightly differently to those listed for the Dataset outside of a DatasetSeries.