SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

Improve recommendation for how to mix downloadable files and Data Services in a Dataset Series #279

Open matthiaspalmer opened 1 year ago

matthiaspalmer commented 1 year ago

There are datasets that are best considered dataset series with downloadable files but that are also accessible via a data service. We see several alternatives for how to indicate the relation between the dataset series and the data service:

  1. Add an extra dataset in the series with a distribution that points to the data service.
  2. Add an extra distribution on every dataset in the series and point to the data service.
  3. Point from the data service to the data series via dcat:servesDataset.
  4. Add a distribution on the data series that points to the data service.

Alternative 1 is suboptimal as it will disturb any nice ordering of datasets in the series, e.g. if they correspond to yearly downloads there will be one more that breaks the pattern.

Alternative 2 provides many relations when you only need one. It will be messy and error prone for a portal to detect that they are all the same and provide a more high-level presentation that indicates that you can access the whole dataset series from a single data service.
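
As a rough illustration of the repetition, alternative 2 could look like the sketch below. The ex: namespace, the URLs and all resource names are made up for illustration and do not come from any real catalogue.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:series a dcat:DatasetSeries .

# Every member dataset carries its own file distribution ...
ex:data-2021 a dcat:Dataset ;
    dcat:inSeries ex:series ;
    dcat:distribution ex:data-2021-file, ex:data-2021-api .

ex:data-2022 a dcat:Dataset ;
    dcat:inSeries ex:series ;
    dcat:distribution ex:data-2022-file, ex:data-2022-api .

ex:data-2021-file a dcat:Distribution ;
    dcat:accessURL <http://example.org/files/2021.csv> .

ex:data-2022-file a dcat:Distribution ;
    dcat:accessURL <http://example.org/files/2022.csv> .

# ... plus one extra distribution each, all pointing to the same service.
ex:data-2021-api a dcat:Distribution ;
    dcat:accessURL <http://example.org/api> ;
    dcat:accessService ex:api .

ex:data-2022-api a dcat:Distribution ;
    dcat:accessURL <http://example.org/api> ;
    dcat:accessService ex:api .

ex:api a dcat:DataService ;
    dcat:endpointURL <http://example.org/api> ;
    dcat:servesDataset ex:data-2021, ex:data-2022 .
```

A portal would have to compare the dcat:accessService values of all members to find out that there is really just one service behind the whole series.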

Alternative 3 is doable, but it is not in line with how other datasets are expected to link to data services via a distribution (at least that is our reading, i.e. you are not expected to only provide a dcat:servesDataset without a distribution pointing in the other direction).
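
A minimal sketch of alternative 3, again with a made-up ex: namespace and hypothetical resource names. The only link between the series and the service is the dcat:servesDataset statement on the service.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:series a dcat:DatasetSeries .

ex:data-2021 a dcat:Dataset ;
    dcat:inSeries ex:series .

ex:data-2022 a dcat:Dataset ;
    dcat:inSeries ex:series .

# The service points to the series; nothing on the series points back.
ex:api a dcat:DataService ;
    dcat:endpointURL <http://example.org/api> ;
    dcat:servesDataset ex:series .
```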

Alternative 4 seems most intuitive as it provides a data service for the whole dataset series.

We prefer alternative 4.
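
For comparison, alternative 4 could be sketched as follows. The ex: namespace and resource names are assumptions made for illustration only.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# The series itself gets a single distribution ...
ex:series a dcat:DatasetSeries ;
    dcat:distribution ex:series-api .

# ... which points to the data service covering the whole series.
ex:series-api a dcat:Distribution ;
    dcat:accessURL <http://example.org/api> ;
    dcat:accessService ex:api .

ex:api a dcat:DataService ;
    dcat:endpointURL <http://example.org/api> ;
    dcat:servesDataset ex:series .

# Member datasets keep only their downloadable files.
ex:data-2021 a dcat:Dataset ;
    dcat:inSeries ex:series ;
    dcat:distribution ex:data-2021-file .

ex:data-2021-file a dcat:Distribution ;
    dcat:accessURL <http://example.org/files/2021.csv> .
```

This follows the same Dataset, Distribution, Data Service pattern that is used for ordinary datasets, just applied at the series level.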

However, the following statement argues against alternative 4 in 14.1: "But the presence of these Distributions raise semantical conflicts such as whether the property of the Dataset Series frequency refers to the update frequency of the associated Distributions or the update frequency of the collection. To avoid these semantical conflicts, it is recommended not to associate distributions with a Dataset Series."

We do not think this statement is valid. W3C clearly states that dcterms:accrualPeriodicity is to be interpreted as the frequency with which new datasets are added. Hence the semantics are clear: it does not refer to updates of the data. Sure, there will be no way to talk about the update frequency of the data provided via the distributions of the dataset series, but that does not matter if we only use the distribution for pointing to the data service.

jakubklimek commented 1 year ago

Do you also presume having downloadable files as distributions of the series in case there is e.g. one big download, and then a series of yearly datasets?

I see another alternative, let's say Alternative 5 - if we allow DatasetSeries to be used for "grouping", then you can have a hierarchical series containing the original series and a dataset with the data service. Maybe it seems a bit complicated, but it keeps the DatasetSeries free of distributions.

matthiaspalmer commented 1 year ago

I think the common case is to provide a dataset series just because you do not want / can't provide a single big download. But sure, I see your point, it might happen.

I would prefer if it was possible to agree on a solution that provides a clear indication for a data portal that a dataset series has a data service (typically to provide a visual indication or a filter). With alternative 5, should we check all datasets in the series and, if one has a dcat:accessService, interpret this as applying to the whole dataset series? Or should we recommend this special pattern with exactly one dataset and one dataset series as part of the dataset series?

I think it is cleaner to add distributions on the dataset series itself. The semantics that such distributions would correspond to all the data found as part of the dataset series would not come as a surprise to anyone (I think).

jakubklimek commented 1 year ago

I see your point, but then again, there is a bigger problem coming from the fact that 1) DatasetSeries is a subclass of Dataset and 2) some of the properties (like frequency, or modified) have a different meaning when used on DatasetSeries and on Dataset, which seems to me like a semantic conflict, by the way.

This is weird by itself, but I could live with that given that DatasetSeries has no distributions. But when they would have their own distributions, essentially becoming full-fledged datasets (which, semantically, they are, but...) it wouldn't be clear if frequency and modified on that series should be interpreted in the series way (how often, or when are datasets added to the series) or in the dataset way (how often or when is data updated in the dataset served through the data service).

matthiaspalmer commented 1 year ago

I agree with the semantic conflict... In fact I think this is a violation of RDFS principles in the W3C spec. To my knowledge you are not supposed to redefine the semantics of a property on a subclass like this. Maybe narrow the semantics but not change it altogether. But, well, I guess you have to choose your fights.

Let us for a moment ignore the attempt to redefine the semantics of the properties and consider the original semantics of, let's say, frequency (dcterms:accrualPeriodicity). The DCAT3 specification says in the usage note: "The value of dcterms:accrualPeriodicity gives the rate at which the dataset-as-a-whole is updated."

To me this sounds like it is about when the data changes in the dataset, i.e. continuously, daily, yearly etc. Whether this is accomplished via a single distribution being updated, via data changing in a service, or via new datasets being added as members of a dataset series does not matter! So, from my perspective we do not need to change the semantics. It will be the same anyway, i.e. when the data is updated in the dataset (because that is how I interpret "dataset-as-a-whole is updated").

I think the same argument can be made for dcterms:modified, since the property is said to refer to the actual resource (the dataset itself), not the cataloged resource (the metadata level).
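
To make that reading concrete, here is a small sketch. The ex: namespace, the resource names and the dates are invented for illustration; the frequency value is taken from the EU frequency authority table.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/> .

# The series "as a whole" changes once a year, which here happens
# to be whenever a new yearly member dataset is added.
ex:series a dcat:DatasetSeries ;
    dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/ANNUAL> ;
    dct:modified "2022-01-15"^^xsd:date .

ex:data-2021 a dcat:Dataset ;
    dcat:inSeries ex:series ;
    dct:issued "2021-01-15"^^xsd:date .

ex:data-2022 a dcat:Dataset ;
    dcat:inSeries ex:series ;
    dct:issued "2022-01-15"^^xsd:date .
```

Under this interpretation, the annual frequency and the modified date on ex:series say the same thing whether you read them "the series way" or "the dataset way".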

jakubklimek commented 1 year ago

I see your point now. OK, let's see what others think. Seen like this, Alternative 4 would also be fine by me.

bertvannuffelen commented 11 months ago

I had not yet responded to this issue.

Personally I would vote for option 3 as the best approach.

An API is for me a separate entity from the dataset. It is not because the API disappears that the dataset disappears or vice versa.

I think the strong connection between Distribution and a DataService you are making is a strong limitation you impose yourself.

For me a Distribution is a "file-based sharing of a dataset that a publisher wants to maintain" (*), while a DataService is any "smart/complex way of providing access to the dataset in an end-user friendly way".

In this reading I leave it to the publisher of the data to decide in which ways to make the data accessible. If this is only by files, then in the DCAT metadata I expect only Distributions, while if it is only via an API, I expect only a DataService. (**) If by an ordered set of snapshots, I expect a collection expressed as a DatasetSeries with a Dataset for each snapshot.

I do not make any assumption on the relationship between a Distribution and the DataService. In many cases, a distribution could hold more data than the API: e.g. a Linked Data endpoint does not need to contain the details of each concept in a concept scheme, but it is sensible to add that to a dump to ensure that the snapshot is self-contained. Also, when a Distribution is a snapshot, then an API with the live actual data will not coincide with it.

Assuming that there must always be a strong connection between a Distribution (in the form of a file) and a DataService (API) is for me an oversimplification of the practice. Therefore I feel more comfortable with the connection between Dataset and DataService. That is less problematic in semantics and maintenance.

So before going into the question of whether or not Distributions for DatasetSeries should be allowed, I would first like to understand whether we are aligned on the expectations for Distributions and DataServices. And then try to understand whether your request comes from a stricter interpretation of DatasetSeries (as a representation of a collection) or from a stronger connection between Distributions and DataServices.

(*) I do not consider a Distribution as a representation of the "payload" of a DataService. That would be too fine-grained for me.

(**) In API-first developments the file-based distribution of the data is never considered. For personal sensitive data it is even a best practice not to do so, as it brings additional security and GDPR considerations.

jakubklimek commented 11 months ago

I can provide an example from Czechia that I discussed just today with one of our Ministries, which aligns exactly with what @matthiaspalmer described in the issue, just to illustrate that these things really happen:

The ministry has an API for querying the registry of schools in Czechia. There are also significant points in time (4 times a year) when they want to create a snapshot of the contents of the API and publish these snapshots as a time-based dataset series. Now they want to somehow connect the dataset series of snapshots to the data service representing the API.

In DCAT-AP-CZ, originally based on DCAT 1, we do not support stand-alone APIs. Contrary to @bertvannuffelen, we view APIs really only as a means to provide certain datasets, therefore we always have Dataset =distribution=> Distribution =accessService=> Data Service =servesDataset=> Dataset. In this setting, Alternative 3 is not doable and I lean towards alternative 5, i.e. a "topical" series containing a dataset with the data service distribution, and the time-based dataset series of snapshots. However, this might just be a DCAT-AP-CZ limitation and if we agree on another practice here, it may be adjusted.

bertvannuffelen commented 11 months ago

In the region of Flanders, we also included in the metadata catalogue all the APIs that provide access to personal and sensitive data, e.g. https://www.vlaanderen.be/datavindplaats/catalogus/onderneming-geeffiscaleschuld-versie-0200. These are just APIs with a relationship to a dataset. In this case it is even more tricky: the service is operated and maintained by the region of Flanders, while the dataset is operated and maintained by the Federal Government. It is thus cross-organisational metadata.

Those services fall in the category of the DGA. They are non-public access services about non-public datasets, yet described in DCAT-AP. These datasets will never be exchanged as a file (or in portions of a file). Only derived datasets that publish statistical/public insights will be published, but then these derived datasets will be properly described as independent datasets, I assume.

@jakubklimek, is your example like this?

ex:ds1 a dcat:DatasetSeries .

ex:d1 a dcat:Dataset ;
   dcat:inSeries ex:ds1 ;
   dcat:next ex:d2 ;
   dcat:distribution ex:d1-2020-q1 .

ex:d1-2020-q1 a dcat:Distribution ;
   dcat:accessURL <http://api.schools.cz/download/2020-q1> .

ex:d2 a dcat:Dataset ;
   dcat:inSeries ex:ds1 ;
   dcat:distribution ex:d1-2020-q2 .

ex:d1-2020-q2 a dcat:Distribution ;
   dcat:accessURL <http://api.schools.cz/download/2020-q2> .

# The live dataset behind the API, updated daily.
ex:dactual a dcat:Dataset ;
   dcat:inSeries ex:ds1 ;
   dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/DAILY> .

ex:api-2 a dcat:DataService ;
   dcat:servesDataset ex:dactual .

In this example, ex:ds1 is just a collection of datasets, while ex:dactual is a dataset that is updated daily. According to DCAT-AP-CZ, I expect you would add an additional Distribution to ex:dactual, correct?

jakubklimek commented 11 months ago

Yes, regarding the distribution, we would add an extra ex:dactual distribution, pointing to ex:api-2 using dcat:accessService. Regarding the grouping into DatasetSeries, it would be more complex in Alternative 5 - ex:dactual would not be part of the snapshot series, and ex:dactual and ex:ds1 would be in another series ex:top, like this:

ex:ds1 a dcat:DatasetSeries .

ex:d1 a dcat:Dataset ;
   dcat:inSeries ex:ds1 ;
   dcat:distribution ex:d1-2020-q1 .

ex:d1-2020-q1 a dcat:Distribution ;
   dcat:accessURL <http://api.schools.cz/download/2020-q1> .

ex:d2 a dcat:Dataset ;
   dcat:inSeries ex:ds1 ;
   dcat:distribution ex:d1-2020-q2 .

ex:d1-2020-q2 a dcat:Distribution ;
   dcat:accessURL <http://api.schools.cz/download/2020-q2> .

# The live dataset is not a member of the snapshot series ex:ds1.
ex:dactual a dcat:Dataset ;
   dcat:distribution ex:dactual-api ;
   dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/DAILY> .

ex:dactual-api a dcat:Distribution ;
   dcat:accessService ex:api-2 .

ex:api-2 a dcat:DataService ;
   dcat:servesDataset ex:dactual .

#Alternative 5: another dataset series for "topical grouping"

ex:top a dcat:DatasetSeries .
ex:ds1 dcat:inSeries ex:top .
ex:dactual dcat:inSeries ex:top .

bertvannuffelen commented 8 months ago

@matthiaspalmer and @jakubklimek, do we agree that this topic is sufficiently addressed in this issue, and that the current DCAT-AP can handle both cases, but that the choice is up to implementers?

Given that conclusion, can we also close this issue in the release of DCAT-AP 3?

jakubklimek commented 8 months ago

I agree that the current DCAT-AP allows for all the alternatives discussed here, and that more restrictive rules such as the alternatives proposed by @matthiaspalmer can be imposed on a lower level like DCAT-AP-CZ or DCAT-AP-SE.

However, the original issue was if DCAT-AP itself should provide guidance on which of those alternatives are supported and which are not. There does not seem to be a strong interest (other than by us 3) in this topic - the remaining question for me is what the resolution should be so the issue can be closed. Since it seems that none of the restrictions will make it into DCAT-AP, should there be at least a note pointing to the (closed) issue saying something like "Usage of data services in dataset series was discussed here, no guidance is provided on purpose, feel free to raise/reopen issue if you think there should be further discussion"?

matthiaspalmer commented 8 months ago

TL;DR: Let's go with alternative 3 and make it a strong recommendation, if not even a requirement, in DCAT-AP3.

If we do not provide guidance, different solutions will be implemented in different member states. E.g. I am responsible for suggesting a solution for DCAT-AP-SE3.0 within a month regarding this and other issues. (At MetaSolutions we will also implement support in EntryScape, likely within a year.)

A lack of a recommendation around this issue in DCAT-AP3 will be problematic for implementors. I suspect it will be worst for data.europa.eu. Last time I had a conversation with them they were still struggling with having independent data services (not being referred to from at least one dataset). Adding the uncertainty of how that will relate to dataset series will not make things easier for them. I.e. to be compliant they would have to support ALL the different possibilities, which unfortunately leads to poor user interfaces, e.g. people won't be able to easily find the API for a dataset series.

I want to reiterate that I think the point of having an application profile is to restrict the generically defined classes and properties into a configuration that fits a specific need. That includes making things more concrete for implementors. I.e. I think we have to be a bit bold here and suggest something. The worst-case scenario is that we have to backtrack in the future if there are scenarios that cannot be fulfilled.

Also, we can provide it as a strong recommendation, not necessarily a hard requirement.

After reading the Candidate Recommendation of DCAT3 I realized that dcat:distribution is NOT among the properties listed that are available for reuse from the super-classes, see Dataset Series. Hence, this clearly means that I am in the minority in my idea of how to use Dataset series in a creative way. Therefore, I change my mind and do not suggest alternative 4 anymore. (Although I still find it problematic with the approach taken in DCAT3, e.g. redefining semantics of inherited properties on a subclass and related issues.)

To conclude, I suggest we recommend alternative 3 whenever a dataservice provides ALL the data contained in a dataset series, i.e. to point from the Dataservice to the Dataset series.

  1. Not recommended
  2. Not recommended
  3. Preferred
  4. Not recommended
  5. Allowed

I am not sure 5 needs to be discussed at all though since in issue 275 we agreed that having dataset series in dataset series is allowed.

I assume that the three of us who have cared about this issue (up to now) are in agreement. I.e. @bertvannuffelen originally preferred 3, I changed my mind, and @jakubklimek can take approach 5 as long as Czechia is not supporting independent data services (or stick with solution 5 in the future as well if that is preferred).

bertvannuffelen commented 7 months ago

In the draft release, section 14.2 now includes an extra bullet that incorporates alternative 3 as a recommendation.

jakubklimek commented 7 months ago

@bertvannuffelen Isn't there a problem connected to #289 ? The recommendation now states that a Data service should point to a Dataset series using dcat:servesDataset. However, that property has a range dcat:Dataset. DCAT-AP DatasetSeries is currently not a subclass of DCAT-AP Dataset (as per #289).
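
The consequence can be sketched as follows, assuming plain RDFS range entailment is applied to the catalogue data. The ex: names are hypothetical; the range statement is the one declared in the DCAT vocabulary.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Declared in the DCAT vocabulary:
dcat:servesDataset rdfs:range dcat:Dataset .

# Catalogue data following the new recommendation:
ex:series a dcat:DatasetSeries .
ex:api a dcat:DataService ;
    dcat:servesDataset ex:series .

# Entailed by the range declaration, whether or not the profile
# treats DatasetSeries as a subclass of Dataset:
ex:series a dcat:Dataset .
```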

bertvannuffelen commented 7 months ago

Fair point. Two approaches to look at it:

The last option we could add with an additional sentence in the recommendation.

jakubklimek commented 7 months ago

@bertvannuffelen Option number 2 seems like an explicit statement of what is the meaning of using that property in this case. Good at least for acknowledging that we are aware of the situation and its consequences. There is always option number 3 that would also resolve #289 and that is to make a DCAT-AP Dataset a subClassOf DCAT-AP DatasetSeries, but that is a discussion for #289.