SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

Application of DataService Class #144

Closed fabiankirstein closed 2 years ago

fabiankirstein commented 4 years ago

This topic is not new, but it is highly relevant and not yet solved (https://github.com/SEMICeu/DCAT-AP/issues/109, https://github.com/SEMICeu/DCAT-AP/issues/100). For practical application, more precise use cases and application guidelines are required. The overall question would be: how, and in which cases, should a data provider include the DataService class in their data?

DCAT-AP 2 includes the DataService class in two places: Catalog and Distribution.

DataService in a Distribution: This case is fairly clear and makes sense, although it raises some questions too. Some of them are already addressed in this issue: https://github.com/w3c/dxwg/issues/1126 However, some concrete examples could help, especially on how the property dcat:servesDataset relates to other datasets in the same catalogue.
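To make the question concrete, here is a minimal sketch of one possible situation, not a statement of what DCAT-AP prescribes. It assumes a distribution points to a service with dcat:accessService and that the service lists two datasets with dcat:servesDataset. All ex: identifiers and the helper functions are hypothetical, and triples are modelled as plain tuples rather than with an RDF library.

```python
# RDF-like triples as (subject, predicate, object) tuples.
# All ex: identifiers are hypothetical and only serve the illustration.
TRIPLES = [
    ("ex:catalog", "dcat:dataset", "ex:dataset-a"),
    ("ex:catalog", "dcat:dataset", "ex:dataset-b"),
    ("ex:dataset-a", "dcat:distribution", "ex:dist-a-api"),
    ("ex:dist-a-api", "dcat:accessService", "ex:service"),
    ("ex:service", "dcat:servesDataset", "ex:dataset-a"),
    ("ex:service", "dcat:servesDataset", "ex:dataset-b"),
]


def objects(triples, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def datasets_reachable_via_distributions(triples, dataset):
    """Datasets served by any service attached to the dataset's distributions."""
    served = set()
    for dist in objects(triples, dataset, "dcat:distribution"):
        for service in objects(triples, dist, "dcat:accessService"):
            served.update(objects(triples, service, "dcat:servesDataset"))
    return sorted(served)


print(datasets_reachable_via_distributions(TRIPLES, "ex:dataset-a"))
print(datasets_reachable_via_distributions(TRIPLES, "ex:dataset-b"))
```

In this sketch ex:dataset-b is served by the service but has no distribution pointing to it, so starting from ex:dataset-b a client would not discover the service. That asymmetry is the kind of case a guideline would need to cover.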

DataService in a Catalog: Here a DataService becomes a high-level entity. Does that mean that users can browse for services in a catalogue, like they can browse for datasets? If so, the DCAT-AP specification for DataService lacks some important properties that are present on datasets, like issued or modified. Furthermore, is it likely that a data service referenced in a distribution is also referenced in a catalogue? What would be the purpose of that?

bertvannuffelen commented 3 years ago

This issue also touches on the discussion about distinguishing a distribution from a data service. We probably cannot draw a clear line between them, but the following lines of thought may be helpful.

A Distribution is an entity that cannot exist on its own: it only exists in the context of a Dataset. So we have to look at the relationship between Dataset and Distribution to understand them. For me, a Dataset is an abstract notion referring to a bag of data. A Distribution of that Dataset is the digital materialization of the bag of data. It contains all the information the dataset has. To access the Dataset, one has to go through the access method a Distribution offers. The simplest form of digital materialization is a file. By downloading a file one obtains a copy of the distribution, and thus a complete materialization of the dataset. It is like buying a paper book in a shop. That book is now my copy, which I can manipulate in the way I want (within the legal restrictions connected with this book).

From the moment I need a more complex mechanism to access this bag of data, e.g. a smart agent/client, we shift into the service perspective. E.g. most REST APIs will not offer direct access to the complete bag of data with a simple request; instead, they require the client to iterate through the data. Sometimes one cannot even download the complete dataset, because the bag of data is constantly in flux (e.g. consider any social network receiving millions of updates a second). Back to my metaphor: if the book is only available as one page per day in a journal, then that journal is the service delivering you the book. So you need to buy a subscription to the journal to read the book, and if you want to re-read it you need to keep all copies of that journal.

Another aspect is a form of independence of a service from the bag of data it actually serves. A dataset is a collection of data according to some vocabulary/schema. If that vocabulary changes severely (v1 -> v2), this has a severe impact on the metadata descriptions of the dataset and its distributions. Namely, a v1 distribution can no longer be seen as a valid distribution of a v2 dataset. It would violate the natural relationship that a distribution is the materialization of the bag of data described by the dataset. A valid approach would then be to create a new dataset description for the v2 schema, and relate it to the v1 dataset with a dct:hasVersion relationship.
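A minimal sketch of this versioning argument, under stated assumptions: it uses dct:conformsTo as the way both the dataset and the distribution declare their schema, which is one possible modelling choice rather than something this thread prescribes. The class names, the ex: identifiers and the direction in which has_version is filled in are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    iri: str
    conforms_to: str  # assumed to carry the dct:conformsTo value


@dataclass
class Dataset:
    iri: str
    conforms_to: str  # assumed to carry the dct:conformsTo value
    distributions: List[Distribution] = field(default_factory=list)
    has_version: List[str] = field(default_factory=list)  # dct:hasVersion targets


def mismatched_distributions(dataset):
    """IRIs of distributions whose schema differs from the dataset's schema."""
    return [d.iri for d in dataset.distributions
            if d.conforms_to != dataset.conforms_to]


v1_csv = Distribution("ex:forests-v1-csv", "ex:schema-v1")
v2_csv = Distribution("ex:forests-v2-csv", "ex:schema-v2")

# Problematic: the v2 dataset description still carries the v1 file.
mixed = Dataset("ex:forests-v2", "ex:schema-v2", [v1_csv, v2_csv])

# Suggested alternative: two dataset descriptions related by dct:hasVersion.
v2 = Dataset("ex:forests-v2", "ex:schema-v2", [v2_csv])
v1 = Dataset("ex:forests-v1", "ex:schema-v1", [v1_csv], has_version=[v2.iri])

print(mismatched_distributions(mixed))
print(mismatched_distributions(v1), mismatched_distributions(v2))
```

The check flags the v1 file only in the mixed description. Once the two versions are separate dataset descriptions, each distribution again materializes exactly the dataset it belongs to.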

Services, however, tend to be treated more independently of the bag of data they actually serve. They might expand, change and reform themselves in a way that is not visible from the outside. E.g. some services offer content negotiation, allowing a smart client to select the schema it wants. So there is one service, but it actually serves multiple datasets. A service might even undergo a semantic shift without changing one line of code, if the data being served is structurally equivalent. E.g. a geospatial service offering access to forests can provide access to data under the definition "a forest is a green area with at least 2 trees on it" or "a forest is an area of at least 0.5 ha completely covered with trees".
On the service side there is no difference; the data supplied, however, is incomparable.
This independence from important semantic aspects makes a service attractive as a reusable component in its own right. A file as a distribution of a dataset is by definition not that; for a distribution, the connection with the semantics of the dataset is important and is in some sense part of its identity.

Beyond these general considerations, we should also consider whether there are requirements for data services in the context of (Open) Data Portals. What should people know about these services beyond a landingPage/endpointDescription? Or even: should services be made visible at all, or should we focus only on Datasets? Maybe we can learn from the INSPIRE community? They have been sharing descriptions of services for a while, and there may be interesting experiences to learn from.

fabiankirstein commented 3 years ago

@bertvannuffelen Thanks for the detailed statement. If I understand you correctly, you made the case for why a DataService can be a property of a distribution, of a catalog, or of both. Your argument makes a lot of sense. However, if that is the intention, I still do not fully understand how it applies in DCAT-AP. Here are the issues I still have with the respective interpretations.

DataService in a Distribution: The way you describe it, this would only be some kind of "advanced" distribution, just another way to access the data of the dataset. I think it is valid to have this. But then many properties of DataService are redundant or do not make sense. E.g. dcat:servesDataset, dct:accessRights or dct:license would be redundant, since the distribution/dataset already contains this information.

DataService in a Catalog: This looks clear to me. A catalog can have Datasets and DataServices. But then DataServices are missing a lot of metadata needed to handle them in the same way as datasets. DCAT actually proposes many more properties than DCAT-AP does (https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset). And what is the advantage of putting a DataService directly in a catalog, if you can get the same result by putting the service in a distribution of a dataset? I am really thinking from a user's perspective here. Usually you get a bunch of datasets. What would be the benefit for a real end user of having this high-level distinction?

I would be very happy for further discussions about this topic.

bertvannuffelen commented 3 years ago

@fabiankirstein I start with your second comment.

DataService in a Catalog: This looks clear to me. A catalog can have Datasets and DataServices. But then DataServices are missing a lot of metadata needed to handle them in the same way as datasets. DCAT actually proposes many more properties than DCAT-AP does (https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset). And what is the advantage of putting a DataService directly in a catalog, if you can get the same result by putting the service in a distribution of a dataset? I am really thinking from a user's perspective here. Usually you get a bunch of datasets. What would be the benefit for a real end user of having this high-level distinction?

This is a topic on which we, as the DCAT-AP community, should agree on a common strategy for reading the DCAT-AP specification. It is about reusing DCAT.

Either we state that DCAT-AP is a self-contained document containing only the properties that are within the scope of DCAT-AP. According to your reading, this indeed means that a DCAT-AP:DataService does not have, e.g., a publisher. Only a Dataset has one.

Or we apply the reading that, where DCAT-AP does not express any additional constraints or specific usage conditions for a property, users of DCAT-AP can use properties that are described in DCAT. According to that reading, a DCAT-AP:DataService can have a publisher.

On the one hand, the self-contained approach seems to give much more trust; on the other hand, it creates a non-negligible amount of editorial work to copy specifications around, and it makes it hard to see what actually distinguishes DCAT-AP from DCAT. E.g. DCAT 3.0 is currently in design, which means that all its editorial changes would have to be reflected in a self-contained document.

Personally I lean towards the second approach, because it offers a good guideline for an ecosystem of specifications. For instance, if a Member State (MS) wants to register the publisher of a DataService, it can do so using dct:publisher, because DCAT states that this is the property to be used, even though it is not mentioned in the DCAT-AP specification. That information does not render the MS catalog incompatible with DCAT-AP.

bertvannuffelen commented 3 years ago

Now on your second thought in this reply:

DataService in a Catalog: This looks clear to me. A catalog can have Datasets and DataServices. But then DataServices are missing a lot of metadata needed to handle them in the same way as datasets. DCAT actually proposes many more properties than DCAT-AP does (https://www.w3.org/TR/vocab-dcat-2/#Class:Dataset). And what is the advantage of putting a DataService directly in a catalog, if you can get the same result by putting the service in a distribution of a dataset? I am really thinking from a user's perspective here. Usually you get a bunch of datasets. What would be the benefit for a real end user of having this high-level distinction?

I like your perspective, namely that of the user. Reformulating: what information do we want to collect that is beneficial for our users? I see two approaches in the (Open) data portal communities: a) we collect as much as possible in a catalog, and share whatever our publishers can provide; and b) we focus on the information we would like to share and which can create a good offering for our users.

The first approach will typically relax as many constraints as possible, because one can always find an exception that does not fit. The second tends to impose stricter rules, consciously (or sometimes not) ruling out cases.

The first approach corresponds to a DCAT exchange in our context. That would be the bare minimum. DCAT-AP shifts towards more constraints, but in practice many constraints are lightweight.

This brings me to your question: what kind of information do we want to share in our (open) data portal? In practice, one can split the data users into 2 categories: users that search for a downloadable file, and users that search for an API. These are really distinct users. So far DCAT-AP has been tilted mostly towards the first category. Even the definition of Distribution (see https://w3c.github.io/dxwg/dcat/#Class:Distribution) somehow indicates a closed, complete unit. Properties like downloadURL, checksum, etc. do not fit very well with the notion of a data API. But of course the term Distribution can be interpreted so widely that one could fit a data API under it.

Let's consider the other kind of users. They are not concerned with the dataset as a closed, verifiable, downloadable file; they want to connect to the data API. They want to know the conditions for being granted access, which queries one can run, and which error messages can be returned. In their world, the notion of a dataset with the API as one of its distributions often does not exist. The only thing that exists is the API.

One also has to consider the word Service in DataService. The most used APIs are those that deliver a service: they do not provide data about a single entity in a specific context, but combine information so that data flows are facilitated.

Because these two perspectives exist, I believe we as the DCAT-AP community should discuss how we can best serve them. In the end, the DCAT-AP community should consider whether the collected data can be presented to the above user groups and whether they feel supported. Today, the REST API community uses OpenAPI for its specifications and builds API documentation portals from them, creating an ecosystem of data service portal descriptions parallel to the DCAT-AP based portals. That is, for me, an indication that there is room and need for a discussion on how these can be integrated into the DCAT-AP ecosystem. Our challenge as the DCAT-AP community is to make choices that enable publishers of datasets, downloadable files and APIs to describe their assets in such a way that a smart assistant for both user groups can be built.

To give you examples of services that do not have clear dataset-distribution notions:

bertvannuffelen commented 3 years ago

@fabiankirstein

DataService in a Distribution: The way you describe it, this would only be some kind of "advanced" distribution, just another way to access the data of the dataset. I think it is valid to have this. But then many properties of DataService are redundant or do not make sense. E.g. dcat:servesDataset, dct:accessRights or dct:license would be redundant, since the distribution/dataset already contains this information.

Suppose a publisher considers describing a REST API both as a Distribution and as a DataService; then we need to be attentive to good documentation practice.

There are 3 options:

In the first case, A), the challenge is to describe the difference between Distribution and DataService as clearly as possible, because the intention is to achieve an A.1) situation. Unfortunately, in practice we might observe A.2) happening more often than wanted. If A.2) has to be avoided, quality measures should accompany the explanation. E.g. if the accessURL or downloadURL is very similar to the endpointURL, that indicates a problem.

If option B.1) is chosen, then we have to align the semantics of the common properties. For instance, do the language of a distribution and the language of a data service denote the same thing? For a distribution, in the context of a downloadable file, language only has meaning as the language used in the data. For a DataService, language would mean the language of the terms in the payload. E.g. consider a German API returning data in English: the DataService language is then German, but the dataset/distribution language is English. So merging both into one means that one has to define what the language property refers to. The argument behind this is that anyone, and a machine in particular, should be able to retrieve the data that is needed by querying the structure, and not by interpreting the combined (textual) values of multiple properties. Rules like "if there is a downloadURL and a format defined, then we are considering a file-based exchange, and in that context the language is the language of the data" require exactly that kind of interpretation. It is then better to create a subclass FileBasedDistributions in which the notions are made clearer.

Probably this is not black and white; rather, there is a need to describe the cases where A or B is applicable. And then probably not all properties of DCAT are equally important in all cases. E.g. the language of the API is probably less important than the language of the data it is supplying.

andrea-perego commented 3 years ago

@fabiankirstein , to complement @bertvannuffelen 's points:

Some practical examples of what is being discussed are provided by GeoDCAT-AP, which has supported services since its first release in 2015. E.g.:

bertvannuffelen commented 2 years ago

During the WG meetings of 15 Sept 2021 and 21 Oct 2021, the WG was presented with an approach to address the questions related to this issue. The result is an updated UML diagram and a usage guideline for Datasets, Distributions and Data Services.