Simple way to add format to DataServices without the need of Distributions

init-dcat-ap-de commented 2 years ago

In https://github.com/w3c/dxwg/issues/1055 (and https://github.com/w3c/dxwg/issues/1381) there is the idea to use dcat:Distributions within dcat:Datasets, added to the dcat:DataService via dcat:servesDataset.
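For illustration, a minimal Turtle sketch of that idea could look as follows. The service IRI, the blank-node label and the endpoint URL are hypothetical and are not taken from the linked issues; the point is only that the format ends up on a Distribution of the served Dataset rather than on the DataService itself.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical service; the format is NOT stated here.
<https://example.org/service/1> a dcat:DataService ;
  dct:title "A Title" ;
  dcat:endpointURL <https://example.org/api/> ;
  dcat:servesDataset _:dataset .

# Blank-node Dataset that exists only to carry a Distribution with a format.
_:dataset a dcat:Dataset ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:accessURL <https://example.org/api/> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/JSON>
  ] .
```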

While this is possible, it doesn't seem suitable for the context of DCAT-AP and data portals. Datasets and Distributions come with their own sets of mandatory elements. A portal normally also offers a page for every Dataset it contains. Those blank-node datasets would pollute the catalog, in my opinion.

Since dcat:endpointDescription has the cardinality of *, we could simply advise the following:

_:dataservice a dcat:DataService ;
  dct:title "A Title" ;
  dct:license <http://dcat-ap.de/def/licenses/dl-zero-de/2.0> ;
  dcat:servesDataset _:dataset-123 ;
  dcat:endpointURL <https://example.org/api/> ;
  dcat:endpointDescription <https://example.org/api/wfs?service=WFS> ;
  dcat:endpointDescription [ 
    dct:format <http://publications.europa.eu/resource/authority/file-type/JSON> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/XML> ;
  ]
.

Of course, we could formalize it by creating a class like dcatap:MicroEndpointDescription.

I think something like this might be a good idea for DCAT-AP. We have the narrower use case of data portals, and there a simple, EU-wide solution would be useful. This blank node (or the new class) could have other recommended properties, based on the needs of the portal user who searches for a DataService.

I would also question the need for Distributions in the context of a data portal, as in: A Dataset should only have downloadable Distributions and DataServices. The DataService is linked to the Dataset via servesDataset. A Distribution that solely links to the DataService does not add a lot of useful information.

(We are currently evaluating how DataServices can be modeled the best way and have not yet come to a conclusion...)

bertvannuffelen commented 2 years ago

@init-dcat-ap-de

In the regional profile DCAT-AP Flanders (Belgium), the technical information about what a data service can do or provide has not been included as machine-processable descriptions. There, the idea was to reinforce the current practices of data service design, so that data services are documented as much as possible using the good practices of the API community, e.g. OpenAPI for RESTful services.

Whether the service speaks XML/RDF/JSON/JSON-LD is considered less important than whether the API is well documented. Therefore, according to the DCAT-AP Flanders profile, dcat:endpointDescription points to the OpenAPI or SOAP description documenting the API. The objective of the DCAT-AP Flanders profile is to make datasets and data services more findable by reinforcing the documentation efforts that already take place, not by replacing good practices with a metadata approach.
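As a rough sketch of that approach (not an excerpt from the DCAT-AP Flanders profile), the description could look like this in Turtle. The service IRI and the location of the OpenAPI document are assumptions made up for the example.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical service whose endpointDescription points at its API documentation.
<https://example.org/service/1> a dcat:DataService ;
  dct:title "A Title" ;
  dcat:endpointURL <https://example.org/api/> ;
  # Assumed location of the OpenAPI document; formats and operations are
  # described there, not in the catalog metadata.
  dcat:endpointDescription <https://example.org/api/openapi.json> .
```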

With your proposal

_:dataservice a dcat:DataService ;
  dct:title "A Title" ;
  dct:license <http://dcat-ap.de/def/licenses/dl-zero-de/2.0> ;
  dcat:servesDataset _:dataset-123 ;
  dcat:endpointURL <https://example.org/api/> ;
  dcat:endpointDescription <https://example.org/api/wfs?service=WFS> ;
  dcat:endpointDescription [ 
    dct:format <http://publications.europa.eu/resource/authority/file-type/JSON> ;
    dct:format <http://publications.europa.eu/resource/authority/file-type/XML> ;
  ]
.

you are following a similar approach, namely stating that there is an endpointDescription which specifies the formats JSON and XML. The challenge ahead is then to give meaning to the properties in an endpoint description. The question arises whether we are re-expressing what is already expressed in https://swagger.io/specification/ (Request Body Object), in the HTTP content negotiation specification (https://www.gcloud.belgium.be/rest/#media-types), or even in https://www.hydra-cg.com/spec/latest/core/.

It is an open question if this machine readable decomposition aids the data community or if we would rather try as Open Data Portals to engage developers in publishing quality API documentation (or self-documenting API interfaces). The API documentation is moving fast, new properties and details are emerging at a high speed, and then I think it is simpler and better for the data community as a whole to endorse good API documentation practices. This does not exclude that we could lift some information from the API specifications or, vice versa, impose restrictions on the information in the API documentation (see e.g. the Info Object in the OpenAPI specification: this is exactly the information DCAT-AP captures, and thus the two should be in line with each other). But today I personally would be happy if APIs got documented properly (e.g. in a managed HTML representation). That is already a challenge in its own right.

Some references (in Dutch though):

bertvannuffelen commented 2 years ago

On distributions: I personally do not see them as an auxiliary means to describe the content of a data service. For me, distributions are intentionally shared representations of a dataset. In a data service, the representation of the data can change over time (one can add new properties, change the structure, etc.). I would not reflect all of this in a distribution, because the goal of the data service is precisely to release me from those fixed agreements.

For me, a distribution should support offline use. If the intent is that the data can only be used online (with an active network connection), I do not expect to have a distribution. Then there is solely a data service.

Therefore, even if the data service allows dumping the whole content to disk (e.g. a SPARQL endpoint), I would not describe a distribution of that dataset if this is not the intent of the data service. Users might start to use it in a way that is still feasible at the current size but no longer feasible later on.

init-dcat-ap-de commented 2 years ago

I agree with your statement about Distributions.

> It is an open question if this machine readable decomposition aids the data community or if we would rather try as Open Data Portals to engage developers in publishing quality API documentation (or self-documenting API interfaces). The API documentation is moving fast, new properties and details are emerging at a high speed, and then I think it is simpler and better for the data community as a whole to endorse good API documentation practices.

I don't see this as an either-or: the API should be well documented, but data portals should also get the metadata that researchers are interested in when searching for a suitable API.

To be honest, it is only a guess that dct:format is an interesting data point. In an ideal world, we would validate the needs of data portal users.

init-dcat-ap-de commented 2 years ago

data.europa.eu has already implemented dcat:DataService (e.g. https://data.europa.eu/data/datasets/eu-whoiswho-the-official-directory-of-the-european-union?locale=en).

As far as I can see, the existence of the class dcat:DataService does not provide a lot of additional information there. All information in the class is already encoded in the Distribution: the dcat:endpointURL can be found in dcat:accessURL, and the fact that the Distribution is delivered by a service is expressed through dcat:accessService.

mayaborges commented 2 years ago

During the ongoing work on the Danish national data portal we have also encountered this need.

Our user tests have in fact directed our focus on the usefulness of filtering datasets according to the data formats in which the data is available.

Building on DCAT 2.0, if a group of datasets is available in x different data formats, we put an emphasis on getting x distributions added to each dataset (because each distribution can only have one data format).

However, some of these “distributions” are actually shared data services, e.g. a shared API to a group of datasets. Furthermore, these data services are often designed to deliver the relevant data from a given dataset in the data format requested by the user, meaning that the "distributions" don't exist, except when asked for. And, to continue on the point made by @bertvannuffelen, that seems to stretch the definition of 'distribution' beyond its breaking point.

Aside from the semantics being questionable and thus possibly leading to confusion, having to add 5 distributions to over 1000 datasets with no additional information beyond format, rather than saying that the service that serves them can provide 5 different formats, is just inelegant and does present us with some issues in presenting the metadata to the user. But that is what we have so far been forced to do.
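To make the duplication concrete, here is a small Turtle sketch of that workaround, reduced to two datasets and two formats. All IRIs and blank-node labels are hypothetical; the real catalog is not shown in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# One shared (hypothetical) API that serves both datasets.
_:service a dcat:DataService ;
  dcat:endpointURL <https://example.org/api/> ;
  dcat:servesDataset _:datasetA , _:datasetB .

# Each dataset repeats one Distribution per format; only dct:format differs.
_:datasetA a dcat:Dataset ;
  dcat:distribution
    [ a dcat:Distribution ;
      dcat:accessURL <https://example.org/api/> ;
      dcat:accessService _:service ;
      dct:format <http://publications.europa.eu/resource/authority/file-type/JSON> ] ,
    [ a dcat:Distribution ;
      dcat:accessURL <https://example.org/api/> ;
      dcat:accessService _:service ;
      dct:format <http://publications.europa.eu/resource/authority/file-type/XML> ] .

_:datasetB a dcat:Dataset ;
  dcat:distribution
    [ a dcat:Distribution ;
      dcat:accessURL <https://example.org/api/> ;
      dcat:accessService _:service ;
      dct:format <http://publications.europa.eu/resource/authority/file-type/JSON> ] ,
    [ a dcat:Distribution ;
      dcat:accessURL <https://example.org/api/> ;
      dcat:accessService _:service ;
      dct:format <http://publications.europa.eu/resource/authority/file-type/XML> ] .
```

With 5 formats and over 1000 datasets, this pattern produces more than 5000 Distributions that carry no information beyond the format.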

So we would also be very interested in being able to add formats to data services, or to have other suggestions on how to solve this issue.

NatasaSofou commented 2 years ago

A similar need has been expressed for BregDCAT-AP Issue#1 and Issue#2

init-dcat-ap-de commented 1 year ago

Solution Summary: DCAT-AP 3.0.0 adds dct:format [0..*] to dcat:DataService.

https://semiceu.github.io/DCAT-AP/releases/3.0.0#DataService.format
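Applied to the example from the opening comment, this could look roughly as follows. It is a sketch under assumptions: the service IRI is made up, and the use of the EU file-type authority values simply mirrors the earlier example rather than quoting the specification.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Hypothetical service; formats are stated directly on the DataService,
# so no Distribution or blank-node endpointDescription is needed for them.
<https://example.org/service/1> a dcat:DataService ;
  dct:title "A Title" ;
  dct:license <http://dcat-ap.de/def/licenses/dl-zero-de/2.0> ;
  dcat:servesDataset _:dataset-123 ;
  dcat:endpointURL <https://example.org/api/> ;
  dcat:endpointDescription <https://example.org/api/wfs?service=WFS> ;
  dct:format <http://publications.europa.eu/resource/authority/file-type/JSON> ,
             <http://publications.europa.eu/resource/authority/file-type/XML> .
```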

SEMICeu / DCAT-AP

Simple way to add format to DataServices without the need of Distributions #217