SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

Provide more guidance on the relation between datasets and data services #273

Open matthiaspalmer opened 10 months ago

matthiaspalmer commented 10 months ago

We appreciate that chapter 14 provides guidance on when to use datasets and data services. For instance, the statement "Datasets are the conceptual entity denoting a collection of data." is clarifying.

In our understanding of DCAT and the statement above, all data services that are not purely "data processing services" should be explicitly related to one or several datasets. There are also practical reasons for expressing relations between datasets and data services, e.g. a dataset may be available both as a download (via a distribution) and as a service.

The specification states how to use the properties dcat:servesDataset and dcat:accessService. However, it is not entirely straightforward to understand when to use them just by reading chapter 14. There are also other properties that should be used in a special way when relations are expressed, e.g. should there be a dcat:downloadURL when there is a dcat:accessService?

Hence, we suggest adding the following guidance:

  1. Unless a data service is a pure "data processing service", at least one dataset in the same catalog should refer to it in a distribution via the property dcat:accessService.
  2. Pointing directly to a dataset from the data service via dcat:servesDataset is allowed but not necessary.
  3. If your data service provides data in multiple formats, you should express that by repeating dcterms:format on the data service. You may provide one distribution per format, but it is also acceptable to provide only one distribution with the format that is most widely used. (This is useful when you want to provide rich metadata on the distribution, and maintaining many distributions that differ only in format, with all other metadata fields repeated, proves administratively challenging.)
  4. Distributions that refer to a data service via dcat:accessService should never provide a dcat:downloadURL.
  5. If your data service serves only one dataset, the dcat:accessURL on the distribution and the dcat:endpointURL should be the same.
  6. If the data service serves many datasets, the dcat:accessURL may be more specific than the dcat:endpointURL of the data service, but only if it corresponds to a way to filter the data so that it matches the dataset at hand.
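
To make points 4 to 6 concrete, here is a minimal sketch of how they could be checked. It is only an illustration of the proposal, not part of DCAT-AP: the record classes, the function `check_distribution` and the example.org URLs are all hypothetical, real metadata would be RDF, and reading "more specific" in point 6 as "starts with the endpoint URL" is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataService:
    endpoint_url: str                               # dcat:endpointURL
    formats: list = field(default_factory=list)     # dcterms:format (point 3)


@dataclass
class Distribution:
    access_url: str                                 # dcat:accessURL
    download_url: Optional[str] = None              # dcat:downloadURL
    access_service: Optional[DataService] = None    # dcat:accessService


def check_distribution(dist: Distribution, serves_single_dataset: bool) -> list:
    """Return the proposed points (4, 5, 6) that a distribution deviates from."""
    issues = []
    service = dist.access_service
    if service is None:
        return issues  # plain file distribution: points 4 to 6 do not apply
    if dist.download_url is not None:
        issues.append("point 4: downloadURL given together with accessService")
    if serves_single_dataset:
        if dist.access_url != service.endpoint_url:
            issues.append("point 5: accessURL differs from endpointURL")
    # Assumption: "more specific" is approximated by a simple prefix test.
    elif not dist.access_url.startswith(service.endpoint_url):
        issues.append("point 6: accessURL is not a refinement of endpointURL")
    return issues


# Hypothetical example records.
api = DataService(endpoint_url="https://example.org/api", formats=["JSON", "XML"])
ok = Distribution(access_url="https://example.org/api", access_service=api)
bad = Distribution(
    access_url="https://example.org/api",
    download_url="https://example.org/dump.zip",
    access_service=api,
)

print(check_distribution(ok, serves_single_dataset=True))   # []
print(check_distribution(bad, serves_single_dataset=True))
```

The checks return findings rather than raising errors, in line with wording the points as SHOULD rather than MUST.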

bertvannuffelen commented 8 months ago

@matthiaspalmer I understand these guidelines but I suspect that they are too strict in many cases.

A counterexample for each point:

  1. In Belgium, there are cases where the Dataset is provided at the federal level, but the access API is provided at the regional level. The metadata for each is in a different catalogue.

  2. In the Belgian example, pointing from the service to the Dataset is easier than pointing to a specific Distribution. Also, in the region of Flanders, nearly everything is API first, so Distributions (a downloadable full collection) are often not provided, in particular when the data is closed or sensitive.

  3. This is mostly about the capabilities of the service. Again, if one considers this from an API-first design, then the format is an arbitrary choice. Typically the business context might fix XML or JSON, but any other format is typically a technical transformation on top, interpreting e.g. the Accept header. Therefore it is more a capability of a service than a connection with the distribution. For instance: I know REST APIs with a JSON payload whose site also offers a complete download file (a shapefile).

  4. In the case of geospatial services this is important: a service at a single URL can hold many datasets, each in its own layer. In that case the downloadURL documents the "exact parametrisation" in the service.

  5. The file download URL of a distribution could be on a big cloud drive (WeTransfer-like) while the service is on a different domain. I would not assume both are the same.

  6. Similar to case 5.
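
Counterexamples 4 and 5 can be illustrated with a small sketch: one URL that narrows a shared endpoint to a single layer, and one that lives on an unrelated domain. The endpoint, the file host, and the helpers `layer_url` and `refines` are hypothetical, and the query parameter name is only an example. The sketch deliberately does not say whether the parametrised URL belongs in dcat:accessURL or dcat:downloadURL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "https://example.org/geo/service"  # hypothetical dcat:endpointURL


def layer_url(endpoint: str, layer: str) -> str:
    """Build a URL that narrows a shared endpoint to the layer of one dataset."""
    # 'typeNames' is an illustrative parameter name; it depends on the service.
    return f"{endpoint}?{urlencode({'typeNames': layer})}"


def refines(endpoint: str, url: str) -> bool:
    """True if url has the same scheme, host and path as endpoint plus a query."""
    e, u = urlsplit(endpoint), urlsplit(url)
    same_location = (e.scheme, e.netloc, e.path) == (u.scheme, u.netloc, u.path)
    return same_location and bool(parse_qs(u.query))


roads = layer_url(ENDPOINT, "roads")
print(roads)                                                     # https://example.org/geo/service?typeNames=roads
print(refines(ENDPOINT, roads))                                  # True
print(refines(ENDPOINT, "https://files.example.net/roads.zip"))  # False
```

The last call is the cloud-drive situation: a perfectly valid file location that has no structural relation to the service endpoint.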

In general your guidelines amount to a uniformisation of how APIs and file-dump Distributions are shared for Datasets. For me that is not the objective of DCAT-AP. We cannot impose how an API must be developed, on which domain it must be hosted, or how it should be connected with the accessURL of the Distribution. We can, however, ensure that they are properly documented.

I hope you can see that when a project decides to provide only an API for a Dataset, the documentation of a Distribution becomes a virtual thing. Unless there is a need to identify a concrete snapshot or similar, a distribution is not part of the metadata. In general your guidelines make strong assumptions about the connection between a service and a file distribution.

If Sweden can impose this in general on all agencies and data platforms, you have strong harmonisation power. But in my opinion, imposing such strict guidelines is beyond DCAT-AP.

bertvannuffelen commented 5 months ago

@matthiaspalmer, do you see any aspects to continue this issue?

At this moment it is a two-person conversation, and I do not yet see how the proposal can lead to guidelines/constraints that restrict the application of DCAT-AP within its scope.

Otherwise the proposal is to close the issue with release 3.0.0.

matthiaspalmer commented 5 months ago

@bertvannuffelen I originally thought that the suggestions could be seen as recommendations / clarifications / guidelines for those who feel a bit unsure of how to express the combination of datasets and data services.

I still think that they would make sense to have in chapter 14. (Two persons have also given their thumbs up, even if they have not responded in more detail.) Maybe it would help if the points were clarified further as recommendations using SHOULD and MAY, with no MUST?

I comment on examples, point by point:

  1. I understand, I did not consider this. However, I think your example is not the most common one, so the wording SHOULD instead of MUST still makes the point valid?
  2. This is not a counter example, I only stated that the relation in this direction is not a requirement (compare with the similar discussion about inverse relations).
  3. Yes, this is about the capabilities of the service and clarifying that you are not required to provide a distribution for every format that the service provides. I know people feel this need today, so clarifying it seems like a good idea. Providing supported formats on the data service is an option, not a requirement, so perhaps change the wording to MAY instead of SHOULD here.
  4. I disagree with you here. The accessURL should provide the parametrization, not the downloadURL. In fact, I think it is wrong to provide a downloadURL at all here (unless the URL gives you ALL data in a single request which is seldom the case when it comes to geospatial services).
  5. I disagree here as well, I would point to the huge downloadable file via a separate distribution rather than try to combine it with the distribution corresponding to the data service. Even if they have the same format, the access mechanism is different and would benefit from separate metadata descriptions. I think this is a good rule of thumb to always separate them for this purpose.
  6. See 4.

bertvannuffelen commented 5 months ago

@matthiaspalmer I agree that there is a feeling among the broader audience that some more guidance is needed.

But I believe that at this moment your suggestions get too easily challenged, meaning that they become more like "Consider" statements than MAY/SHOULD/MUST statements.

My hesitation with these guidelines comes from the fact that they impact the design of existing data sharing applications. So not the metadata, but the actual data sharing platforms. If the metadata imposes rules like "the download URL must be on the same domain as the API URL", then this is not a rule for metadata, but a rule that real systems must implement. And I think this is beyond the mandate of DCAT-AP. Even if it is added as a suggestion, it will lead a metadata editor to question the system architecture. That is not a fair situation to put the metadata editor in.

In addition, these guidelines will be challenged further when personal data is considered. There, many hurdles lie between the notion of a Dataset and the concrete access (Distribution/DataService). As the objective of DCAT-AP is to support all Data Spaces, I am reluctant to impose detailed guidelines now that relate the values of one class (DataService) to those of another (Distribution).

I think one aspect of the formulation that makes me reluctant is that the rules are very universal: everyone should apply them. With these rules we are in effect harmonising one specific usage pattern. We should therefore first describe the usage patterns and then maybe provide guidelines for them.

  1. A Dataset without any public access to the concrete data
  2. A Dataset with only file dumps
  3. A Dataset with only API access
  4. A Dataset with a mixture of file and API access.
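
As a rough illustration, the four patterns above can be told apart from two counts per Dataset: the number of file-dump Distributions and the number of DataServices giving access to it. This is only a sketch; the names `AccessPattern` and `classify` are hypothetical, and how the counts are obtained (for example via dcat:accessService or dcat:servesDataset) is left open.

```python
from enum import Enum


class AccessPattern(Enum):
    NO_PUBLIC_ACCESS = "Dataset without any public access to the concrete data"
    FILE_ONLY = "Dataset with only file dumps"
    API_ONLY = "Dataset with only API access"
    MIXED = "Dataset with a mixture of file and API access"


def classify(file_dumps: int, data_services: int) -> AccessPattern:
    """Map the counts of file dumps and data services to a usage pattern."""
    if file_dumps < 0 or data_services < 0:
        raise ValueError("counts must be non-negative")
    if file_dumps and data_services:
        return AccessPattern.MIXED
    if data_services:
        return AccessPattern.API_ONLY
    if file_dumps:
        return AccessPattern.FILE_ONLY
    return AccessPattern.NO_PUBLIC_ACCESS


print(classify(0, 2).name)  # API_ONLY
print(classify(3, 1).name)  # MIXED
```

Guidelines could then be attached per pattern instead of being stated universally.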

In the last usage pattern it is common that one team is responsible for the whole access provisioning of the Dataset. This often results in a dedicated data access setup for that Dataset, which has the additional effect that the Distribution accessURL and/or downloadURL are strongly connected with the DataService endpointURL.

This attempt is still very shallow and not yet close to the rules you proposed. But I hope this already gives a sense of what we can describe.