SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP
https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe

[HVD 2.2] Dataset - dataset distribution, wrong cardinality (?) #376

Open jimjyang opened 1 month ago

jimjyang commented 1 month ago

In DCAT-AP HVD v.2.2.0, the cardinality for the property dataset distribution (dcat:distribution) in the class Dataset (dcat:Dataset) is 1..*, while it is "recommended" in the table under "A. Quick reference ...".

We support "recommended" since, as we understand the HVD IR, not all HVD datasets are required to be made available as bulk download; they may also be made available through APIs.

Thus, the cardinality should be 0..*.
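A minimal sketch of what the two cardinalities mean in practice, assuming a dataset that is described only through dcat:servesDataset and has no dcat:distribution. The `ex:` identifiers and the helper name are hypothetical, and triples are modelled as plain Python tuples rather than with an RDF library.

```python
# Triples as (subject, predicate, object) tuples; all ex: identifiers are made up.
TRIPLES = [
    ("ex:bulkDataset", "rdf:type", "dcat:Dataset"),
    ("ex:bulkDataset", "dcat:distribution", "ex:csvDownload"),
    ("ex:apiOnlyDataset", "rdf:type", "dcat:Dataset"),
    ("ex:api", "rdf:type", "dcat:DataService"),
    ("ex:api", "dcat:servesDataset", "ex:apiOnlyDataset"),
]


def datasets_violating_min_distribution(triples, min_count):
    """Return the datasets with fewer than min_count dcat:distribution values."""
    datasets = sorted(
        s for s, p, o in triples if p == "rdf:type" and o == "dcat:Dataset"
    )
    counts = {d: 0 for d in datasets}
    for s, p, o in triples:
        if p == "dcat:distribution" and s in counts:
            counts[s] += 1
    return [d for d in datasets if counts[d] < min_count]


print(datasets_violating_min_distribution(TRIPLES, 1))  # check against 1..*
print(datasets_violating_min_distribution(TRIPLES, 0))  # check against 0..*
```

Under 1..* the API-only dataset is reported as non-conformant; under 0..* nothing is reported.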

jakubklimek commented 1 month ago

@jimjyang in the case of an API, you have Dataset ---distribution---> Distribution ---accessService---> DataService, so even in this case there will be a distribution. And it does not make sense to mark something as HVD if it does not have a distribution.

jimjyang commented 1 month ago

@jakubklimek A Dataset may also be made available through an API: DataService -- servesDataset ---> Dataset.

What I was wondering was whether I should also raise an issue about the missing inverse property of dcat:servesDataset, so that from a Dataset you can directly see which DataService gives access to it. W3C doesn't have this inverse property.
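The two representations mentioned so far (Dataset to Distribution to DataService, and DataService to Dataset) can be put side by side in a small sketch. It is only an illustration: the `ex:` identifiers and the function name are hypothetical, and triples are plain tuples rather than a real RDF graph.

```python
TRIPLES = {
    # Representation 1: Dataset --distribution--> Distribution --accessService--> DataService
    ("ex:datasetA", "dcat:distribution", "ex:distA"),
    ("ex:distA", "dcat:accessService", "ex:serviceA"),
    # Representation 2: DataService --servesDataset--> Dataset
    ("ex:serviceB", "dcat:servesDataset", "ex:datasetB"),
}


def data_services_for(dataset, triples):
    """Collect the DataServices reachable from a dataset via either representation."""
    via_distribution = {
        service
        for s, p, dist in triples
        if s == dataset and p == "dcat:distribution"
        for d, q, service in triples
        if d == dist and q == "dcat:accessService"
    }
    via_serves_dataset = {
        s for s, p, o in triples if p == "dcat:servesDataset" and o == dataset
    }
    return via_distribution | via_serves_dataset


print(data_services_for("ex:datasetA", TRIPLES))
print(data_services_for("ex:datasetB", TRIPLES))
```

A consumer that follows only one of the two paths would miss one of the services here, which is why guidance on which representation to use matters.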

jakubklimek commented 1 month ago

@jimjyang I see. In that case, it would also be interesting to include some sort of guidance on when a publisher should use which representation - with or without distribution, and possibly adjust the cardinality of dcat:distribution accordingly. But if I recall correctly, HVD does not include standalone Data Services, only those distributing an HVD dataset, and therefore the distribution would be required.

With the inverse property, again, I do not see the problem of using servesDataset, as we are talking about RDF, so there should be no problem in including a triple

<DataService> dcat:servesDataset <Dataset>

in a response to a request for data about <Dataset>.

jimjyang commented 1 month ago

@jakubklimek As we understand the HVD IR, it specifies when an HVD dataset must be available both as bulk download and through an API, and when only one of them is required.

In addition to the requirements in HVD IR, the bullet points in 14.1 Usage guide on Datasets, Distributions and Data Services in DCAT-AP are absolutely useful.

jimjyang commented 1 month ago

@jakubklimek Reg. the need for an inverse property of dcat:servesDataset

There is of course no problem with the direction of dcat:servesDataset.

The other direction: when a re-user discovers <dataset1>, if <dataset1> doesn't have that missing inverse property, how could the re-user know about <dataService1>, by only reading the description of <dataset1>?

The same difficulty applies to reporting: to know whether an HVD-marked dataset is made available when it doesn't have any distribution (bulk download), you need to either 1) add that missing inverse property to the description of the dataset at the time of harvesting/registration, or 2) look through all the DataServices in your catalog (which may be many, many more than the API(s) for this particular dataset) each and every time you do the reporting, which consumes resources unnecessarily. Or how else would you do that?
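One possible way to approach the reporting case, sketched under stated assumptions: the set of HVD dataset identifiers is passed in as a hypothetical input (how HVD datasets are marked is not modelled here), triples are plain tuples, and the function name is made up. The catalogue is read once per report, not once per dataset.

```python
from collections import defaultdict

TRIPLES = [
    ("ex:dataset1", "dcat:distribution", "ex:dist1"),
    ("ex:dataService1", "dcat:servesDataset", "ex:dataset2"),
    ("ex:dataService2", "dcat:servesDataset", "ex:dataset2"),
    ("ex:dataService3", "dcat:servesDataset", "ex:otherDataset"),
]
HVD_DATASETS = {"ex:dataset1", "ex:dataset2", "ex:dataset3"}  # assumed input


def availability_report(triples, hvd_datasets):
    """Report, per HVD dataset, whether it has a distribution and which services serve it."""
    has_distribution = set()
    served_by = defaultdict(set)
    for s, p, o in triples:  # single pass over the whole catalogue
        if p == "dcat:distribution":
            has_distribution.add(s)
        elif p == "dcat:servesDataset":
            served_by[o].add(s)  # reverse lookup: dataset -> services
    return {
        d: {
            "has_distribution": d in has_distribution,
            "data_services": sorted(served_by.get(d, ())),
        }
        for d in sorted(hvd_datasets)
    }


for dataset, status in availability_report(TRIPLES, HVD_DATASETS).items():
    print(dataset, status)
```

In this sketch `ex:dataset3` comes out with neither a distribution nor a serving DataService, which is the case a report would need to flag.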

jakubklimek commented 1 month ago

@jimjyang When asking for a description of <dataset1>, the response can contain e.g.

<dataset1> a dcat:Dataset ;
   # ... more dataset info
   .
<dataservice1> dcat:servesDataset <dataset1> .

This is not connected to the actual direction of dcat:servesDataset . The reuser needs to know that <dataset1> can appear both in the subject and the object position of the returned RDF triples.

What you are describing seems like a limitation of some particular API, that it is unable to include data where the dataset would be in the object position?
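The idea of a description that contains the dataset in both subject and object position can be sketched as follows. This is an illustration only: `describe` and `serving_data_services` are hypothetical helpers, not the behaviour of any particular catalogue API, and triples are plain tuples.

```python
TRIPLES = [
    ("ex:dataset1", "rdf:type", "dcat:Dataset"),
    ("ex:dataset1", "dct:title", "Example dataset"),
    ("ex:dataService1", "dcat:servesDataset", "ex:dataset1"),
    ("ex:dataService2", "dcat:servesDataset", "ex:dataset2"),
]


def describe(resource, triples):
    """Return every triple that mentions the resource as subject or as object."""
    return [t for t in triples if t[0] == resource or t[2] == resource]


def serving_data_services(resource, description):
    """Read the serving DataServices off a description via the object position."""
    return [
        s for s, p, o in description if p == "dcat:servesDataset" and o == resource
    ]


description = describe("ex:dataset1", TRIPLES)
print(description)
print(serving_data_services("ex:dataset1", description))
```

The re-user finds `ex:dataService1` from the description of `ex:dataset1` alone, without an inverse property, provided the response includes triples where the dataset is the object.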

jimjyang commented 1 month ago

@jakubklimek My concern is, how do you (the machine) know it is only <dataService1> but not some of the other thousands of dataServices in the catalog, which also should be included in the response, without using unnecessary machine resources to look through all dataservices in the catalog?

jakubklimek commented 1 month ago

@jimjyang Well, I am not sure about what you mean by unnecessary machine resources. E.g. given that my data catalog is in a SPARQL endpoint, I can simply query like this:

PREFIX dcat: <http://www.w3.org/ns/dcat#>

SELECT * WHERE {
  ?dataset a dcat:Dataset .
  ?dataService a dcat:DataService .

  # and either this:
  ?dataService dcat:servesDataset ?dataset .
  # or this (an inverse property path), or an explicit URL for the inverse:
  ?dataset ^dcat:servesDataset ?dataService .
}

and it makes no difference.

There may be APIs where this makes a difference, but then I would say that it is up to that API to adapt for this situation.

It is still possible that I do not understand the issue, though :)

jimjyang commented 1 month ago

@jakubklimek "I am not sure about what you mean by unnecessary machine resources": What I mean is that, without this missing inverse property in the description of a given Dataset, it requires more machine resources (computing resources) to find all the DataServices that provide access to that Dataset. We can talk about why I think so some other time if you still don't agree :-).

Back to the issue - sorry that I mixed two different discussions in one issue.

1. About the cardinality for the property dataset distribution (dcat:distribution) in the class Dataset:
   1.a. There is an inconsistency between the specification of that property (cardinality 1..*) and annex A ("recommended").
   1.b. We support "recommended" and thus propose to change the cardinality to 0..*.
   1.c. See also a similar issue on dcat:servesDataset, https://github.com/SEMICeu/DCAT-AP/issues/378

2. Missing inverse property of dcat:servesDataset: DCAT specifies an inverse property for "every other property" but not for dcat:servesDataset. So, what property should be used when one for some reason needs to use an inverse property in Dataset (of course in addition to dcat:servesDataset in the DataService)?

bertvannuffelen commented 3 weeks ago

> Back to the issue - sorry that I mixed two different discussions in one issue.
>
> 1. About the cardinality for the property [dataset distribution (dcat:distribution)](https://semiceu.github.io/DCAT-AP/releases/2.2.0-hvd/#Dataset.datasetdistribution) in the class Dataset:
>    1.a. There is an inconsistency between the specification of that property (cardinality 1..*) and annex A ("recommended").
>    1.b. We support "recommended" and propose thus to change the cardinality to 0..*.
>    1.c. See also a similar issue on dcat:servesDataset, [[HVD 2.2] DataService - servesDataset, inconsistency with "A. Quick reference" #378](https://github.com/SEMICeu/DCAT-AP/issues/378)

I agree that there is an inconsistency. Nevertheless, there is a legal aspect here: I checked the Annex, and in almost all cases a bulk download is required. Thus the minimum cardinality is obligatory in 95% of the cases (see Annex 3.2 for an exception). From a policy point of view, a bulk download is closer to mandatory than to recommended.

Therefore let's drive towards this goal and accept the exceptions for now, and see after the first reporting in February 2025 how many comply with it.

My proposal is thus to make the property mandatory for HVD in the quick reference annex.

I would also like to note that HVD is a subset of open data; it is not the norm for all datasets.

> 2. Missing inverse property of dcat:servesDataset: [DCAT specifies inverse property for "every other property"](https://www.w3.org/TR/vocab-dcat-3/#inverse-properties) but dcat:servesDataset. So, what property should be used when one for some reason needs to use an inverse property in Dataset (of course in addition to dcat:servesDataset in the DataService)?

Having read the whole exchange, I still do not grasp the need for an inverse property. Belgium will supply Data Services that serve datasets using the DCAT-AP model, and there is no issue there. So maybe Jakub and I are missing some specific implementation context that explains why it is needed.

jimjyang commented 2 weeks ago

Concerning the HVD IR requirement on bulk download: With "recommended", you may still explain the policy by adding a new section under "10. Mapping the HVD IR to DCAT-AP" (similar to, or together with, what I suggested in the other issue about DataService).

Concerning the inverse property of dcat:servesDataset: I didn't mean that such an inverse property should be included explicitly in DCAT-AP HVD. Rather, I would like to see it listed together with the other inverse properties, so that it can be used when needed. With this inverse property missing, the specification is incomplete.