Specific property for conditions for re-use instead of dct:rights in the context of NSIP/ESAP

SEMICeu / DCAT-AP

This is the issue tracker for the maintenance of DCAT-AP

https://joinup.ec.europa.eu/solution/dcat-application-profile-data-portals-europe


Specific property for conditions for re-use instead of dct:rights in the context of NSIP/ESAP #259

Open jakubklimek opened 1 year ago

jakubklimek commented 1 year ago

I am not sure whether this is the right place for this discussion (if not, feel free to move the issue elsewhere), but I am reading the "Technical recommendation for member states" regarding ESAP/NSIP and I have a concern regarding the usage of dct:rights for Conditions for re-use (Rights) (M).

See the dct:rights definition in DCAT, it is more generic:

A statement that concerns all rights not addressed with dcterms:license or dcterms:accessRights, such as copyright statements.

And the definition in Dublin Core even more so:

Information about rights held in and over the resource.

"Conditions for re-use" seems a lot more specific to me, especially since it is a requirement in the DGA itself, and therefore I think it deserves a special property, e.g. m8g:conditionsForReuse.

In terms of the SEMIC style guide, I think this should be a new property, or a case of Reuse of a property with semantic adaptations, but definitely not a case of just Reuse of a property with terminological adaptations.

There seems to be kind of an urgency, data.europa.eu seems to be set on implementing this as dct:rights. The argument is that 2 weeks is not enough time to come up with a proper solution in communication with SEMIC 🤷🏻‍♂️.

kuldaraas commented 1 year ago

+1.

The essence of the Data Governance Act is that for each published dataset description you describe explicitly:

the possibility to use anonymised / pseudonymised etc. versions of the data;
the possibility to use a secure processing environment;
conditions that preserve the integrity of the functioning of the technical systems of the secure processing environment;
a confidentiality obligation that prohibits the disclosure of any information that jeopardises the rights and interests of third parties that the re-user may have acquired despite the safeguards put in place;
information on fees.

As such, the information that needs to be recorded is fundamentally different from the scope of the PSI-associated "rights", and I strongly support the proposal of adding a new property.

bertvannuffelen commented 1 year ago

@kuldaraas I agree that the DGA is asking for information beyond what is usually covered by DCAT-AP's rights and licences.

In line with other comments on the legal side, I believe that DCAT-AP's rights and licences should be maintained to capture the usage expressions from the legal perspective (thus in line with the current semantics).

However, the DGA is also about data that is beyond the PSI directive. This data may historically have no rights or licence descriptions, or may not have been described in metadata at all because it was considered out of scope of an (open) data portal.

So it is good that we think over how the information requirements expressed in the DGA can be fitted into DCAT-AP. E.g. fee information is typically something one could have in the licence: "In order to reuse this dataset a fee is required." The DGA, however, specifies that fee information should be provided in such a wording that the rights and obligations are clear. Moreover, in this case, MS must provide a mechanism to calculate fees that is applicable in their jurisdiction.

So the same pattern will probably arise for each topic. There is a right (or clause in the licence) that expresses the rules generically, and the more specific conditions are found in a separate document (e.g. r5r:rights_fees). The range of those properties is a document, which could be extended with some machine-processable properties. But as indicated in the example for fees, that might be MS specific. That would mean an investigation of all MS legislation implementing this before we can add something here.

Alternatively, we could go for dct:rights, whose range is a dct:RightsStatement and which could be interpreted as a Document. In that case we add a codelist to dct:RightsStatement, like r5r:dga_qualification with values {anonymisation|secureprocessing|...|fees}, so that the rights get classified according to the necessary chapters in the DGA.
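To make the codelist idea above concrete, here is a minimal sketch of what the resulting metadata could look like, rendered as Turtle from plain Python. It is an illustration under assumptions, not an agreed model: the r5r: namespace URI is a placeholder (the thread does not give it), the codelist values are assumed to be concepts in that same namespace, and only the three values named above are included, since the remaining ones are elided in the comment.

```python
# Hypothetical namespace for the r5r: prefix; the thread does not give its URI.
R5R_NS = "http://example.org/r5r#"

# Only the values named in the comment; the elided ones ("...") are not guessed.
DGA_QUALIFICATIONS = ("anonymisation", "secureprocessing", "fees")


def rights_statement_turtle(subject_iri, statements):
    """Render Turtle that attaches classified rights documents to a resource.

    statements maps the IRI of a rights document to one DGA qualification.
    """
    if not statements:
        raise ValueError("at least one rights statement is required")
    for qualification in statements.values():
        if qualification not in DGA_QUALIFICATIONS:
            raise ValueError(f"unknown DGA qualification: {qualification!r}")

    lines = [
        "@prefix dct: <http://purl.org/dc/terms/> .",
        f"@prefix r5r: <{R5R_NS}> .",
        "",
    ]
    # One dct:rights triple per document, written as a Turtle object list.
    objects = " ,\n        ".join(f"<{iri}>" for iri in statements)
    lines.append(f"<{subject_iri}> dct:rights {objects} .")
    # Each document is typed and classified with the assumed codelist property.
    for doc_iri, qualification in statements.items():
        lines.append("")
        lines.append(f"<{doc_iri}> a dct:RightsStatement ;")
        lines.append(f"    r5r:dga_qualification r5r:{qualification} .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rights_statement_turtle(
        "http://example.org/distribution/1",
        {
            "http://example.org/rights/fees": "fees",
            "http://example.org/rights/secure": "secureprocessing",
        },
    ))
```

The sketch keeps each rights statement as an opaque document and only adds the classification, which matches the trade-off described in this comment: quick to apply, but with no machine-processable detail about, say, how a fee is calculated.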

The advantage of the last approach is that it is quickly applicable, and strongly ties the common approach of legal statements with the metadata properties in DCAT-AP for that usage. The drawback is that machine-processable extensions are blocked: properties for fees are different from those for secure processing etc. Another drawback is that the metadata cannot express easily constraints like for the DGA one must provide ... or any specific requirements for that aspect.

Based on these pros and cons I tend towards the latter.

jakubklimek commented 1 year ago

Assuming we can distinguish DGA datasets in a data portal (#260), the approach with a (DGA-specific?) codelist for classifying RightsStatement documents attached using dct:rights seems feasible, and I do not even see the problem with:

Another drawback is that the metadata cannot express easily constraints like for the DGA one must provide ... or any specific requirements for that aspect.

if we view DGA datasets as separate datasets from the PSI datasets - then the rights on the DGA datasets apply only to those, and there can be PSI datasets with similar contents with other rights.

I would like to see a more machine-processable description of the rights, but that really seems unfeasible in the given timeframe and given the MS specifics.

H-a-g-L commented 1 year ago

If we agree that dct:rights is the correct property for denoting re-use conditions, I am still wondering if the dcat:Dataset domain shouldn't be replaced by dcat:Distribution. In fact, the DGA mostly refers to re-use of data - which I read as "the specific representation of a Dataset", i.e. Distribution (cfr. art. 5 of the DGA).

Considering that a Distribution

represents a general availability of a dataset. It implies no information about the actual access method of the data [...]

I suppose it would be semantically correct to associate re-use conditions with this class. In fact the example provided in the guidelines (6.1.3) provides dct:rights for the dcat:Distribution.

However, this leads to two more doubts:

In the guidelines, the conditions for re-use are to be relayed via dcat:accessURL. I do not find this to be compatible with DCAT's definition of the property as:

the URL of a service or location that can provide access to this distribution, typically through a Web form, query or API call.

which, IMHO, is not the same as providing information on how to request access. For this, dct:rights of a dcat:Distribution (unless another specific property is created) would be a better choice. The dct:RightsStatement range for this property recommends

to refer to a rights statement with a URI.

which is compatible with how NSIPs will provide this information but would also allow for literal values if necessary.
If the conditions are not relayed via dcat:accessURL, I imagine many such Distributions will not be able to provide a dcat:accessURL, which is a Mandatory property for DCAT-AP.

I guess the question is: do we imagine that non-public datasets will have one reference to a document specifying the rights associated with re-use of the data and another one specifying the concrete conditions that govern it (e.g. who may have access, whether data needs to be anonymised first, limits for re-use and sharing)?
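The two doubts above can be restated as a simple completeness check on a Distribution description. The following is only a sketch: the distribution is modelled as a plain dictionary keyed by prefixed property names, and the function name check_distribution is hypothetical. It assumes what the thread states, namely that dcat:accessURL is mandatory in DCAT-AP and that conditions for re-use are marked (M) in the technical recommendation.

```python
def check_distribution(distribution):
    """Return a list of problems found in one distribution description.

    distribution is a plain dict keyed by prefixed property names,
    e.g. {"dcat:accessURL": "...", "dct:rights": "..."}.
    """
    problems = []
    # dcat:accessURL is a mandatory property of a Distribution in DCAT-AP.
    if not distribution.get("dcat:accessURL"):
        problems.append("missing dcat:accessURL")
    # Conditions for re-use are expected via dct:rights in the recommendation.
    if not distribution.get("dct:rights"):
        problems.append("missing dct:rights")
    return problems


if __name__ == "__main__":
    non_public = {"dct:rights": "http://example.org/rights/reuse-conditions"}
    print(check_distribution(non_public))
```

A non-public Distribution that only carries a rights document, as in the usage lines, is reported as lacking dcat:accessURL, which is exactly the situation the second doubt describes.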

jakubklimek commented 1 year ago

@ODP-hil That is an interesting point. From my point of view, the conditions of reuse are somehow similar to the license, which is also preferably attached to Distribution, not Dataset.

On the other hand, for non-public data, the distribution itself may not even exist at the time of creating the catalog record, e.g. in cases where the distribution is created on demand, based on negotiations. From that point of view, the conditions of re-use would belong to the dataset, as a basis for negotiation about how to get access. Unless we want to make the publishers specify ahead of time which distributions of the dataset they support, i.e. in which data formats, including the mandatory byteSize, which may also be the case.

I do not see a conflict between "the URL of a service or location that can provide access to this distribution" and conditions for re-use, if we say that "can provide access" can mean that it tells you how to request access (regarding accessURL).

bertvannuffelen commented 1 year ago

@ODP-hil I am not sure whether the objective of the DGA is to describe in a machine-processable way how a sensitive dataset can be accessed. In the practices I know, you need to get a token/key from the owner of the dataset to query for the data. In many cases, a different token is provided for each purpose (because of the GDPR), and more or less data can be provided depending on the purpose.

For instance, to fine me when I drive my 20-year-old diesel car, the owner and their residence address can be requested based on my vehicle licence plate, but in that case it is illegal to request who the owner's children are. That information is held in the same dataset, but the GDPR blocks the request based on the purpose.

That example illustrates that while the legal rights and the access conditions are the same on an abstract level, the practical execution can be very different, and thus it might be hard to expect the practical layer from the metadata.
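The purpose-bound access in the vehicle example can be sketched as a small filter. Everything named here is hypothetical and chosen only to mirror the example: the purpose label "traffic_fine", the field names, and the query_vehicle function. It does not describe any real registry or any mechanism prescribed by the DGA.

```python
# Hypothetical mapping from a purpose to the fields that purpose may see.
ALLOWED_FIELDS = {
    "traffic_fine": {"owner_name", "residence_address"},
}


def query_vehicle(record, purpose, requested_fields):
    """Return the requested fields if the purpose permits all of them."""
    allowed = ALLOWED_FIELDS.get(purpose, set())
    denied = sorted(set(requested_fields) - allowed)
    if denied:
        # The data exists in the record, but this purpose may not see it.
        raise PermissionError(f"purpose {purpose!r} may not access {denied}")
    return {field: record[field] for field in requested_fields}


if __name__ == "__main__":
    record = {
        "owner_name": "A. Example",
        "residence_address": "1 Example Street",
        "owner_children": ["B. Example"],
    }
    print(query_vehicle(record, "traffic_fine", ["owner_name", "residence_address"]))
```

The same record answers differently depending on the purpose attached to the request, which is the part that dataset-level metadata would struggle to capture.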

H-a-g-L commented 1 year ago

@jakubklimek @bertvannuffelen thanks for the comments and examples. I consider my doubts resolved and agree with the suggested use of dcat:accessURL. As for dct:rights, I would suggest that the revised guidelines indicate its use for Distributions (as in the example) and not for Datasets (as in the table).