dcat:Distribution may also be an embedded entity

pietercolpaert commented 8 months ago

After discussion with the people behind Piveau, it appears distributions are oftentimes blank nodes. In that case, the distribution cannot be a standalone entity, because for standalone entities a named node is required.

I propose to add in the spec that distributions can also be blank nodes on the condition they become embedded entities in the dcat:Dataset.

matthiaspalmer commented 8 months ago

I strongly disagree here for several reasons.

First, the DCAT-3 specification says:

Section 5.2 (RDF considerations): "... it is recommended that instances of the DCAT main classes have a global identifier, and use of blank nodes is generally discouraged when encoding DCAT in RDF."
Section 5.1 (DCAT Scope): Lists dcat:Distribution as one of seven main classes.

Second, handling updates of distributions with blank nodes is tricky for implementors. I see three main approaches:

1. You always keep the distributions in the same named graph as the dataset, because otherwise you would need to distinguish the two cases every time, which is error-prone.
2. Depending on whether the distribution is a blank node or has a URI, you store the information in the same or in a separate named graph.
3. You try to normalize and introduce URIs in the harvesting step when there are blank nodes.

Clearly, option 2 is the least desirable, as it leads to complicated harvesting and also complicated requests in the frontend. Option 1 sounds good on paper, but the risk is that the problem spreads. (I think this has happened on data.europe.eu, where even the data services are stored in the same graph because they are reachable from a distribution. This leads to a lot of duplicated triples and makes it much harder to provide a view of data services, which have a more independent character. I am unsure whether it has also spread to contact points and publishers.)

Hence, option 1 risks spreading its bad influence and causing other problems. What if a data service is represented as a blank node and referenced from many datasets?

I think option 3 is the best option, as it treats blank nodes as wrong and mints new URIs in the harvesting step, using a mechanism that keeps the minted URIs at least semi-stable. The intent of option 3 is to push back on data publishers that use blank nodes, in the hope that over time we can be stricter about what we accept in the harvesting step.
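
To make option 3 concrete, here is a minimal sketch of one possible minting mechanism. It is only an illustration, not what any existing harvester does: triples are plain `(subject, predicate, object)` string tuples, blank nodes are strings starting with `_:`, and the base URI `https://example.org/distribution/` and the choice of hashing the dataset URI plus the access URLs are assumptions made for this example.

```python
import hashlib

# Real DCAT property IRIs.
DISTRIBUTION = "http://www.w3.org/ns/dcat#distribution"
ACCESS_URL = "http://www.w3.org/ns/dcat#accessURL"


def is_blank(term):
    # Convention for this sketch only: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def mint_distribution_uris(triples, base="https://example.org/distribution/"):
    """Replace blank-node distributions with URIs derived from stable data.

    The URI is a hash of the dataset URI and the distribution's access URLs,
    so it does not depend on the (arbitrary) blank node label.
    """
    mapping = {}
    for s, p, o in triples:
        if p == DISTRIBUTION and is_blank(o):
            # Sort the access URLs so the key does not depend on triple order.
            urls = sorted(v for (s2, p2, v) in triples
                          if s2 == o and p2 == ACCESS_URL)
            key = "|".join([s] + urls)
            digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
            mapping[o] = base + digest
    # Rewrite every occurrence of a mapped blank node, as subject or object.
    return [(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in triples]
```

With this scheme, two harvests of the same source yield the same URI even if the blank node label changes between runs. The URI is only semi-stable: if the publisher changes the access URL, a new URI is minted.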

Basically, what I am saying is: when we have the chance to define a new protocol, let's design it in a way that forces people to solve problems earlier in the chain rather than etching the problems into the protocol itself.

pietercolpaert commented 8 months ago

I would still strongly discourage it, but it remains possible, so we need to make sure it works. Having it as a fallback might be useful.

matthiaspalmer commented 8 months ago

But where do we draw the line? Which standalone entities should be allowed to be provided as blank nodes? Why only distributions?

I would suggest a motivating underlying rule: standalone entities that might be reused (pointed to by more than one triple) must always appear with URIs in separate named graphs.

Under this rule, I think distributions are the only standalone entities that would be allowed (although discouraged) as blank nodes.
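
The rule above could be checked mechanically. The following is a minimal sketch under the same simplifying assumptions as before: triples are `(subject, predicate, object)` string tuples and blank nodes are strings starting with `_:`. The function name `reused_blank_nodes` is hypothetical and not part of the spec.

```python
from collections import Counter


def reused_blank_nodes(triples):
    """Return the blank nodes that are the object of more than one triple.

    Under the proposed rule, such nodes are being reused and should have
    been published with a URI instead.
    """
    # Deduplicate first, so a repeated identical triple is not counted twice.
    incoming = Counter(o for (_s, _p, o) in set(triples) if o.startswith("_:"))
    return sorted(node for node, count in incoming.items() if count > 1)
```

A blank-node distribution referenced only from its own dataset passes this check, while a blank-node data service referenced from two distributions is reported.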

pietercolpaert commented 8 months ago

I like this wording and agree!

pietercolpaert commented 7 months ago

I’ll close this discussion now: the spec points out in a note that using a dcat:Distribution as an embedded entity won’t break anything, but we don’t consider it good practice.

SEMICeu / LDES-DCAT-AP-feeds

dcat:Distribution may also be an embedded entity #6