Versioning datasets - GitHub issues

bfrichet3 commented 2 years ago

Dear

I am coming here to report a big concern I have about the DCAT-AP profile. My concern is about the way DCAT-AP feeds should manage different versions of one dataset. As DCAT-AP is designed, multiple versions of the same dataset are not listed under a single Dataset class; each version has to be listed as a separate Dataset class.

I strongly disapprove of this way of modelling metadata, for at least three reasons:

1) Most existing data portals do not work like that, at least in Belgium. As you can see here (https://data.gov.be/nl/dataset/bd715280-d41e-11eb-b22d-7478273ff935), a single Dataset class gives access to many versions of the same data (years 2020, 2019, ...). The proposal therefore goes against the de facto standard (in Belgium).

2) If you use different Dataset classes to reference different versions, you will have to write the same metadata elements again for each dataset you add to your catalog. You will have to write the title, summary and responsible party many times. In my experience this is a very bad idea, because inconsistencies will creep into the way these elements are written. For instance, if I have two datasets concerning Administrative Units in 2021 and in 2022, I may end up with these two different titles: 1) Administrative units (2021); 2) Administrative Units - 2022.

This would go against the Once-Only principle, and it would lower the internal quality of our catalogs as experienced by our end users.

3) If you use different Dataset classes to reference different versions of the same dataset, you will artificially inflate the number of datasets in your catalog and the number of HTML pages in your human-readable data portal. In the end, the catalog will be harder to understand for its end users.

Therefore, I strongly recommend not multiplying Dataset classes to reference many versions of the same conceptual dataset, and using many Distribution classes instead, as suggested in this issue (https://github.com/SEMICeu/DCAT-AP/issues/197), which you closed.
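
To make the difference concrete, here is a minimal sketch of the two modelling options as plain Python dictionaries keyed by DCAT and DCTERMS property names. It is only an illustration: every identifier and URL under example.org is made up, and the helper names are hypothetical, not part of DCAT-AP or of any portal software.

```python
# Hypothetical catalog records; all example.org URIs are invented for illustration.
YEARS = [2020, 2021, 2022]
PUBLISHER = "https://example.org/id/org/agency"


def one_dataset_many_distributions(years):
    """Option A: one Dataset, each yearly version is a Distribution."""
    return [{
        "@id": "https://example.org/id/dataset/admin-units",
        "@type": "dcat:Dataset",
        "dcterms:title": "Administrative units",
        "dcterms:publisher": PUBLISHER,
        "dcat:distribution": [
            {
                "@type": "dcat:Distribution",
                "dcterms:title": f"Administrative units {year}",
                "dcat:downloadURL": f"https://example.org/files/admin-units-{year}.csv",
            }
            for year in years
        ],
    }]


def one_dataset_per_version(years):
    """Option B: one Dataset per yearly version, each with one Distribution."""
    return [
        {
            "@id": f"https://example.org/id/dataset/admin-units-{year}",
            "@type": "dcat:Dataset",
            "dcterms:title": f"Administrative units ({year})",
            "dcterms:publisher": PUBLISHER,
            "dcat:distribution": [{
                "@type": "dcat:Distribution",
                "dcat:downloadURL": f"https://example.org/files/admin-units-{year}.csv",
            }],
        }
        for year in years
    ]


def times_written(catalog, prop):
    """Count how often a Dataset-level property has to be written."""
    return sum(1 for dataset in catalog if prop in dataset)


option_a = one_dataset_many_distributions(YEARS)
option_b = one_dataset_per_version(YEARS)

for name, catalog in (("A", option_a), ("B", option_b)):
    print(name, "datasets:", len(catalog),
          "publisher written:", times_written(catalog, "dcterms:publisher"))
```

With three yearly versions, option A gives one dataset page and one copy of the Dataset-level metadata, while option B gives three of each.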

Regards,

Benoît

bertvannuffelen commented 2 years ago

@bfrichet3 thank you for this contribution.

A first comment is that we would like to invite you to post this issue at W3C (https://github.com/w3c/dxwg/issues), because we have postponed any decision on versioning and series until W3C has finalised its discussion.

Secondly, I think we should not mix data quality with the DCAT-AP profile. The DCAT-AP profile is there to ensure that we understand the same thing when we share an entity (dataset, distribution, data service) among us. Duplicates, variants, etc. are beyond the profile and are part of the implementations of an ecosystem.

And unfortunately I think this is a battle we cannot win, nor should aim to win. For example, if a dataset moves after 5 years to another organisation (e.g. a digital archive), then its title there will never change again. If new guidelines or updates are later applied to the title of the "living" dataset, you get differences. Or if Eurostat publishes an aggregated dataset based on data from a Member State (MS), then the MS part can be different because in the meantime the MS has more recent data. Even if there is no difference in the data, the name of the file at the MS might differ from the one Eurostat is using. These kinds of cross-organisation versions, variants and series exist, and therefore enforcing a uniform naming convention and labelling through a (DCAT-AP) profile is impossible.

Nevertheless, we should ensure that the DCAT-AP profile gives its users the means to provide the information that makes discovery of closely related entities possible. For instance, a good practice (even though this is again an implementation choice) would be that every entity gets a persistent web-accessible identifier (PURI). Once that is available, one can create a cross-organisational network, because one can rely on the stability and existence of the identifier. So in my example cases, the living dataset can point to the archive as part of a series; the Eurostat dataset can point to the MS dataset as one of its sources. But again, the profile will not be able to enforce PURIs, as this is considered a matter for implementations, unless we as a community endorse an implementation guide stating that PURIs are a good practice.
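
As a rough illustration of why stable identifiers matter here, the sketch below follows links between catalog records to find everything related to one dataset. It assumes the records are plain dictionaries, that the series link is expressed with `dcat:inSeries` and the source link with `dcterms:source`, and that all example.org URIs are invented; none of this is prescribed by the profile.

```python
from collections import defaultdict, deque

# Hypothetical persistent identifiers (PURIs); all URIs are invented.
SERIES = "https://example.org/id/series/admin-units"
LIVING = "https://example.org/id/dataset/admin-units-living"
ARCHIVE = "https://archive.example.org/id/dataset/admin-units-archived"
AGGREGATE = "https://stats.example.org/id/dataset/eu-aggregate"

RECORDS = [
    {"@id": LIVING, "dcat:inSeries": [SERIES]},
    {"@id": ARCHIVE, "dcat:inSeries": [SERIES]},
    {"@id": AGGREGATE, "dcterms:source": [LIVING]},
]

# Properties treated as links between entities in this sketch.
LINK_PROPERTIES = ("dcat:inSeries", "dcterms:source")


def related_entities(records, start_id):
    """Return the described entities reachable from start_id via link properties."""
    graph = defaultdict(set)
    for record in records:
        for prop in LINK_PROPERTIES:
            for target in record.get(prop, []):
                # Links are followed in both directions.
                graph[record["@id"]].add(target)
                graph[target].add(record["@id"])

    # Breadth-first search over the link graph.
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)

    # Keep only entities that have their own record, excluding the start.
    described = {record["@id"] for record in records}
    return sorted((seen & described) - {start_id})


print(related_entities(RECORDS, LIVING))
```

The traversal only works across organisations if each identifier keeps resolving to the same entity, which is the point of a PURI.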

A quick note on Once-Only: I am not sure how to fit this into the metadata story, because a single data entity is shared through multiple channels and formats, often aggregated with various other sources. All of these are part of our catalogues, so the base governmental data (for instance the administrative units of a country) are probably exposed via thousands of datasets/services (e.g. Eurostat CSVs contain the administrative units of an MS). I can imagine that a company that builds a baseline visualisation on the Eurostat CSV might use the administrative units inside the Eurostat CSV instead of downloading the original sources of each MS individually. So I am interested in what Once-Only means for you w.r.t. metadata descriptions, as I considered it more an aspect of the actual data and actual data exchange than a story for metadata.

bertvannuffelen commented 5 months ago

As explained in the past webinars, we will follow the W3C DCAT approach for Dataset Series.
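
For readers unfamiliar with that approach, here is a minimal sketch of yearly versions modelled as members of a series, using the DCAT 3 terms `dcat:DatasetSeries`, `dcat:inSeries` and `dcat:prev`. The `build_series` helper, the title template and the example.org URIs are assumptions made for illustration, not something DCAT or DCAT-AP defines.

```python
def build_series(series_id, title, years):
    """Build a series record plus one member Dataset per year (hypothetical helper)."""
    series = {
        "@id": series_id,
        "@type": "dcat:DatasetSeries",
        "dcterms:title": title,
    }
    members = []
    previous_id = None
    for year in sorted(years):
        member = {
            "@id": f"{series_id}/{year}",
            "@type": "dcat:Dataset",
            # Member titles come from one template, so they stay consistent.
            "dcterms:title": f"{title} ({year})",
            "dcat:inSeries": series_id,
        }
        if previous_id is not None:
            # Point each member at the one that precedes it in the series.
            member["dcat:prev"] = previous_id
        members.append(member)
        previous_id = member["@id"]
    return series, members


series, members = build_series(
    "https://example.org/id/series/admin-units",
    "Administrative units",
    [2022, 2020, 2021],
)
for member in members:
    print(member["dcterms:title"], "->", member.get("dcat:prev"))
```

Generating the member titles from the series title is one possible way to limit the inconsistencies raised in the original post; it remains an implementation choice.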

SEMICeu / DCAT-AP

Versioning datasets #203