Open VladimirAlexiev opened 1 month ago
This is a work in progress, and some improvements will be visible in v3.1.0. In real-world scenarios, and especially in big organizations, there is a trade-off between publishing data promptly and taking the time to clean it before publishing. Data quality remains a work in progress, to be addressed once we have good governance in place. Thank you for the feedback.
@gatemezing great! Is there a scoping doc or spec that you can share?
No spec for now. A process has to be established first, because we are talking about legacy data: the registry has an "other value" field where applicants could put whatever they wanted. There will be at least a first cleanup in the official release of v3.1.0.
A lot of the thesauri in https://github.com/Interoperable-data/ERA_vocabulary/tree/main/era-skos are just dumps of values retrieved from the RINF and ERATV databases, with no effort at curation, normalization or merging.
Here are some examples and doubts from era-skos-PlatformHeights.ttl:
Expressing a combination of numbers as a Concept is wrong. Instead, the possibilities should be expressed with RDF constructs and checked with SHACL. For example:
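A minimal sketch of what this could look like (not taken from the ERA vocabulary: the `ex:` namespace, the class and property names, and the values 550 and 760 are all hypothetical placeholders). Each height is stated as its own numeric value on a multi-valued property, and a SHACL shape restricts the allowed values, so no Concept is needed for the combination:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .   # hypothetical namespace

# Hypothetical data: a platform with two heights, each stated as a
# separate value instead of one "550/760" combination Concept.
ex:platform1 a ex:Platform ;
    ex:platformHeight 550, 760 .

# Hypothetical shape: every height must be an integer drawn from a
# controlled list (placeholder values, not an official enumeration).
ex:PlatformShape a sh:NodeShape ;
    sh:targetClass ex:Platform ;
    sh:property [
        sh:path ex:platformHeight ;
        sh:datatype xsd:integer ;
        sh:minCount 1 ;
        sh:in ( 550 760 ) ;
    ] .
```

With a shape like this, a provider exporting a free-text value such as "550 or 760" would get a validation error instead of a silently accepted new Concept.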
Codifying an uncontrolled RINF/ERATV field as a skos:ConceptScheme is actually harmful, since it sends data providers the wrong signal: that they should use these unclean values in the data they export.
I appreciate that cleaning up all these values is a big task. But the longer it's postponed, the greater the harm to interoperability. It should be based on a solid plan: