Interoperable-data / ERA-Ontology-3.1.0

Extended version of the ERA Railway Infrastructure Ontology

clean and normalize ERA SKOS vocabularies #81

Open VladimirAlexiev opened 1 month ago

VladimirAlexiev commented 1 month ago

A lot of the thesauri in https://github.com/Interoperable-data/ERA_vocabulary/tree/main/era-skos are just dumps of values retrieved from the RINF and ERATV databases, with no effort at curation, normalization or merging.

Here are some examples and doubts from era-skos-PlatformHeights.ttl:

Expressing a combination of numbers as a Concept is wrong. Instead, the possibilities should be expressed in some RDF constructs, which should be checked with SHACL. For example:
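As one possible illustration (a sketch, not taken from the vocabulary files themselves): a combined label such as "550+760" could be split into separate numeric values, with each value checked against a controlled list. That check is roughly what a SHACL `sh:in` constraint would enforce on the RDF side. The label format, the function names and the allowed heights below are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical controlled list of platform heights in millimetres.
# These numbers are illustrative assumptions, not the official ERA value list.
ALLOWED_HEIGHTS_MM = {550, 760}


def split_heights(raw: str) -> List[int]:
    """Split a combined label such as '550+760' into separate integers.

    Assumes whole-millimetre values; decimal values are not handled.
    """
    return [int(token) for token in re.findall(r"\d+", raw)]


def check_heights(raw: str) -> Tuple[List[int], List[int]]:
    """Return (valid, invalid) values, similar in spirit to a SHACL sh:in check."""
    values = split_heights(raw)
    valid = [v for v in values if v in ALLOWED_HEIGHTS_MM]
    invalid = [v for v in values if v not in ALLOWED_HEIGHTS_MM]
    return valid, invalid
```

With this approach each height becomes its own statement in the data instead of being folded into a single Concept label, and unexpected values surface as validation failures rather than as new concepts.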

Codifying an uncontrolled RINF/ERATV field as a skos:ConceptScheme is actually harmful, since it sends data providers the wrong signal: that they should use these unclean values in the data they export.

I appreciate that cleaning up all these values is a big task. But the longer it's postponed, the greater the harm to interoperability. It should be based on a solid plan:

gatemezing commented 2 weeks ago

This is actually a work in progress, and some improvements will be visible in v3.1.0. In real-world scenarios, and especially in large organizations, there is a trade-off between publishing data and taking the time to clean it before publishing. Data quality will be an ongoing effort once we have good governance in place. Thank you for the feedback.

VladimirAlexiev commented 2 weeks ago

@gatemezing great! Is there a scoping doc or spec that you can share?

gatemezing commented 1 week ago

No spec for now. A process still has to be established, because we are talking about legacy data: the registry has an "other value" field where applicants could enter whatever they wanted. There will be at least a first cleanup in the official release of v3.1.0.