cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

UX of changing ontology IDs #277

Open cmrn-rhi opened 2 years ago

cmrn-rhi commented 2 years ago

As we deprecate temporarily minted GENEPIO terms in favour of the preferred domain ontology terms, users will run into validation/consistency errors if they don't manually update to the new IDs (which they may not realize they have to do).

We need to facilitate updating IDs in a dataset, be it via DataHarmonizer or GEEM, taking advantage of the "term replaced by" operation.

ddooley commented 2 years ago

One helpful file in that regard is a separate term deprecation file containing all the deprecated terms, each with a reference to its replacement term. This can be an import file into the main ontology. This has been set up for FoodOn but not for GENEPIO yet. Also, we could have a tabular version of this file for easy SQL queries or other database update use.
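As a rough illustration of the "tabular version for SQL" idea, here is a minimal sketch using Python's built-in `sqlite3`. The table names (`term_deprecation`, `samples`), the column names, and the `EX:` term IDs are all placeholders invented for the example; they are not taken from FoodOn, GENEPIO, or DataHarmonizer.

```python
import sqlite3


def apply_deprecations(conn: sqlite3.Connection) -> int:
    """Rewrite deprecated term IDs in `samples` using `term_deprecation`.

    Both table names and their columns are hypothetical.
    Returns the number of rows that were updated.
    """
    cur = conn.execute(
        """
        UPDATE samples
        SET term_id = (
            SELECT d.replaced_by
            FROM term_deprecation AS d
            WHERE d.deprecated_id = samples.term_id
        )
        WHERE term_id IN (SELECT deprecated_id FROM term_deprecation)
        """
    )
    conn.commit()
    return cur.rowcount


# Tiny in-memory demonstration with placeholder IDs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE term_deprecation (
        deprecated_id TEXT PRIMARY KEY,
        replaced_by   TEXT NOT NULL
    );
    CREATE TABLE samples (sample_id TEXT, term_id TEXT);
    INSERT INTO term_deprecation VALUES ('EX:0000001', 'EX:0000101');
    INSERT INTO samples VALUES ('s1', 'EX:0000001'), ('s2', 'EX:0000002');
    """
)
updated = apply_deprecations(conn)
rows = conn.execute(
    "SELECT sample_id, term_id FROM samples ORDER BY sample_id"
).fetchall()
```

Only rows whose `term_id` appears in the deprecation table are touched, so current IDs pass through unchanged.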

ddooley commented 9 months ago

This suggests a feature for dataset management, perhaps via DataHarmonizer: an option to supply DataHarmonizer with a list of mappings from deprecated to replacement values, so that DataHarmonizer could do the conversion on any given dataset. It also suggests a single pooled conversion table resource (assuming everyone agrees on appropriate replacements).
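The conversion step itself could look something like the sketch below. It is written in Python for illustration only and is not DataHarmonizer code; the function names, the row-as-dict representation, and the `EX:` IDs are assumptions. It also assumes every cell value is a string, and it follows chains of replacements (a term replaced by one that was itself later replaced).

```python
from typing import Dict, List


def resolve(value: str, replacements: Dict[str, str]) -> str:
    """Follow replaced-by links until a value with no replacement is reached."""
    seen = set()
    while value in replacements:
        if value in seen:
            raise ValueError(f"cyclic replacement involving {value!r}")
        seen.add(value)
        value = replacements[value]
    return value


def convert_dataset(
    rows: List[Dict[str, str]], replacements: Dict[str, str]
) -> List[Dict[str, str]]:
    """Return a copy of the dataset with deprecated cell values replaced."""
    return [
        {field: resolve(cell, replacements) for field, cell in row.items()}
        for row in rows
    ]


# Placeholder mapping: EX:1 was replaced by EX:2, which was replaced by EX:3.
replacements = {"EX:1": "EX:2", "EX:2": "EX:3"}
dataset = [{"sample": "s1", "organism": "EX:1"}]
converted = convert_dataset(dataset, replacements)
```

Raising on a cycle rather than looping forever is a design choice for the sketch; a pooled table maintained by several groups is where such mistakes would most likely creep in.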

cmrn-rhi commented 9 months ago

> One helpful file in that regard is a separate term deprecation file containing all the deprecated terms, each with a reference to its replacement term. This can be an import file into the main ontology. This has been set up for FoodOn but not for GENEPIO yet. Also, we could have a tabular version of this file for easy SQL queries or other database update use.

Yeah, we currently have a deprecation import but it's only for deprecations that were pulled off of ROBOT imports. So if we extract the integrated deprecations then we can just merge and manage them all on a separate deprecation import.

> This suggests a feature for dataset management, perhaps via DataHarmonizer: an option to supply DataHarmonizer with a list of mappings from deprecated to replacement values, so that DataHarmonizer could do the conversion on any given dataset. It also suggests a single pooled conversion table resource (assuming everyone agrees on appropriate replacements).

Yes, this would be great. My recollection is that, so far, replacements have applied across all specifications; we haven't had a case where a replacement occurred for one specification but not another.

cmrn-rhi commented 9 months ago

So this is a joint GENEPIO / DataHarmonizer (or pathogen-genomics-package) issue, with the latter being the need for the conversion table. I can add the generation of the conversion table from GENEPIO scripts as part of the specification ontologizing script, but it would be helpful to know what the format of the conversion table should/might be.
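To make the format question concrete, one possible shape is a four-column TSV. This is only a strawman: the column names below, the helper functions, and the `EX:` IDs and labels are all made up for discussion and do not reflect any agreed format or existing GENEPIO script.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# Hypothetical column set for the conversion table.
COLUMNS = ["deprecated_id", "deprecated_label", "replaced_by_id", "replaced_by_label"]


def write_conversion_table(records: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write deprecation records as a tab-separated table with a header row."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record)


def read_conversion_table(src: TextIO) -> Dict[str, str]:
    """Load the table back as a deprecated-ID to replacement-ID mapping."""
    reader = csv.DictReader(src, delimiter="\t")
    return {row["deprecated_id"]: row["replaced_by_id"] for row in reader}


# Round trip with one placeholder record.
records = [
    {
        "deprecated_id": "EX:0000001",
        "deprecated_label": "old example term",
        "replaced_by_id": "EX:0000101",
        "replaced_by_label": "new example term",
    }
]
buffer = io.StringIO()
write_conversion_table(records, buffer)
tsv_text = buffer.getvalue()
mapping = read_conversion_table(io.StringIO(tsv_text))
```

Carrying labels alongside IDs would let a tool update datasets that store either form, at the cost of a wider table.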