cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

CanCOGeN template: Purpose of sequencing tag corrections upon validation #238

Closed griffie closed 2 years ago

griffie commented 2 years ago

In the CanCOGeN template, data providers sometimes copy and paste from other files in which the purpose of sequencing terms are not capitalized properly or contain extra invisible spaces, e.g.

Screening for Variants of Concern (VoC) (this is the correct version)
Screening for Variants of Concern (VOC)
Screening for variants of concern (voc)
etc.

These differences are not being flagged during validation, and they cause errors both upon submission to the Data Portal and whenever the NML queries the data.

To fix this, upon validation, can values that match our controlled vocab terms (i.e. spelled properly) be corrected so that the capitalization is as prescribed and any rogue spaces are removed? Right now this only needs to happen for the Purpose of Sequencing terms, as it hasn't been flagged as an issue anywhere else.

Thanks!
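For illustration only, the correction requested above could look roughly like the following. This is a minimal Python sketch, not DataHarmonizer code: `normalize_term` and `PURPOSE_TERMS` are hypothetical names, and the one-term vocabulary is taken from the example in this issue.

```python
def normalize_term(value, allowed_terms):
    """Return the canonical spelling of value, or None if it matches no term."""

    def key(text):
        # str.split() with no argument splits on any Unicode whitespace
        # (including non-breaking spaces) and drops leading/trailing runs,
        # so re-joining collapses stray spaces before the case-insensitive match.
        return " ".join(text.split()).casefold()

    lookup = {key(term): term for term in allowed_terms}
    return lookup.get(key(value))


# Hypothetical one-term vocabulary, for illustration.
PURPOSE_TERMS = ["Screening for Variants of Concern (VoC)"]

print(normalize_term("  screening for variants of concern (voc)\u00a0", PURPOSE_TERMS))
```

A value that is genuinely misspelled returns `None` here, so it would still need to be flagged by validation rather than silently changed.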

ddooley commented 2 years ago

We are going to implement a general fix: when users press "Validate", categorical field values that differ from a pulldown menu option only by capitalization will be quietly normalized to the capitalization used in the menu. (I presume this won't be a problem for any source databases, since the validation is happening as a step towards full CanCOGeN compatibility, not source database compatibility.)
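As a rough sketch of what such a validate-time pass might do (written in Python for brevity; the actual fix lives in the DataHarmonizer code base and may differ), assume rows are dictionaries and each categorical field has a list of menu options. The names `normalize_categorical_fields`, `rows` and `menus` are assumptions made for this example.

```python
def normalize_categorical_fields(rows, menus):
    """Rewrite case-only mismatches in place; return the number of cells changed.

    rows  -- list of dicts mapping field name to cell value
    menus -- dict mapping categorical field name to its list of menu options
    """
    # Build one case-insensitive lookup per field, once, before scanning rows.
    lookups = {
        field: {option.casefold(): option for option in options}
        for field, options in menus.items()
    }
    changed = 0
    for row in rows:
        for field, lookup in lookups.items():
            value = row.get(field)
            if not isinstance(value, str):
                continue
            canonical = lookup.get(value.casefold())
            # Only touch cells that match a menu option apart from case.
            if canonical is not None and canonical != value:
                row[field] = canonical
                changed += 1
    return changed


# Hypothetical data, for illustration.
menus = {"purpose of sequencing": ["Screening for Variants of Concern (VoC)"]}
rows = [
    {"purpose of sequencing": "Screening for variants of concern (voc)"},
    {"purpose of sequencing": "Not a menu option"},
]
print(normalize_categorical_fields(rows, menus), rows)
```

Values that match no menu option are left untouched, so ordinary validation can still report them.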

ddooley commented 2 years ago

Note that this capitalization normalization applies across all templates.

ddooley commented 2 years ago

This is done and will be tested on the "validation-and-vocab-update" branch. It should also be tested with bigger datasets to ensure performance is OK.
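One way to get a first feel for the cost on larger inputs is to time a case-insensitive lookup over synthetic data. The Python sketch below is only an illustration under stated assumptions: `make_values` and `normalize_all` are hypothetical helpers, the two menu options are illustrative, and timings in Python say nothing definite about the browser-side JavaScript.

```python
import random
import time


def make_values(n, options, seed=0):
    """Generate n values, each a random case variant of a random menu option."""
    rng = random.Random(seed)
    variants = [str.upper, str.lower, lambda s: s]
    return [rng.choice(variants)(rng.choice(options)) for _ in range(n)]


def normalize_all(values, options):
    """Map each value to its menu capitalization; unmatched values pass through."""
    lookup = {option.casefold(): option for option in options}
    return [lookup.get(value.casefold(), value) for value in values]


# Illustrative menu options and dataset size.
options = [
    "Screening for Variants of Concern (VoC)",
    "Baseline surveillance (random sampling)",
]
values = make_values(100_000, options)

start = time.perf_counter()
result = normalize_all(values, options)
elapsed = time.perf_counter() - start
print(f"normalized {len(result)} values in {elapsed:.3f} s")
```

Because the lookup table is built once and each value costs a single dictionary access, the work grows linearly with the number of cells.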