CanCOGeN template: Purpose of sequencing tag corrections upon validation

griffie commented 2 years ago

In the CanCOGeN template, data providers sometimes copy and paste from other files in which the purpose of sequencing terms are not capitalized properly or contain extra invisible spaces, e.g. "Screening for Variants of Concern (VoC)" (this is the correct version), "Screening for Variants of Concern (VOC)", "Screening for variants of concern (voc)", etc.

These differences are not being flagged during validation, and they cause errors both upon submission to the Data Portal and whenever the NML queries the data.
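To make the problem concrete, here is a rough sketch (not DataHarmonizer code) of how the variants above compare against the correct term under exact string matching. The function names `clean` and `describe_difference` are made up for illustration, and treating Unicode format characters such as zero-width spaces as "invisible spaces" is an assumption on my part.

```python
import unicodedata

# The correct term as given in this issue.
CORRECT = "Screening for Variants of Concern (VoC)"


def clean(value: str) -> str:
    """Drop invisible format characters and collapse whitespace runs.

    Assumption: "invisible spaces" covers Unicode whitespace (e.g. the
    non-breaking space) and format characters (category "Cf", e.g. the
    zero-width space).
    """
    visible = "".join(ch for ch in value if unicodedata.category(ch) != "Cf")
    # str.split() with no arguments splits on any Unicode whitespace and
    # discards leading/trailing whitespace.
    return " ".join(visible.split())


def describe_difference(value: str, expected: str = CORRECT) -> str:
    """Classify how a submitted value differs from the expected term."""
    if value == expected:
        return "exact match"
    cleaned = clean(value)
    if cleaned == expected:
        return "whitespace only"
    if cleaned.casefold() == expected.casefold():
        if cleaned == value:
            return "capitalization"
        return "capitalization and whitespace"
    return "different term"


if __name__ == "__main__":
    samples = [
        "Screening for Variants of Concern (VoC)",
        "Screening for Variants of Concern (VOC)",
        "Screening for Variants of Concern (VoC)\u00a0",
        " Screening for variants of concern (voc)",
    ]
    for sample in samples:
        # repr() makes otherwise invisible characters visible.
        print(repr(sample), "->", describe_difference(sample))
```

Printing with `repr()` is a quick way to spot non-breaking or zero-width characters that look identical to a normal value in a spreadsheet cell.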

To fix this, upon validation, can values that match our controlled vocabulary (i.e. spelled properly) be corrected so that the capitalization is as prescribed and any rogue spaces are removed? Right now this only needs to happen for the Purpose of Sequencing terms, as it hasn't been flagged as an issue anywhere else.

Thanks!

ddooley commented 2 years ago

We are going to implement a general fix: when users press "Validate", categorical field values that differ from a pulldown menu entry only by capitalization will be quietly normalized to the capitalization present in the menu. (I presume this won't be a problem for any source databases, since the validation is happening as a step towards full CanCOGeN compatibility, not source database compatibility.)
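A minimal sketch of the normalization described above, written in Python for illustration rather than taken from the DataHarmonizer codebase. The names `PURPOSE_CHOICES`, `build_lookup` and `normalize_value` are hypothetical, the second picklist entry is a placeholder, and trimming whitespace is included because the original request asked for it (an assumption about the final behaviour).

```python
# Hypothetical picklist: the first entry comes from this issue, the second
# is a placeholder, not a real template term.
PURPOSE_CHOICES = [
    "Screening for Variants of Concern (VoC)",
    "Example Other Purpose",
]


def build_lookup(choices):
    """Map the case-folded form of each picklist entry to its canonical form."""
    return {choice.casefold(): choice for choice in choices}


def normalize_value(value, lookup):
    """Return the canonical picklist entry if `value` differs from it only by
    capitalization or surrounding/repeated whitespace; otherwise return
    `value` unchanged so that normal validation can still flag it."""
    key = " ".join(value.split()).casefold()
    return lookup.get(key, value)


def normalize_column(values, choices):
    """Normalize every value of one categorical field."""
    lookup = build_lookup(choices)  # built once per field, not once per cell
    return [normalize_value(v, lookup) for v in values]


if __name__ == "__main__":
    column = [
        "Screening for Variants of Concern (VOC)",
        "  screening for variants of concern (voc) ",
        "Screening for Varients of Concern (VoC)",  # misspelled: left as-is
    ]
    for before, after in zip(column, normalize_column(column, PURPOSE_CHOICES)):
        print(repr(before), "->", repr(after))
```

Building the lookup once per field keeps each cell check to a single dictionary access, which should matter for larger datasets. Values that are actually misspelled are deliberately left untouched rather than guessed at.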

ddooley commented 2 years ago

Note that this capitalization normalization applies across all templates.

ddooley commented 2 years ago

This is done and will be tested out on the "validation-and-vocab-update" branch. It should be tested with bigger datasets to ensure performance is ok.

cidgoh / DataHarmonizer

CanCOGeN template: Purpose of sequencing tag corrections upon validation #238