A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License
New command-line dh-validate.py tool for validating csv, tsv, xls, xlsx data files against a schema.yaml file #445
A new command-line dh-validate.py script simplifies the validation of DataHarmonizer-generated csv,tsv,xls,xlsx files. We look forward to feedback on using this below.
The linkml-validate command works well for .json and .yaml data, but tabular csv, tsv, xls, and xlsx inputs often fail to validate, for two main reasons. dh-validate.py resolves both by generating a temporary .yaml version of the tabular input, adjusted according to the given schema, and then passing that file to linkml-validate for processing. The following adjustments are made:
Column labels in DataHarmonizer data files are usually the slot/field titles rather than the LinkML standard (codewriter-compatible) slot names. The script maps each column label to the appropriate slot name in the temporary .yaml file.
Multivalued slots/fields in such data files (from multi-select menus or a combination of menus and/or a string or other input element) get their values converted into an array of values in the temporary .yaml file. The semicolon and vertical bar delimiters (";|") are observed here.
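The two adjustments above can be sketched in a few lines of Python. This is a minimal illustration and not the dh-validate.py implementation: the SLOTS dictionary is a hypothetical, hand-written stand-in for the slots section of a schema.yaml, and the function name tabular_to_records is likewise an assumption made for this example.

```python
import csv
import io
import re

# Hypothetical stand-in for the "slots" section of a schema.yaml;
# the slot names and titles here are illustrative only.
SLOTS = {
    "specimen_collector_sample_id": {"title": "specimen collector sample ID"},
    "purpose_of_sampling": {"title": "purpose of sampling", "multivalued": True},
}

# Split on ";" or "|", the two delimiters described above.
DELIMITERS = re.compile(r"[;|]")


def tabular_to_records(text, slots, delimiter="\t"):
    """Convert title-headed tabular text into records keyed by slot name."""
    title_to_name = {spec.get("title", name): name for name, spec in slots.items()}
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        record = {}
        for column, value in row.items():
            if column is None or value is None or value.strip() == "":
                continue  # skip overflow cells and empty cells
            # Adjustment 1: column title -> slot name (unknown columns pass through).
            name = title_to_name.get(column, column)
            if slots.get(name, {}).get("multivalued"):
                # Adjustment 2: delimited string -> array of values.
                record[name] = [v.strip() for v in DELIMITERS.split(value) if v.strip()]
            else:
                record[name] = value.strip()
        records.append(record)
    return records


tsv = (
    "specimen collector sample ID\tpurpose of sampling\n"
    "S-001\tSurveillance; Research\n"
    "S-002\tDiagnostic testing|Research\n"
)
records = tabular_to_records(tsv, SLOTS)
```

The resulting list of dictionaries is the kind of structure that could then be written out as a temporary .yaml file and handed to linkml-validate.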
One "gotcha" that takes some explaining: dh-validate.py requires that picklist enumerations (enums) in the given schema be named according to LinkML standard naming practice. linkml-validate renames any schema slot or enumeration menu names that do not follow LinkML standard naming into its own version of standard names. We have added a conversion to ensure that the temporary .yaml file contains the linkml-validate compatible rename of each field. However, if a field mentions an enum in its range, linkml-validate also renames that reference into standard form, but it does not rename the enum everywhere it occurs in the schema itself. As a result, linkml-validate fails with a long error beginning: "jsonschema.exceptions._WrappedReferencingError: PointerToNowhere: '/$defs/GeoLocName(state/province/territory)Menu' does not exist within {'$schema' ... " Since we don't want to revise the given schema.yaml, we have to insist that the schema's enums are standard-named.
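A quick pre-check for this problem might look like the sketch below. It assumes that a "standard" enum name consists, at minimum, of letters, digits, and underscores only; LinkML's actual naming rule may be stricter, so treat the pattern as an approximation. The function name nonstandard_enum_names and the example schema dictionary (including the GeoLocNameCountryMenu entry) are hypothetical.

```python
import re

# Assumption: a standard-named enum uses only letters, digits, and underscores.
# LinkML's real naming convention may impose further constraints.
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def nonstandard_enum_names(schema):
    """Return the enum names in a loaded schema dict that fail the pattern."""
    return sorted(
        name for name in schema.get("enums", {}) if not IDENTIFIER.match(name)
    )


# Hypothetical schema fragment; the second name is taken from the error above.
schema = {
    "enums": {
        "GeoLocNameCountryMenu": {},
        "GeoLocName(state/province/territory)Menu": {},
    }
}
problems = nonstandard_enum_names(schema)
```

Any name reported here would need to be renamed in the schema before validation is attempted.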
We will be evolving this script to report any mismatched columns/fields, which would, for example, make it easier to validate older tabular data against a newer LinkML schema version.