cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License
97 stars, 27 forks

New command line dh-validator.py tool for validating csv, tsv, xls, xlsx data files against a schema.yaml file #445

Closed by ddooley 1 month ago

ddooley commented 2 months ago

A new command-line dh-validate.py script simplifies validation of DataHarmonizer-generated csv, tsv, xls, and xlsx files. We look forward to feedback on using it below.

The linkml-validate command works well for data in .json or .yaml format, but tabular csv, tsv, xls, and xlsx inputs often don't validate well, for two main reasons. dh-validator.py resolves both by generating a temporary .yaml version of the tabular input, with the necessary adjustments made according to the given schema, and then sends that file to linkml-validate for processing. The following adjustments are made:

We will be evolving this script to report any mismatched columns/fields. This would help, for example, when validating older tabular data against a newer LinkML schema version.
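To make the idea of a mismatched columns/fields report concrete, here is a minimal sketch of what such a check could look like. It is not the script's actual implementation: the function name `report_mismatched_columns`, the sample data, and the slot names are all hypothetical, and it assumes the tabular header row holds schema slot names directly and that the slot names have already been read from the schema.yaml file.

```python
import csv
import io


def report_mismatched_columns(header, schema_slots):
    """Compare tabular column names with schema slot names.

    Hypothetical helper, not part of dh-validator.py. Returns the columns
    that the schema does not define and the slots that the data lacks.
    """
    # Ignore empty header cells and surrounding whitespace.
    columns = [name.strip() for name in header if name and name.strip()]
    slots = set(schema_slots)
    present = set(columns)
    return {
        "not_in_schema": [c for c in columns if c not in slots],
        "not_in_data": sorted(slots - present),
    }


# Hypothetical example: an older file uses "host", while the newer schema
# version is assumed to have renamed that slot to "host_common_name".
sample_tsv = "sample_id\tcollection_date\thost\nS1\t2021-03-04\tHuman\n"
header = next(csv.reader(io.StringIO(sample_tsv), delimiter="\t"))
schema_slots = ["sample_id", "collection_date", "host_common_name"]

report = report_mismatched_columns(header, schema_slots)
print(report)
# {'not_in_schema': ['host'], 'not_in_data': ['host_common_name']}
```

Reporting both directions separately would let a user see which old columns need renaming or dropping and which newer fields are simply absent from the older data.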