cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

LinkML slot attributes for driving validations #300

Closed turbomam closed 2 years ago

turbomam commented 2 years ago

See also #267

This snippet shows how to list all of the attributes that are applicable to LinkML slots:

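from linkml_runtime import SchemaView

# Load the LinkML metamodel itself so that its own classes can be inspected.
meta_url = "https://raw.githubusercontent.com/linkml/linkml-model/main/linkml_model/model/schema/meta.yaml"
meta_view = SchemaView(meta_url)

# Collect every slot that applies to slot_definition, including inherited ones.
induced_slots = meta_view.class_induced_slots('slot_definition')
for slot in induced_slots:
    print(slot.name)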

Here are some that I think should be used for validations:

| slot_attribute | notes |
| --- | --- |
| maximum_value | works with the NMDC templates now that @pkalita-lbl PR'ed a float caster in #299. |
| minimum_value | works with the NMDC templates now that @pkalita-lbl PR'ed a float caster in #299. |
| multivalued | todo, along with min and max cardinality? "\|" shouldn't appear in non-multivalued columns? |
| identifier | I think that this is being acted upon, like for MIxS' source_mat_id, which NMDC entitles XXX. Makes the column required and enforces uniqueness. Can only be applied to one column (i.e. one attribute per class). We need some other way to express that other columns should take unique values. |
| pattern | works, as composed regular expressions |
| range | I don't think any action is taken on ranges on their own. NMDC has data and code for matching ranges to (regular expression) patterns within the LinkML schema. Is DataHarmonizer still validating based on xsd types in the linkml-datastructure branch, the way it does in the main branch? If so, we should make sure common LinkML classes and types are related to the xsd types. |
| string_serialization | NMDC has data and code for matching composed string_serializations to (regular expression) patterns, but it's really a misuse of string_serialization, which is meant to generate strings based on a template of attribute names. We should be doing this through structured_patterns instead. Furthermore, structured_patterns should take advantage of pre-composed chunks from LinkML settings. Is there a desire for any of this to happen in real time within DataHarmonizer? We will need to build up a library of settings and expansions, along with some understanding of the MIxS grammar, especially the use of ; and \| |
| structured_pattern | @sujaypatil96 and others are working on the expansion of structured_patterns. See string_serialization above. |
| required | works |
| id_prefixes | todo? Values in columns with id_prefixes would have to begin with one of the prefixes, then a colon, then some local portion. |
| maximum_cardinality | todo, along with multivalued? |
| minimum_cardinality | todo, along with multivalued? |
| ifabsent | todo? How would this relate to DataHarmonizer's mechanisms for default values? |
| range_expression | todo for complex cases? |
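
To make a few of the todo rows more concrete, here is a minimal sketch of how multivalued, minimum_cardinality/maximum_cardinality, id_prefixes and minimum_value/maximum_value could be checked for a single cell. It is not DataHarmonizer's implementation and does not use the LinkML runtime. `SlotRules` and `validate_cell` are hypothetical names, the "|" delimiter is taken from the multivalued row above, and the `GOLD`/`IGSN` prefixes and the `depth` slot in the usage lines are made-up examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SlotRules:
    """Hypothetical stand-in for the few LinkML slot attributes used below."""
    name: str
    multivalued: bool = False
    minimum_cardinality: Optional[int] = None
    maximum_cardinality: Optional[int] = None
    id_prefixes: List[str] = field(default_factory=list)
    minimum_value: Optional[float] = None
    maximum_value: Optional[float] = None


def validate_cell(raw: str, rules: SlotRules, delimiter: str = "|") -> List[str]:
    """Return a list of problems found in one spreadsheet cell (empty if none)."""
    problems: List[str] = []

    if rules.multivalued:
        # Split on the delimiter and count the non-empty values.
        values = [v.strip() for v in raw.split(delimiter) if v.strip()]
        n = len(values)
        if rules.minimum_cardinality is not None and n < rules.minimum_cardinality:
            problems.append(f"{rules.name}: expected at least {rules.minimum_cardinality} value(s), got {n}")
        if rules.maximum_cardinality is not None and n > rules.maximum_cardinality:
            problems.append(f"{rules.name}: expected at most {rules.maximum_cardinality} value(s), got {n}")
    else:
        # The delimiter should not appear in a single-valued column.
        if delimiter in raw:
            problems.append(f"{rules.name}: delimiter {delimiter!r} in a single-valued column")
        values = [raw.strip()] if raw.strip() else []

    check_range = rules.minimum_value is not None or rules.maximum_value is not None
    for value in values:
        if rules.id_prefixes:
            # A value must look like <allowed prefix>:<non-empty local part>.
            prefix, sep, local = value.partition(":")
            if not (sep and local and prefix in rules.id_prefixes):
                problems.append(f"{rules.name}: {value!r} lacks an allowed prefix")
        if check_range:
            try:
                number = float(value)
            except ValueError:
                problems.append(f"{rules.name}: {value!r} is not numeric")
                continue
            if rules.minimum_value is not None and number < rules.minimum_value:
                problems.append(f"{rules.name}: {value!r} is below {rules.minimum_value}")
            if rules.maximum_value is not None and number > rules.maximum_value:
                problems.append(f"{rules.name}: {value!r} is above {rules.maximum_value}")
    return problems


# Usage with made-up rules; real values would come from the schema.
ids = SlotRules(name="source_mat_id", id_prefixes=["GOLD", "IGSN"])
depth = SlotRules(name="depth", minimum_value=0, maximum_value=11000)
print(validate_cell("IGSN:AU1243", ids))
print(validate_cell("-3", depth))
```

Returning a list of messages rather than raising on the first failure is a design choice for this sketch, so that one cell can report several problems at once (for example a stray "|" and a non-numeric value).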
ddooley commented 2 years ago

Latest WIP-validation branch, which will be merged into linkml-datastructure, adds:

I can see we'll have to carefully go through each of these items!

ddooley commented 2 years ago

Discussed June 21. Will create separate issues for outstanding functionality required.