cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License
97 stars 27 forks source link

Refactor validation code and support LinkML rules #420

Closed pkalita-lbl closed 11 months ago

pkalita-lbl commented 12 months ago

Fixes #370

Summary of changes

  1. These changes add a new Validator class. This class is designed to accept a schema (the same JavaScript object representation of a LinkML schema that the DataHarmonizer class currently accepts) and a target class. It provides a method to validate rows of data regardless of where the data came from (it doesn't know anything about Handsontable).

    Generally speaking the way it works is that, as needed, it translates the target class in the schema into a series of validation functions and caches those functions. For example, while validating the first row of a dataset it might build, cache, and execute a function that knows how to validate a string according to the my_id slot. Then when validating the second row it only has to look up the cached function for the my_id slot and execute it against the value in the second row.

    Having this separated out into a standalone class makes it easier to unit test (see tests/Validator.test.js) and extend. In addition to what was previously validated, the Validator class also validates rules on LinkML classes.

  2. In the main DataHarmonizer class, a Validator instance is created in the constructor and whenever the schema changes. When the template changes the Validator instance's target class is updated. The getInvalidCells method now simply defers to the Validator.validate method.

  3. Previously, the DataHarmonizer.getInvalidCells method did a mixture of repairs and validation. The repair operations have been moved to a separate doPreValidationRepairs method and the DataHarmonizer.validate method is resposible for calling that method as well as getInvalidCells.

@ddooley I know this is kind of a big and complex set of changes, so I'd appreciate any testing you can do on this branch. Please let me know if you find anything concerning! Also happy to provide any clarification on the implementation details if needed.

ddooley commented 11 months ago

Great step forward. I will look at this in the next few days (just back from vacation). I like distinction between cleanup and validation passes.

ddooley commented 11 months ago

Aside from some testbed stuff we've now discussed, there was one piece of behaviour I wasn't quite sure of - the PRESENT / ABSENT test . Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?

ddooley commented 11 months ago

Also, a click on row # column cell seems to trigger a javascript crash.

ddooley commented 11 months ago

I think the crash is related to some code expecting column header to have a 'title' field, but in the minimal LinkML template code for validation test we don't necessarily have that for each slot, so should be a simple fix to test for that.

ddooley commented 11 months ago

FYI, I see I accidentally deleted right click cell context menu for 'remove_row', 'row_above', 'row_below', so main branch will get this update, which should merge smoothly with your pull request.

pkalita-lbl commented 11 months ago

Also, a click on row # column cell seems to trigger a javascript crash.

Oh I actually see the same thing happen when running the master branch with the CanCOGeN Covid-19 template. So I don't think it's related to these changes. I can create a new issue for that and maybe address that in a separate PR?

Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?

I have to admit I didn't really intend for the schema in Validator.test.js to be used as a whole -- you can see various test cases pulling out specific slots or groups of slots to test. So it's possible that I took a few shortcuts that make it not entirely self-consistent because of that. But I'll take a look at it.

ddooley commented 11 months ago

Re. : "So I don't think it's related to these changes." - indeed error predates the validation code. I'll fix.

pkalita-lbl commented 11 months ago

I've updated the validator test schema so that it doesn't repurpose slots between rules. Hopefully that makes your testing easier!

ddooley commented 11 months ago

re. "Oh I actually see the same thing happen when running the master" I've now fixed this on master.

ddooley commented 11 months ago

Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.

pkalita-lbl commented 11 months ago

Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.

Oh haha yeah they're supposed to overlap. I don't know why I named it that. Will update.

Edit: I just realized why I named it that. Yes, those ranges overlap, but because it uses exactly_one_of a value in the overlap is not allowed. The name is still a bit confusing, so I'll come up with something better.

Is it perhaps because range of those fields includes two separate sources - one being "date" and the other being a null value menu?

Yeah, this is where using the any_of construct can be a little tricky. When you added min/max values to the sample collection date slot did you do something like this (abridged for clarity)?

"sample collection date": {
  "minimum_value": "2022-01-01",
  "maximum_value": "{today}",
  "any_of": [
    {
      "range": "date"
    },
    {
      "range": "null value menu"
    }
  ]
}

There's a bit of ambiguity here. Is it implied that the min/max constraints should trickle down into both of the any_of conditions? If min/max do, what other slot definition attributes also trickle down? What if the "inherited" attributes don't really make sense for one of the conditions (like this case where min/max on range: "null value menu" doesn't make much sense)? I've chosen to not open that can of worms by not passing anything from the main slot definitions into the any_of definitions. So if you try this instead it should work:

"sample collection date": {
  "any_of": [
    {
      "range": "date"
      "minimum_value": "2022-01-01",
      "maximum_value": "{today}",
    },
    {
      "range": "null value menu"
    }
  ]
}

This feels like clearer modelling to me anyway.

ddooley commented 11 months ago

Great, that's the best solution. I didn't know that was possible with LinkML. I think its good to merge!

pkalita-lbl commented 11 months ago

Awesome! Thanks for all the testing and ideas. I'm going to merge now, but if you run into any unexpected issues or need help modifying any existing schemas to accommodate this please let me know!

ddooley commented 11 months ago

What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.

I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?

pkalita-lbl commented 11 months ago

What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.

The probability is pretty much 100%. There is an open issue about it: https://github.com/linkml/linkml/issues/1384. What I can't say is exactly when that'll happen. It would involve a change to the LinkML metamodel and those don't tend to happen very quickly.

But in the meantime you're not blocked, right? I think you should be able to continue to use the existing todos workaround because those get translated into minimum_value and maximum_value when the template is loaded: https://github.com/cidgoh/DataHarmonizer/blob/d1e477853b04cf099e5b1d0492f473983552640c/lib/DataHarmonizer.js#L362-L372

EDIT: Ah I just realized that translation of the todos is not happening at the right time for the Validator to pick it up. I'll make another update to make sure the existing todos continue to work.

I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?

Uh oh, somehow I missed that was a feature of the previous validation routine. I don't believe there's any way to specify a rule like that in LinkML. I may need to go back and add that to the new Validator class. Is there an existing schema that shows how you're currently expressing that?

ddooley commented 11 months ago

So at moment in our 10+ templates, there's only a few instances of our using "{today}", and a few instances of using the following to get one field's date minimum set in response to another's entry.

"sequencing date": {
  ...
  "todos": [
    ">={sample collection date}"
  ],

If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule. re. fields with 2 ranges, all our min and max tests pertain to the first range list datatype, so I think it would be smooth for the script to add the generally stated min/max if any, to it, in the case where there is more than 1 range. I guess this can be done browser side.

The "{today}" syntax is pretty special, since it doesn't point to another field. if your retrofitting via the "todos" annotation can make that work great, otherwise I'll take a crack at it.

pkalita-lbl commented 11 months ago

If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule

No, it cannot. There's nothing in LinkML currently that can express "the value of this slot need to be greater/lesser than the value of another slot".

I will update the Validator class to handle slot todos convention that is currently in your schemas (both the {today} special value and referencing other slots).

pkalita-lbl commented 11 months ago

PR to support existing conventions here: https://github.com/cidgoh/DataHarmonizer/pull/426