Closed pkalita-lbl closed 11 months ago
Great step forward. I will look at this in the next few days (just back from vacation). I like distinction between cleanup and validation passes.
Aside from some testbed stuff we've now discussed, there was one piece of behaviour I wasn't quite sure of - the PRESENT / ABSENT test . Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?
Also, a click on row # column cell seems to trigger a javascript crash.
I think the crash is related to some code expecting column header to have a 'title' field, but in the minimal LinkML template code for validation test we don't necessarily have that for each slot, so should be a simple fix to test for that.
FYI, I see I accidentally deleted right click cell context menu for 'remove_row', 'row_above', 'row_below', so main branch will get this update, which should merge smoothly with your pull request.
Also, a click on row # column cell seems to trigger a javascript crash.
Oh I actually see the same thing happen when running the master
branch with the CanCOGeN Covid-19 template. So I don't think it's related to these changes. I can create a new issue for that and maybe address that in a separate PR?
Both a_big_number and a_small_number seem to be getting set to required when there's no value in present_or_absent_string, though logic seems to indicate that only a_small_number would be marked required in that case?
I have to admit I didn't really intend for the schema in Validator.test.js
to be used as a whole -- you can see various test cases pulling out specific slots or groups of slots to test. So it's possible that I took a few shortcuts that make it not entirely self-consistent because of that. But I'll take a look at it.
Re. : "So I don't think it's related to these changes." - indeed error predates the validation code. I'll fix.
I've updated the validator test schema so that it doesn't repurpose slots between rules. Hopefully that makes your testing easier!
re. "Oh I actually see the same thing happen when running the master" I've now fixed this on master.
Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.
Small q: is non_overlapping_intervals" supposed to have ranges 0-20, 10-30 ? The ranges overlap.
Oh haha yeah they're supposed to overlap. I don't know why I named it that. Will update.
Edit: I just realized why I named it that. Yes, those ranges overlap, but because it uses exactly_one_of
a value in the overlap is not allowed. The name is still a bit confusing, so I'll come up with something better.
Is it perhaps because range of those fields includes two separate sources - one being "date" and the other being a null value menu?
Yeah, this is where using the any_of
construct can be a little tricky. When you added min/max values to the sample collection date
slot did you do something like this (abridged for clarity)?
"sample collection date": {
"minimum_value": "2022-01-01",
"maximum_value": "{today}",
"any_of": [
{
"range": "date"
},
{
"range": "null value menu"
}
]
}
There's a bit of ambiguity here. Is it implied that the min/max constraints should trickle down into both of the any_of
conditions? If min/max do, what other slot definition attributes also trickle down? What if the "inherited" attributes don't really make sense for one of the conditions (like this case where min/max on range: "null value menu"
doesn't make much sense)? I've chosen to not open that can of worms by not passing anything from the main slot definitions into the any_of
definitions. So if you try this instead it should work:
"sample collection date": {
"any_of": [
{
"range": "date"
"minimum_value": "2022-01-01",
"maximum_value": "{today}",
},
{
"range": "null value menu"
}
]
}
This feels like clearer modelling to me anyway.
Great, that's the best solution. I didn't know that was possible with LinkML. I think its good to merge!
Awesome! Thanks for all the testing and ideas. I'm going to merge now, but if you run into any unexpected issues or need help modifying any existing schemas to accommodate this please let me know!
What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.
I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?
What do you think the probability of getting minimum_value and maximum_value to be range dependent in LinkML, rather than just integer? That becomes the one remaining priority.
The probability is pretty much 100%. There is an open issue about it: https://github.com/linkml/linkml/issues/1384. What I can't say is exactly when that'll happen. It would involve a change to the LinkML metamodel and those don't tend to happen very quickly.
But in the meantime you're not blocked, right? I think you should be able to continue to use the existing todos
workaround because those get translated into minimum_value
and maximum_value
when the template is loaded: https://github.com/cidgoh/DataHarmonizer/blob/d1e477853b04cf099e5b1d0492f473983552640c/lib/DataHarmonizer.js#L362-L372
EDIT: Ah I just realized that translation of the todos
is not happening at the right time for the Validator
to pick it up. I'll make another update to make sure the existing todos
continue to work.
I guess for more complex conditionals like "{sequencing date} minimum_value={collection date}" that these end up being stand-alone rules, rather than with some kind of shortcut like "minimum_value = {collection date}" on the range of {sequencing date} ?
Uh oh, somehow I missed that was a feature of the previous validation routine. I don't believe there's any way to specify a rule like that in LinkML. I may need to go back and add that to the new Validator
class. Is there an existing schema that shows how you're currently expressing that?
So at moment in our 10+ templates, there's only a few instances of our using "{today}", and a few instances of using the following to get one field's date minimum set in response to another's entry.
"sequencing date": {
...
"todos": [
">={sample collection date}"
],
If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule. re. fields with 2 ranges, all our min and max tests pertain to the first range list datatype, so I think it would be smooth for the script to add the generally stated min/max if any, to it, in the case where there is more than 1 range. I guess this can be done browser side.
The "{today}" syntax is pretty special, since it doesn't point to another field. if your retrofitting via the "todos" annotation can make that work great, otherwise I'll take a crack at it.
If a rule can be expressed to do the same as above then we can just switch our current templates to use that kind of rule
No, it cannot. There's nothing in LinkML currently that can express "the value of this slot need to be greater/lesser than the value of another slot".
I will update the Validator
class to handle slot todos
convention that is currently in your schemas (both the {today}
special value and referencing other slots).
PR to support existing conventions here: https://github.com/cidgoh/DataHarmonizer/pull/426
Fixes #370
Summary of changes
These changes add a new
Validator
class. This class is designed to accept a schema (the same JavaScript object representation of a LinkML schema that theDataHarmonizer
class currently accepts) and a target class. It provides a method to validate rows of data regardless of where the data came from (it doesn't know anything about Handsontable).Generally speaking the way it works is that, as needed, it translates the target class in the schema into a series of validation functions and caches those functions. For example, while validating the first row of a dataset it might build, cache, and execute a function that knows how to validate a string according to the
my_id
slot. Then when validating the second row it only has to look up the cached function for themy_id
slot and execute it against the value in the second row.Having this separated out into a standalone class makes it easier to unit test (see
tests/Validator.test.js
) and extend. In addition to what was previously validated, theValidator
class also validatesrules
on LinkML classes.In the main
DataHarmonizer
class, aValidator
instance is created in the constructor and whenever the schema changes. When the template changes theValidator
instance's target class is updated. ThegetInvalidCells
method now simply defers to theValidator.validate
method.Previously, the
DataHarmonizer.getInvalidCells
method did a mixture of repairs and validation. The repair operations have been moved to a separatedoPreValidationRepairs
method and theDataHarmonizer.validate
method is resposible for calling that method as well asgetInvalidCells
.@ddooley I know this is kind of a big and complex set of changes, so I'd appreciate any testing you can do on this branch. Please let me know if you find anything concerning! Also happy to provide any clarification on the implementation details if needed.