Open ptth222 opened 6 months ago
TODO: change the test from: self.assertTrue("Error found when validating ISA tab: {}".format(report['errors'])) to: self.fail("Error found when validating ISA tab: {}".format(report['errors']))
and fix the test files in response to errors detected.
A few of the PRs I still have open and am working on have all ran into this issue, and I need some input from you guys to resolve it.
In tests/validators/test_validate_test_data.py there are tests like:
In #551 though many of them changed to:
I don't think this was a wise change. The tests changed from testing that the data had no errors, to testing that it does have errors, and doesn't specify the errors. I think a better way to handle this is either to fix the errors in that data or explicitly look for the errors in that data, so you can know if changes are adding or subtracting errors in the future. The reason this is causing me issues for my PRs is because I am trying to fix some of the validations so they are inline with other parts of the code and examples, but the code and examples don't align with each other.
If we could hash out a few things about what is correct I can change the validation and test data to be correct and restore these tests to how they used to be.
This has a few issues:
preprocess
function in ProcessSequenceFactory.py warns about this.Questions about these issues:
ProcessSequenceFactory
to error, it adds the columns with "unknown" for the value, I think this should still be at least a warning in validation. I don't fully understand what is going on here. Was there a time where the "Name" columns was all that was necessary and you didn't need to specify the protocol? It would be nice to have some consensus about this and make the code and examples align with that.ProcessSequenceFactory
handles "Factor Values" in a unique way. It splits columns into object groups and then loops over the object groups to parse them into model objects. For things like "Characteristics" and "Parameters" it only looks for them within the object group, but for "Factor Values" it will look for them within the entire data frame. This behavior assumes or suggests that there can only be 1 "Sample Name" column since it adds all factors in the entire file every time it finds a "Sample Name" column. This is correct for assays, but can studies only have 1 "Sample Name" column? ThisProcessSequenceFactory
also suggests that "Factor Value" columns can be anywhere, but this makes validation more difficult. I would recommend requiring "Factor Values" to be after the "Sample Name" column just like "Characteristics" to make things more consistent.I have also come across some other issues while investigating this that aren't illustrated in the example.
==
instead ofstartswith
like "Protocol REF" columns are.get_value
used byProcessSequenceFactory
requires a specific order, but is it supposed to be that way? Also all of the columns must be present, you can't just have a "Unit" column for instance. I have written some preliminary code that changes both of these conditions. It would make it so if just "Unit" is present it still gets added and if they are in any order it still gets created.If it is easier to meet and discuss this I am available.