Closed juettemann closed 4 years ago
@juettemann for what xml?
@juettemann see https://github.com/IHEC/ihec-ecosystems/blob/feb2020/version_metadata/ihec_validator_base.py#L34-L47 and https://github.com/IHEC/ihec-ecosystems/blob/feb2020/version_metadata/ihec_validator_base.py#L14-L20 . We only track the latest schema the xml validates against.
We can track all schemas and not just latest, but if there are multiple valid ones, which one do we put in the xml we write?
If 1.1 validates then we don't try 1.0 right now, since if both validate then we don't know what to put in the xml.
head of feb2020 now runs all but only keeps first.
$ python -m $m -sample -out:$m/examples/samples.versioned.xml $m/examples/sample_1.0_1.1.xml -dbg
...
# xml validates [against:schemas/xml/SRA.sample.xsd]... True [version_metadata/examples/samples.versioned.xml]
ok
{
"version_metadata/examples/sample_1.0_1.1.xml": [
{
"ENCBS054KUO": {
"errors": [],
"ok": true,
"version": "1.1"
}
},
{
"ENCBS054KUO": {
"errors": [],
"ok": true,
"version": "1.0"
}
}
]
}
@sitag I saved this output from an old run, it was meant to be conceptual, showing one successful and one failed run. I'll do more test with incomplete files and add them to the examples. I agree that we should only keep the latest version.
@juettemann but that output is not correct. if 1.1 fails then 1.0 will be tested as well, so if you have fails you have all the versions. i am not sure what is the expected outcome from this issue? is there a change you want?
@sitag OK, I misunderstood/misread. It all sounds good to me. I agree to only keep the highest successful version. Your example above still shows both Following your description, I would expect something like this:
Successful run 1.1, 1.0 skipped:
{
"version_metadata/examples/sample_1.0_1.1.xml": [
{
"ENCBS054KUO": {
"errors": [],
"ok": true,
"version": "1.1"
}
},
]
}
Successful 1.0, failed 1.1:
{
"version_metadata/examples/sample_1.0_1.1.xml": [
{
"ENCBS054KUO": {
"errors": [
"missing",
[
"experiment_ontology_curie",
]
],
"ok": false,
"version": "1.1"
}
},
{
"ENCBS054KUO": {
"errors": [],
"ok": true,
"version": "1.0"
}
}
]
}
Failed 1.0, failed 1.1
{
"version_metadata/examples/sample_1.0_1.1.xml": [
{
"ENCBS054KUO": {
"errors": [
"missing",
[
"experiment_ontology_curie",
]
],
"ok": false,
"version": "1.1"
}
},
{
"ENCBS054KUO": {
"errors": [
"missing",
[
"experiment_ontology_uri",
]
],
"ok": false,
"version": "1.0"
}
}
]
}
the head of feb2020
now adjust for @zxenia 's fixes and reverts to only checking till first pass. @juettemann if you can provide separate xmls with these cases then we can hook that into the testbench.
XMLs are in a separate PR: https://github.com/IHEC/ihec-ecosystems/pull/96 I am not married to any of the ideas, happy to discuss changes.
Current output (I removed the 1.1 one)
"/homes/juettema/src/ihec-ecosystems/version_metadata/examples/sample-type_errors.xml": [
{
"ENCBS188BKX": {
"error_type": "jsonschema",
"errors": [
"__total_errors__:4",
"#__validation_error_in__: ENCBS188BKX \n\n# donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
"--------------------------------------------------",
"#__validation_error_in__: ENCBS188BKX \n\n# donor_age.0: 'N' is not valid under any of the given schemas",
"__schema_id__:1",
"\t'N' is not of type 'number'",
"__schema_id__:2",
"\t'N' is not one of ['90+', 'NA']",
"__schema_id__:3",
"\t'N' does not match '^\\\\d+-\\\\d+$'",
"--------------------------------------------------",
"#__validation_error_in__: ENCBS188BKX \n\n# donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
"--------------------------------------------------",
"#__validation_error_in__: ENCBS188BKX \n\n# disease.0: None is not of type 'string'",
"--------------------------------------------------"
],
"ok": false,
"version": "1.0"
}
}
]
}
To get something like this:
{
"ENCBS188BKX": {
"error_type": "jsonschema",
"errors": [
"donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
"donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
"disease.0: None is not of type 'string'",
""
],
"ok": false,
"version": "1.0"
}
}
@juettemann head of feb2020 will give errors like below. can you check that this is correct and what you want?
{
"ENCBS188BKX": {
"error_type": "jsonschema",
"errors": [
"donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
"donor_age.0: 'N' is not valid under any of the given schemas",
"donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
"disease.0: None is not of type 'string'"
],
"ok": false,
"version": "1.0"
}
}
The above looks good, thanks.
A few more small issues:
-- sample-Cell_Line-BIOMATERIAL_TYPE_only.xml Remove additional array
{
"ENCBS046ENC": {
"error_type": "__prevalidation__",
"errors": [
"missing",
[
"batch",
"differentiation_method",
"differentiation_stage",
"line",
"lineage",
"medium",
"passage",
"sex"
]
],
"ok": false,
"version": "1.1"
}
},
New (alternatively attach ' is a required property' to be aligned to the jsonschema output)
{
"ENCBS046ENC": {
"error_type": "__prevalidation__",
"errors": [
"missing: batch",
"missing: differentiation_method",
"missing: differentiation_stage",
"missing: line",
"missing: lineage",
"missing: medium",
"missing: passage",
"missing: sex"
],
"ok": false,
"version": "1.1"
}
},
-- sample-Cell_Line-pass_prevalidation.xml Can the leading ": " be removed?
{
"ENCBS046ENC": {
"error_type": "jsonschema",
"errors": [
": 'biological_replicates' is a required property",
": 'biomaterial_provider' is a required property",
": 'disease' is a required property",
": 'disease_ontology_uri' is a required property",
": 'sample_ontology_uri' is a required property",
": 'treatment' is a required property"
],
"ok": false,
"version": "1.1"
}
},
New
{
"ENCBS046ENC": {
"error_type": "jsonschema",
"errors": [
"'biomaterial_provider' is a required property",
"'disease' is a required property",
"'disease_ontology_uri' is a required property",
"'sample_ontology_uri' is a required property",
"'treatment' is a required property"
],
"ok": false,
"version": "1.0"
}
}
@juettemann
I have added the following function:
The input to it right now is the current error log json. You can change the json however you like in that function and that is the output it will write.
Can you do whatever you need inside that function?
btw what are we doing about @zxenia's ontology validation.
@sitag There are basically only 2 things I am asking for
Consistent output. At the moment "errors" is sometimes a string, sometimes an array, and sometimes an array of arrays. "errors" should always be an array with each element being one error, like this one:
{
"ENCBS046ENC": {
"error_type": "jsonschema",
"errors": [
"'biomaterial_provider' is a required property",
"'disease' is a required property",
"'disease_ontology_uri' is a required property",
"'sample_ontology_uri' is a required property",
"'treatment' is a required property"
],
"ok": false,
"version": "1.0"
}
}
I am not fussed about the single quotes or "is a required property".
Assertion error. Unless there is a syntax mistake (e.g. missing closing tag) in the XML, there should not be an assertion error. Please let me know if there are reasons against it.
I have not looked too much into the codebase of the validator, so it would likely be much faster if you could do that. Doing all post processing in one method seems also counterintuitive. I would think that each module/method that is used to validate returns the correct output. Happy to be corrected if that is not the proper way.
@juettemann can you provide examples where errors is not an array.
I have not looked too much into the codebase of the validator, so it would likely be much faster if you could do that.
Like I said, the json that goes into the file passes through that function. You don't need to know anything about the code base. Just what you get into that function and change it into whatever you want to send out. Most of the changes to want are find and replaces, we can avoid a week long lag here each time.
@juettemann I am not sure what you mean
"errors" is sometimes a string, sometimes an array,
Errors are always a list of 2-tuples: the first entry is what kind of error (missing attributes, semantic error etc), the second item if list of strings describing each error. Tuples in json get mapped to lists.
@juettemann Cannot move forward till you respond. Please in future provide context and complete details with you request.
@sitag
I tried 2 ways to explain what the issues are.
Detailed issues for the different tests: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-616903771
More General, showing what should always been the format, no matter if it comes from prevalidation or jsonschema, Experiment or Sample: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-618248280
Please note that I provided the same for Experiment: https://github.com/IHEC/ihec-ecosystems/issues/100#issue-603627258
Further have I created all necessary test files for both, Experiment and Samples.
Currently we have several different formats (all examples are shortened, but either is taken from this issue or #100)
String
Experiment-only_LIBRARY_STRATEGY": {
"error_type": "__prevalidation__",
"errors": "__mising_both__:__experiment_ontology_uri+experiment_type__",
List
"ENCBS188BKX": {
"error_type": "jsonschema",
"errors": [
"donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
Two-dimensional list
"ENCBS046ENC": {
"error_type": "__prevalidation__",
"errors": [
"missing",
[
"batch",
I did suggest a normal list, but I don't mind if you prefer a two-dimensional one. As long as it is consistent.
Not having sometimes "missing" and sometimes "is a required property" is a bonus, as well as removing the leading colon and space "_: 'biologicalreplicates' is a required property".
@juettemann
You original comments are unclear, which is why I have asked you to clarify - which you still haven't done. I am struggling to understand your comment: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-618248280
What is inconsistent? Give example, please. When I look at the code, it's always (aka consistently ) a list of tuples. But maybe I missed something.
In your claim about "Output should be consistent", you have provided no context, and no examples of where the output is "inconsistent". Please provide details for that comment. Give me an example where this is string, and one with a list.
You have requested multiple changes for "ENCBS046ENC" - can you collapse them into one request?
Your other comment is a minor formatting change, and another about assertions is a non issue. Since you are parsing the json at some point anyway (as you claim here: https://github.com/IHEC/ihec-ecosystems/issues/100#issuecomment-618250151), it make sense that you marshall into whatever you want. I can do this for you here, if for some reason you are averse to parsing json.
For some of these assertions, if they fail they there's option but to quit, so that is just something you will need to watch. Instead of an assertion failure, you will just have quiet exit with a logged message and no json.
Not having sometimes "missing" and sometimes "is a required property" is a bonus, as well as removing the leading colon and space ": 'biological_replicates' is a required property".
This is not going to happen. One error is prevalidator, one is validator.
@juettemann changes pushed to head of feb2020
The error messages for the JSON look good, closing.
Here is the JSON output that I am currently expecting from the validator.