IHEC / ihec-ecosystems

This repo is for code and documentation associated with the ihec-ecosystems working group
Apache License 2.0
5 stars 6 forks source link

Validator JSON output #92

Closed juettemann closed 4 years ago

juettemann commented 4 years ago

Here is the JSON output that I am currently expecting from the validator.

     "Filename.xml": [
         {   
             "accession_1": {
                 "errors": [
                     "missing",
                     [   
                         "experiment_ontology_curie",
                         "extraction_protocol_sonication_cycles",
                         "extraction_protocol_type_of_sonicator"                                                                                                                                         
                     ]   
                 ],  
           "ok": false,
                 "version": "1.1"
             }   
         },  
         {   
             "accession_2": {
                 "errors": [], 
                 "ok": true,
                 "version": "1.0" 
             }     
         }     
     ]     
 }
sitag commented 4 years ago

@juettemann for what xml?

sitag commented 4 years ago

@juettemann see https://github.com/IHEC/ihec-ecosystems/blob/feb2020/version_metadata/ihec_validator_base.py#L34-L47 and https://github.com/IHEC/ihec-ecosystems/blob/feb2020/version_metadata/ihec_validator_base.py#L14-L20 . We only track the latest schema the xml validates against.

We can track all schemas and not just latest, but if there are multiple valid ones, which one do we put in the xml we write?

If 1.1 validates then we don't try 1.0 right now, since if both validate then we don't know what to put in the xml.

sitag commented 4 years ago

head of feb2020 now runs all but only keeps first.

$ python -m $m -sample -out:$m/examples/samples.versioned.xml $m/examples/sample_1.0_1.1.xml -dbg
...
# xml validates [against:schemas/xml/SRA.sample.xsd]... True [version_metadata/examples/samples.versioned.xml]
ok
{
    "version_metadata/examples/sample_1.0_1.1.xml": [
        {
            "ENCBS054KUO": {
                "errors": [],
                "ok": true,
                "version": "1.1"
            }
        },
        {
            "ENCBS054KUO": {
                "errors": [],
                "ok": true,
                "version": "1.0"
            }
        }
    ]
}
juettemann commented 4 years ago

@sitag I saved this output from an old run, it was meant to be conceptual, showing one successful and one failed run. I'll do more test with incomplete files and add them to the examples. I agree that we should only keep the latest version.

sitag commented 4 years ago

@juettemann but that output is not correct. if 1.1 fails then 1.0 will be tested as well, so if you have fails you have all the versions. i am not sure what is the expected outcome from this issue? is there a change you want?

juettemann commented 4 years ago

@sitag OK, I misunderstood/misread. It all sounds good to me. I agree to only keep the highest successful version. Your example above still shows both Following your description, I would expect something like this:

Successful run 1.1, 1.0 skipped:

{
    "version_metadata/examples/sample_1.0_1.1.xml": [
        {
            "ENCBS054KUO": {
                "errors": [],
                "ok": true,
                "version": "1.1"
            }
        },
    ]
}

Successful 1.0, failed 1.1:

{
    "version_metadata/examples/sample_1.0_1.1.xml": [
        {
            "ENCBS054KUO": {
                "errors": [
                     "missing",
                     [   
                         "experiment_ontology_curie",
                      ]
                 ],
                "ok": false,
                "version": "1.1"
            }
        },
        {
            "ENCBS054KUO": {
                "errors": [],
                "ok": true,
                "version": "1.0"
            }
        }
    ]
}

Failed 1.0, failed 1.1

{
    "version_metadata/examples/sample_1.0_1.1.xml": [
        {
            "ENCBS054KUO": {
                "errors": [
                     "missing",
                     [   
                         "experiment_ontology_curie",
                      ]
                 ],
                "ok": false,
                "version": "1.1"
            }
        },
        {
            "ENCBS054KUO": {
                "errors": [
                     "missing",
                     [   
                         "experiment_ontology_uri",
                      ]
                 ],
                "ok": false,
                "version": "1.0"
            }
        }
    ]
}
sitag commented 4 years ago

the head of feb2020 now adjust for @zxenia 's fixes and reverts to only checking till first pass. @juettemann if you can provide separate xmls with these cases then we can hook that into the testbench.

juettemann commented 4 years ago

XMLs are in a separate PR: https://github.com/IHEC/ihec-ecosystems/pull/96 I am not married to any of the ideas, happy to discuss changes.

juettemann commented 4 years ago

Current output (I removed the 1.1 one)

  "/homes/juettema/src/ihec-ecosystems/version_metadata/examples/sample-type_errors.xml": [
        {
            "ENCBS188BKX": {
                "error_type": "jsonschema",
                "errors": [
                    "__total_errors__:4",
                    "#__validation_error_in__: ENCBS188BKX \n\n# donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
                    "--------------------------------------------------",
                    "#__validation_error_in__: ENCBS188BKX \n\n# donor_age.0: 'N' is not valid under any of the given schemas",
                    "__schema_id__:1",
                    "\t'N' is not of type 'number'",
                    "__schema_id__:2",
                    "\t'N' is not one of ['90+', 'NA']",
                    "__schema_id__:3",
                    "\t'N' does not match '^\\\\d+-\\\\d+$'",
                    "--------------------------------------------------",
                    "#__validation_error_in__: ENCBS188BKX \n\n# donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
                    "--------------------------------------------------",
                    "#__validation_error_in__: ENCBS188BKX \n\n# disease.0: None is not of type 'string'",
                    "--------------------------------------------------"
                ],
                "ok": false,
                "version": "1.0"
            }
        }
    ]
}
  1. Can we remove __total_errors__:4
  2. Can we remove the dash-line, and also "#validation_error_in: ENCBS188BKX \n\n#"
  3. Can donor_age go into prevalidation?

To get something like this:

{
  "ENCBS188BKX": {
    "error_type": "jsonschema",
      "errors": [
        "donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
        "donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
        "disease.0: None is not of type 'string'",
        ""
      ],
      "ok": false,
      "version": "1.0"
  }
}
sitag commented 4 years ago

@juettemann head of feb2020 will give errors like below. can you check that this is correct and what you want?

        {
            "ENCBS188BKX": {
                "error_type": "jsonschema",
                "errors": [
                    "donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",
                    "donor_age.0: 'N' is not valid under any of the given schemas",
                    "donor_life_stage.0: 'baby' is not one of ['embryonic', 'fetal', 'postnatal', 'newborn', 'child', 'adult', 'unknown']",
                    "disease.0: None is not of type 'string'"
                ],
                "ok": false,
                "version": "1.0"
            }
        }
juettemann commented 4 years ago

The above looks good, thanks.

A few more small issues:

-- sample-Cell_Line-BIOMATERIAL_TYPE_only.xml Remove additional array

{
  "ENCBS046ENC": {
    "error_type": "__prevalidation__",
      "errors": [
        "missing",
      [
        "batch",
        "differentiation_method",
        "differentiation_stage",
        "line",
        "lineage",
        "medium",
        "passage",
        "sex"
      ]
      ],
      "ok": false,
      "version": "1.1"
  }
},

New (alternatively attach ' is a required property' to be aligned to the jsonschema output)

{
  "ENCBS046ENC": {
    "error_type": "__prevalidation__",
    "errors": [
      "missing: batch",
      "missing: differentiation_method",
      "missing: differentiation_stage",
      "missing: line",
      "missing: lineage",
      "missing: medium",
      "missing: passage",
      "missing: sex"
    ],
    "ok": false,
    "version": "1.1"
  }
},

-- sample-Cell_Line-pass_prevalidation.xml Can the leading ": " be removed?

{
  "ENCBS046ENC": {
    "error_type": "jsonschema",
      "errors": [
        ": 'biological_replicates' is a required property",
        ": 'biomaterial_provider' is a required property",
        ": 'disease' is a required property",
        ": 'disease_ontology_uri' is a required property",
        ": 'sample_ontology_uri' is a required property",
        ": 'treatment' is a required property"
      ],
      "ok": false,
      "version": "1.1"
  }
},

New

{
  "ENCBS046ENC": {
    "error_type": "jsonschema",
    "errors": [
      "'biomaterial_provider' is a required property",
      "'disease' is a required property",
      "'disease_ontology_uri' is a required property",
      "'sample_ontology_uri' is a required property",
      "'treatment' is a required property"
    ],
    "ok": false,
    "version": "1.0"
  }
}
sitag commented 4 years ago

@juettemann

I have added the following function:

https://github.com/IHEC/ihec-ecosystems/blob/96042eaed0f3f54d1b3455d8dbd28d0b983a815f/version_metadata/io_adaptor.py#L16

The input to it right now is the current error log json. You can change the json however you like in that function and that is the output it will write.

Can you do whatever you need inside that function?

btw what are we doing about @zxenia's ontology validation.

juettemann commented 4 years ago

@sitag There are basically only 2 things I am asking for

  1. Consistent output. At the moment "errors" is sometimes a string, sometimes an array, and sometimes an array of arrays. "errors" should always be an array with each element being one error, like this one:

    {
    "ENCBS046ENC": {
    "error_type": "jsonschema",
    "errors": [
      "'biomaterial_provider' is a required property",
      "'disease' is a required property",
      "'disease_ontology_uri' is a required property",
      "'sample_ontology_uri' is a required property",
      "'treatment' is a required property"
    ],
    "ok": false,
    "version": "1.0"
    }
    }

    I am not fussed about the single quotes or "is a required property".

  2. Assertion error. Unless there is a syntax mistake (e.g. missing closing tag) in the XML, there should not be an assertion error. Please let me know if there are reasons against it.

I have not looked too much into the codebase of the validator, so it would likely be much faster if you could do that. Doing all post processing in one method seems also counterintuitive. I would think that each module/method that is used to validate returns the correct output. Happy to be corrected if that is not the proper way.

sitag commented 4 years ago

@juettemann can you provide examples where errors is not an array.

I have not looked too much into the codebase of the validator, so it would likely be much faster if you could do that.

Like I said, the json that goes into the file passes through that function. You don't need to know anything about the code base. Just what you get into that function and change it into whatever you want to send out. Most of the changes to want are find and replaces, we can avoid a week long lag here each time.

sitag commented 4 years ago

@juettemann I am not sure what you mean

"errors" is sometimes a string, sometimes an array,

Errors are always a list of 2-tuples: the first entry is what kind of error (missing attributes, semantic error etc), the second item if list of strings describing each error. Tuples in json get mapped to lists.

sitag commented 4 years ago

@juettemann Cannot move forward till you respond. Please in future provide context and complete details with you request.

juettemann commented 4 years ago

@sitag

I tried 2 ways to explain what the issues are.

  1. Detailed issues for the different tests: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-616903771

  2. More General, showing what should always been the format, no matter if it comes from prevalidation or jsonschema, Experiment or Sample: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-618248280

Please note that I provided the same for Experiment: https://github.com/IHEC/ihec-ecosystems/issues/100#issue-603627258

Further have I created all necessary test files for both, Experiment and Samples.

Currently we have several different formats (all examples are shortened, but either is taken from this issue or #100)

String

Experiment-only_LIBRARY_STRATEGY": {
    "error_type": "__prevalidation__",
    "errors": "__mising_both__:__experiment_ontology_uri+experiment_type__",

List

"ENCBS188BKX": {
                "error_type": "jsonschema",
                "errors": [
                    "donor_sex.0: 'Femalxe' is not one of ['Male', 'Female', 'Unknown', 'Mixed']",

Two-dimensional list

"ENCBS046ENC": {
    "error_type": "__prevalidation__",
      "errors": [
        "missing",
      [
        "batch",

I did suggest a normal list, but I don't mind if you prefer a two-dimensional one. As long as it is consistent.

Not having sometimes "missing" and sometimes "is a required property" is a bonus, as well as removing the leading colon and space "_: 'biologicalreplicates' is a required property".

sitag commented 4 years ago

@juettemann

You original comments are unclear, which is why I have asked you to clarify - which you still haven't done. I am struggling to understand your comment: https://github.com/IHEC/ihec-ecosystems/issues/92#issuecomment-618248280

What is inconsistent? Give example, please. When I look at the code, it's always (aka consistently ) a list of tuples. But maybe I missed something.

In your claim about "Output should be consistent", you have provided no context, and no examples of where the output is "inconsistent". Please provide details for that comment. Give me an example where this is string, and one with a list.

You have requested multiple changes for "ENCBS046ENC" - can you collapse them into one request?

Your other comment is a minor formatting change, and another about assertions is a non issue. Since you are parsing the json at some point anyway (as you claim here: https://github.com/IHEC/ihec-ecosystems/issues/100#issuecomment-618250151), it make sense that you marshall into whatever you want. I can do this for you here, if for some reason you are averse to parsing json.

For some of these assertions, if they fail they there's option but to quit, so that is just something you will need to watch. Instead of an assertion failure, you will just have quiet exit with a logged message and no json.

sitag commented 4 years ago

Not having sometimes "missing" and sometimes "is a required property" is a bonus, as well as removing the leading colon and space ": 'biological_replicates' is a required property".

This is not going to happen. One error is prevalidator, one is validator.

sitag commented 4 years ago

@juettemann changes pushed to head of feb2020

juettemann commented 4 years ago

The error messages for the JSON look good, closing.