IHEC / ihec-ecosystems

This repo is for code and documentation associated with the ihec-ecosystems working group
Apache License 2.0
5 stars 6 forks source link

Ontology validation #48

Closed zxenia closed 4 years ago

zxenia commented 5 years ago

This PR addresses the issue of ontology validation.

The goals are:

  1. to validate that only six accepted ontologies used

  2. to check that all used ontology terms are valid terms

Five out of six accepted ontologies are validated via Ontology Lookup API service

One - NCI Metathesaurus - doesn't provide an API service. Therefore there is no validation of NCI terms for now.

Ontology validation takes a CURIE as an input, despite the schema has not been updated yet (PR #37 is open). It doesn't validate CURIE format, this I think should be done on schema spec level (e.g. a regex pattern).

Note: Current branch is a feature branch from validator_reports

sitag commented 4 years ago

@zxenia

too many commits to review each individual one.

this is a possibly a bug: https://github.com/IHEC/ihec-ecosystems/blob/validate_ontology/IHEC_Data_Hub/ontology_lookup.py#L54-L61

i don't see what stops current_ontology == rule_ontology from being None == None? this will evaluate to true and not do any validation.

https://github.com/IHEC/ihec-ecosystems/blob/validate_ontology/IHEC_Data_Hub/ontology_lookup.py#L41

can you change it so it does not return a partially filled hash?

can you separate the logging from functionality for each function:

def log(self, x):
    logging.getLogger()....

then self.log(msg) , so the api user can control it.

i would have written this as a service instead of per term, but i can wrap it as one, so your call.

zxenia commented 4 years ago

Hi @sitag ,

  1. rule_ontology can't be None. The method relies on the current metadata spec: it takes ontology_type which is a key in schema (e.g. 'sample_ontology_curie') , then the values of 'ontology_type' are set in ontology_rules - those are taken from the metadata spec.

  2. can you change it so it does not return a partially filled hash?

    Could you give me an example here, I can't reproduce it. Currently, it is expected that the curie has been validated by json schema - we have the regex pattern check for curie. So it takes a valid curie in the form of e.g. uberon:0013540, splits it and returns dict {'ontology_name': 'uberon', 'curie': '0013540'}

sitag commented 4 years ago

@zxenia But you never check that parameter 'ontology_type' to the function is actually in the hash; if I use this as a library and pass something wrong in there then that error should not quietly pass. It would make it very hard to debug. You need to raise an error if it's not in the hash, instead of quietly using None. In general, KeyErrors are meaningful, and using .get hides them.

zxenia commented 4 years ago

Hi @sitag, I added a logger method for OntologyLookup, you can mute the logs and pass your error messages.

All logs from ontology_lookup and validateOntologies() can be muted by setting: logging.getLogger('ontology_lookup').disabled = True

logging.getLogger('validateOntologies').disabled = True

or setting the logs level

logging.getLogger('ontology_lookup').setLevel(logging.ERROR)

sitag commented 4 years ago

@zxenia looks okay to me