SynBioDex / tyto

Use ontology terms in your Python application
Apache License 2.0

Normalize function #75

Open jakebeal opened 1 year ago

jakebeal commented 1 year ago

I often want to put a URI into "normal form", i.e., the recommended form. Currently, this is done with tyto.X.get_uri_by_term(tyto.X.get_term_by_uri(uri)), where X is the ontology.

It would be nice to have normalization as an efficient convenience method.
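The round trip described above can be wrapped in a small helper. What follows is a minimal sketch, not part of tyto: `normalize_uri` is a hypothetical name, and `FakeOntology` is an offline stand-in with made-up lookup tables so the example runs without querying an ontology server. It assumes only the two methods named above, get_term_by_uri and get_uri_by_term.

```python
class FakeOntology:
    """Offline stand-in for a tyto ontology object (illustration only)."""

    def __init__(self, term_by_uri, uri_by_term):
        self._term_by_uri = term_by_uri
        self._uri_by_term = uri_by_term

    def get_term_by_uri(self, uri):
        return self._term_by_uri[uri]

    def get_uri_by_term(self, term):
        return self._uri_by_term[term]


def normalize_uri(ontology, uri):
    """Return the recommended form of `uri` via a URI -> term -> URI round trip."""
    return ontology.get_uri_by_term(ontology.get_term_by_uri(uri))


# Made-up tables: two spellings of the same URI map to one term,
# and the term maps back to a single recommended URI.
so = FakeOntology(
    term_by_uri={
        "http://identifiers.org/so/SO:0000167": "promoter",
        "https://identifiers.org/SO:0000167": "promoter",
    },
    uri_by_term={"promoter": "https://identifiers.org/SO:0000167"},
)
print(normalize_uri(so, "http://identifiers.org/so/SO:0000167"))
```

With a real ontology object such as tyto.SO in place of the stand-in, each call would cost two lookups, which is the inefficiency a built-in method could avoid.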

bbartley commented 1 year ago

I think what you are asking for can be accomplished by the following:

uri = tyto.URI('https://identifiers.org/SO:0000167', tyto.SO)

jakebeal commented 1 year ago

Unfortunately, that does not seem to be the case:

>>> tyto.URI('http://identifiers.org/so/SO:0000316', tyto.SO)
'http://identifiers.org/so/SO:0000316'
>>> tyto.URI('https://identifiers.org/SO:0000316', tyto.SO)
'https://identifiers.org/SO:0000316'
>>> tyto.URI('https://nonsense_uri', tyto.SO)
'https://nonsense_uri'

bbartley commented 1 year ago

Is this what you are looking for?

>>> promoter = tyto.SO.promoter
>>> promoter
'https://identifiers.org/SO:0000167'
>>> tyto.SO._sanitize_uri(promoter)
'http://purl.obolibrary.org/obo/SO_0000167'
>>> tyto.SO._reverse_sanitize_uri('http://purl.obolibrary.org/obo/SO_0000167')
'https://identifiers.org/SO:0000167'

jakebeal commented 1 year ago

That's looking along the right lines, but I'm still a bit mystified, because _sanitize_uri is a) not caring if it's part of the ontology or not, and b) not returning the same URI that gets returned when I look up terms.

>>> tyto.SO._sanitize_uri('https://identifiers.org/SO:0000316')
'http://purl.obolibrary.org/obo/SO_0000316'
>>> tyto.SO.get_uri_by_term('promoter')
'https://identifiers.org/SO:0000167'
>>> tyto.SO._sanitize_uri('https://nonsense.uri')
'https://nonsense.uri'

Is there any function that I can give 'http://identifiers.org/so/SO:0000316' and have it return a URI in the same form that get_uri_by_term returns (e.g., in this case 'https://identifiers.org/SO:0000316')?

bbartley commented 1 year ago

tyto.SO._reverse_sanitize_uri is a natural place to tuck this functionality. Currently it recognizes a purl namespace and converts it back to identifiers.org. It could also be extended to normalize URIs that match the pattern 'http://identifiers.org/so/'.

From an SBOL perspective, I think your natural inclination would be to assume that _sanitize_uri returns a URI in the identifiers.org namespace. That is not the case. The logic behind _sanitize_uri and _reverse_sanitize_uri is that the query builder has to normalize (sanitize) a URI to a purl namespace in order to query the ontology servers (they recognize purl, not identifiers.org, which makes me question why SBOL chose to normalize on identifiers.org). Likewise, the ontology servers return URIs in the purl namespace, so they have to be "reverse sanitized" back into identifiers.org. The query builder typically does this under the hood, so the methods are private.
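The two directions of that mapping can be illustrated with plain string handling. This is only a sketch of the idea, not tyto's actual implementation: the function names `to_purl` and `to_identifiers` and both regular expressions are assumptions made for illustration. It also accepts the older 'http://identifiers.org/so/' form, as suggested above, and passes unrecognized URIs through unchanged, which matches the behavior seen in the sessions earlier in this thread.

```python
import re

PURL_PREFIX = "http://purl.obolibrary.org/obo/"
IDENTIFIERS_PREFIX = "https://identifiers.org/"

# Accepts 'https://identifiers.org/SO:0000167' as well as the older
# 'http://identifiers.org/so/SO:0000167' spelling.
_IDENTIFIERS_RE = re.compile(
    r"^https?://identifiers\.org/(?:[a-z]+/)?([A-Za-z]+):(\d+)$"
)
_PURL_RE = re.compile(r"^http://purl\.obolibrary\.org/obo/([A-Za-z]+)_(\d+)$")


def to_purl(uri):
    """Map an identifiers.org URI to its purl form ("sanitize")."""
    match = _IDENTIFIERS_RE.match(uri)
    if match is None:
        return uri  # unrecognized input is passed through unchanged
    prefix, number = match.groups()
    return f"{PURL_PREFIX}{prefix}_{number}"


def to_identifiers(uri):
    """Map a purl URI back to identifiers.org ("reverse sanitize")."""
    match = _PURL_RE.match(uri)
    if match is None:
        return uri  # unrecognized input is passed through unchanged
    prefix, number = match.groups()
    return f"{IDENTIFIERS_PREFIX}{prefix}:{number}"
```

Note that a purely syntactic mapping like this cannot tell whether a URI names a real term; checking membership still requires a lookup against the ontology.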

In any case, I could go ahead and implement a public normalize function with the functionality you requested, although, as noted above, it's a bit of a misnomer since all the ontology resources normalize on purl namespace.

jakebeal commented 1 year ago

Whatever makes sense under the hood is fine by me. The key thing I need is for the results of tyto.ontology.get_uri_by_term() and tyto.ontology.normalize(uri) to be equal.

Implementing that function would be great! You can currently find my workaround version in the SBOL utilities workarounds at https://github.com/SynBioDex/SBOL-utilities/blob/2b8d6289cf2ed818deb95a34b27d7ea25567982c/sbol_utilities/workarounds.py#L24-L37

bbartley commented 1 year ago

Do you want it to throw an error if the given URI is not a member of the ontology, e.g., https://nonsense.uri ?

jakebeal commented 1 year ago

I'm fine with either throwing a lookup exception or returning None. For my first specific use case, it would be a little more convenient if it returned None, but I can make it work either way, so I think you should do what you think makes most sense from a tyto-centric perspective.

Maybe you could even make it an optional argument that switches between the two behaviors: it would default to throwing an exception, but could be overridden to return None instead (sort of like how directory creation has the exist_ok option).
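A switch of that kind might look like the sketch below. Everything here is hypothetical: the name `normalize`, the `missing_ok` parameter (modeled on exist_ok), and the assumption that a failed lookup raises LookupError are illustrative choices, not tyto's API. The toy lookup table stands in for a real ontology query so the example runs offline.

```python
def normalize(lookup, uri, missing_ok=False):
    """Normalize `uri` with `lookup`; optionally return None for unknown URIs.

    `lookup` is any callable that returns the normalized URI, or raises
    LookupError for a URI outside the ontology (an assumption of this sketch).
    """
    try:
        return lookup(uri)
    except LookupError:
        if missing_ok:
            return None
        raise  # default behavior: propagate the lookup failure


# Toy table standing in for a real ontology query.
_KNOWN = {
    "http://identifiers.org/so/SO:0000167": "https://identifiers.org/SO:0000167",
}


def toy_lookup(uri):
    """Return the normalized URI or raise LookupError if it is unknown."""
    try:
        return _KNOWN[uri]
    except KeyError:
        raise LookupError(f"{uri} is not in the ontology") from None
```

Defaulting to the exception keeps silent failures out of code that does not expect None, while callers who want the lenient behavior opt in explicitly with missing_ok=True.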