INCATools / ontology-access-kit

Ontology Access Kit: A python library and command line application for working with ontologies
https://incatools.github.io/ontology-access-kit/
Apache License 2.0

High-level interface via the Bioregistry #74

Open cthoyt opened 2 years ago

cthoyt commented 2 years ago

The bioontologies package has a very high-level interface for getting OBO graphs by prefix, since it can automatically look up the appropriate IRIs via its mappings to OBO Foundry.

import bioontologies

parse_results = bioontologies.get_obograph_by_prefix("go")
go_graph_document = parse_results.graph_document

This could be extended to OAK, and could also let you choose whether the OWL, OBO, or OBO Graph JSON artifact gets consumed.
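A minimal sketch of what choosing the artifact could look like. The helper `get_download_url` and the `resolvers` parameter are hypothetical (they are not part of OAK or bioontologies); the sketch only assumes Bioregistry's `get_owl_download`, `get_obo_download`, and `get_json_download`, and imports `bioregistry` lazily so the dispatch logic can be tried with stand-in resolvers.

```python
from typing import Callable, Dict, Optional

# A resolver takes a prefix and returns a download URL, or None if unknown.
Resolver = Callable[[str], Optional[str]]


def bioregistry_resolvers() -> Dict[str, Resolver]:
    """Map each artifact format to the Bioregistry getter for its download URL."""
    import bioregistry  # third-party; imported lazily so the rest runs without it

    return {
        "owl": bioregistry.get_owl_download,
        "obo": bioregistry.get_obo_download,
        "json": bioregistry.get_json_download,
    }


def get_download_url(
    prefix: str,
    fmt: str,
    resolvers: Optional[Dict[str, Resolver]] = None,
) -> str:
    """Hypothetical helper: return the download URL for `prefix` in format `fmt`."""
    if resolvers is None:
        resolvers = bioregistry_resolvers()
    if fmt not in resolvers:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(resolvers)}")
    url = resolvers[fmt](prefix)
    if url is None:
        raise LookupError(f"no {fmt} download registered for prefix {prefix!r}")
    return url
```

With Bioregistry installed, `get_download_url("go", "owl")` would simply delegate to `bioregistry.get_owl_download("go")`; parsing the downloaded artifact would still be left to OAK or bioontologies.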

cmungall commented 2 years ago

Cool, so if I understand correctly, things could be consumed at different levels.

First, there is the actual obtaining of the files themselves:

bioregistry.get_json_download(prefix)
bioregistry.get_obo_download(prefix)
bioregistry.get_owl_download(prefix)

I think it would be useful for there to be a small lightweight library for accessing/downloading ontologies along different axes:

  1. format
  2. product (e.g. go-plus vs go vs go-basic)
  3. version (current or a specific version)
  4. provider (e.g. OLS for EFO, EDAM, etc; bioportal for some vocabularies not appropriate for OBO)

In theory, if we follow semantic web practice, we get all this for free, but in practice it's annoying that we don't have a standard way to, e.g., list all versions of an ontology from a simple Python call.
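As a rough sketch of the four axes above, assuming a hypothetical `OntologyRequest` object and the common OBO PURL layout (main product at `/obo/{prefix}.{format}`, other products under `/obo/{prefix}/`, dated releases under `/obo/{prefix}/releases/{version}/`). Not every ontology publishes every product or release at these paths, and other providers are deliberately left unimplemented.

```python
from dataclasses import dataclass
from typing import Optional

OBO_PURL_BASE = "http://purl.obolibrary.org/obo"


@dataclass(frozen=True)
class OntologyRequest:
    """Hypothetical description of one ontology artifact along the four axes."""

    prefix: str                    # ontology identifier, e.g. "go"
    fmt: str = "owl"               # axis 1: "owl", "obo", or "json"
    product: Optional[str] = None  # axis 2: e.g. "go-basic"; None means the main product
    version: Optional[str] = None  # axis 3: e.g. a release date; None means current
    provider: str = "obo"          # axis 4: only OBO PURLs are sketched here


def obo_purl(request: OntologyRequest) -> str:
    """Build a URL following the assumed OBO PURL layout described above."""
    if request.provider != "obo":
        raise NotImplementedError(f"provider {request.provider!r} is not sketched here")
    filename = f"{request.product or request.prefix}.{request.fmt}"
    if request.version is not None:
        return f"{OBO_PURL_BASE}/{request.prefix}/releases/{request.version}/{filename}"
    if request.product is not None:
        return f"{OBO_PURL_BASE}/{request.prefix}/{filename}"
    return f"{OBO_PURL_BASE}/{filename}"
```

Building a URL for a known version is the easy part; listing which versions exist is the piece that has no standard answer and would need a registry or per-provider logic.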

Happy for this to be in bioregistry, but it seems that getting into registering versions of resources is out of scope - are you thinking this sort of thing could go in bioontologies? That could work. I think there is also an argument for a minimal obolibrary ontology that has some of these simple registry functions we use a lot, e.g. in the dashboard.

I see bioontologies also takes care of conversion using robot - I hope soon we can do this without a JVM dependency or invoking a subprocess.

There is also the bioontologies pydantic layer, which is nice, but it would take some thinking about how best to integrate it - it could simply go in as its own implementation, but it would be good to avoid repetition of logic (e.g. the shortcuts that map to IAO and OIO predicates).

cthoyt commented 2 years ago

Since Bioregistry was secretly supporting PyOBO back in the beginning, I was also using it to catalog the locations of many ontologies, especially non-OBO Foundry ones that are hard to find. I often ran into OBO Foundry links that were dead, out of date, or unable to be parsed. However, solving the issues of versioning is indeed getting a bit out of scope. Unfortunately, doing this well will probably require a large set of manually curated rules for fixing inconsistencies in OBO Foundry and other ontology sources, so it's not clear to me whether it would have a place in bioontologies (which is pretty lightweight at the moment).

> Happy for this to be in bioregistry, but it seems that getting into registering versions of resources is out of scope - are you thinking this sort of thing could go in bioontologies? That could work. I think there is also an argument for a minimal obolibrary ontology that has some of these simple registry functions we use a lot, e.g. in the dashboard.

Maybe we can finally go about packaging the OBO Foundry data and building a small set of tooling around accessing it, as a pilot and precursor for doing this more generally for ontologies. The Bioregistry actually already does this in a few ways, but it would be better to have it attached to the OBO Foundry itself.

> I see bioontologies also takes care of conversion using robot - I hope soon we can do this without a JVM dependency or invoking a subprocess.

Agreed, would love to see this done natively in Python.

> There is also the bioontologies pydantic layer, which is nice, but it would take some thinking about how best to integrate it - it could simply go in as its own implementation, but it would be good to avoid repetition of logic (e.g. the shortcuts that map to IAO and OIO predicates).

There's a standardize() function that adds some opinionated parsing of URIs into bioregistry-compliant CURIEs and other rules for practical downstream usage, but adding this kind of stuff is usually a slippery slope... Anyway, I would be happy to incorporate extra domain rules there if they're useful.
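For illustration only, a minimal sketch of the kind of rule such a function might apply: compacting an OBO term PURL into a CURIE. This is not the actual standardize() implementation; `obo_purl_to_curie` is a hypothetical name, and lowercasing the prefix is an assumption standing in for a real Bioregistry prefix lookup.

```python
import re
from typing import Optional

# OBO term PURLs look like http://purl.obolibrary.org/obo/GO_0008150
_OBO_TERM = re.compile(
    r"^https?://purl\.obolibrary\.org/obo/([A-Za-z][A-Za-z0-9]*)_(\S+)$"
)


def obo_purl_to_curie(uri: str) -> Optional[str]:
    """Hypothetical rule: turn an OBO term PURL into a lowercase-prefixed CURIE.

    Returns None for URIs that do not match the OBO PURL pattern, so callers
    can fall through to other rules instead of guessing.
    """
    match = _OBO_TERM.match(uri)
    if match is None:
        return None
    prefix, local_id = match.groups()
    # Assumption: lowercasing approximates normalizing to a canonical prefix.
    return f"{prefix.lower()}:{local_id}"
```

Each extra source-specific rule would be another pattern like this one, which is where the slippery slope comes from.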