dhimmel / obonet

OBO-formatted ontologies → networkx (Python 3)
https://github.com/dhimmel/obonet/blob/main/examples/go-obonet.ipynb

Upload the package to PyPI #3

Closed azneto closed 7 years ago

azneto commented 7 years ago

It would be nice to have this library uploaded to PyPI to make it really easy to install. Keep up the good work.

dhimmel commented 7 years ago

@azneto okay will do!

The package name obo on PyPI would be ideal, but it's taken. @lyschoening is your obo project still under development, or would you consider donating that PyPI handle for this project? If so, I think you can go to this URL (once logged in to PyPI) and add me as an owner.

lyschoening commented 7 years ago

It may be a bit selfish to book a name like that, but I reserved it out of frustration over the many existing Python-based parsers that are neither feature-complete nor reusable.

I haven't had much time to actually implement it, and I'd be happy to give the name away to someone who would try their hand at a faithful OBO read/write implementation.

However, it seems that this package is some sort of OBO-to-networkx converter that would only be useful to a very small subset of people who work with OBO, so maybe a different name would be better — unless you have larger ambitions for this project?

azneto commented 7 years ago

dhimmel made a really wise decision to load the data into a networkx graph. The graph produced by the obo library contains all of the content of the .obo file. This data structure is very well documented and comes with lots of processing functions. Therefore, I disagree that it would be useful to only a small subset of people.

I tried other obo libraries and dhimmel's is the only one that really nailed it. The only thing it lacks is the write/export to obo function.

dhimmel commented 7 years ago

@lyschoening you're correct that this package currently only implements reading OBO files. I chose to encode the ontologies in networkx, since it provides the most pythonic representation and has the appropriate functionality.

I haven't implemented writing, since OBO is an unnecessarily complicated and poorly standardized format. Therefore, my use case (and focus) has been to get data out of OBO, but not to encode data back into OBO. However, write functionality is within scope of this repository, and I will assist with any contributions to enable it.

It's worth noting that the OBO format is on its way out (see https://github.com/dhimmel/obo/issues/2). Most ontologies have deprecated their OBO exports and instead rely foremost on OWL. In addition, there are a host of graph-serialization solutions that use more versatile formats such as JSON or XML.

When I first created this codebase, I was surprised that a canonical OBO parser for Python didn't already exist. I reluctantly wrote my own read implementation, which attempts to follow the specification as closely as possible. This was two years ago. As the OBO format continues to be replaced by more modern alternatives, I don't see the situation improving. In other words, I don't think waiting will result in a better OBO implementation in Python.

OBO-to-networkx converter that would only be useful to a very small subset of people who work with OBO

I expect that the primary need for Python users is reading OBO-formatted ontologies. Even something basic, like getting a mapping of term ids to names, is difficult without this package.
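To make the id-to-name mapping concrete, here is a minimal sketch. It builds a tiny stand-in graph by hand rather than reading an OBO file, and it assumes the parsed ontology is a networkx MultiDiGraph whose nodes are term ids carrying a `name` attribute. The `EX:` ids are made up for illustration.

```python
import networkx as nx

# Hypothetical stand-in for a graph parsed from an OBO file:
# node keys are term ids, node attributes hold the term's tags.
graph = nx.MultiDiGraph()
graph.add_node("EX:0000001", name="species")
graph.add_node("EX:0000002", name="subspecies")

# Map term ids to names, then invert the mapping to look up ids by name.
id_to_name = {id_: data.get("name") for id_, data in graph.nodes(data=True)}
name_to_id = {name: id_ for id_, name in id_to_name.items() if name is not None}

print(id_to_name["EX:0000001"])
print(name_to_id["subspecies"])
```

Using `data.get("name")` rather than `data["name"]` keeps the comprehension from failing on terms that lack a name.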

maybe a different name would be better

I'm not totally opposed to a different name and respect your stewardship of the domain. So up to you.

@cmungall (ontology expert) do you have any thoughts on whether obo is an appropriate package name?

dhimmel commented 7 years ago

@lyschoening in support of my "application" for the obo PyPI handle :smirk:, I've added some features, tests, and documentation:

  1. Usage section of the readme.
  2. Ability to read OBO files from paths and URLs, including compressed files.
  3. Tests for parsing the taxrank ontology and three ontologies from OBOFoundry (GO, DO, PATO)
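For readers curious how compressed files can be read transparently, one possible approach is to pick a standard-library opener from the file suffix. This is only a sketch under that assumption, not necessarily how this package does it; `open_text` and `OPENERS` are hypothetical names, and URL handling is left out.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Map filename suffixes to the stdlib opener that can decompress them.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path, encoding="utf-8"):
    """Open a possibly compressed file for reading text (hypothetical helper)."""
    suffix = os.path.splitext(path)[1].lower()
    # Fall back to the built-in open() for uncompressed files.
    opener = OPENERS.get(suffix, open)
    return opener(path, mode="rt", encoding=encoding)


# Round-trip a small gzip-compressed file to show the helper in use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.obo.gz")
    with gzip.open(path, mode="wt", encoding="utf-8") as handle:
        handle.write("format-version: 1.2\n")
    with open_text(path) as handle:
        first_line = handle.readline().strip()

print(first_line)
```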

cmungall commented 7 years ago

I'm less familiar with PyPI guidelines than with things like Maven, where there are namespaces to avoid conflicts.

I might tend towards oboformat, to distinguish the legacy format from the active community of Open Biological Ontologies, but I don't have strong opinions.

It perhaps depends if the goal is to expand into a more general library. As you point out the datamodel can be delegated to networkx for the majority of bioinformatics applications (though others may want dedicated objects for lexical elements).

As the JSON format takes off, we'll want an easy-to-use Python library for mapping this to networkx. It's fairly trivial to do this:

https://github.com/biolink/biolink-api/blob/cb9b3d50b9e301d5f01dedda75fd02459ad821a4/obographs/obograph_util.py#L20-L54
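A rough sketch of what such a mapping could look like follows. It is not the linked code. It assumes an OBO Graphs style JSON layout in which nodes carry `id` and `lbl` and edges carry `sub`, `pred`, and `obj`; the sample document and the helper name `obograph_to_networkx` are made up for illustration.

```python
import json

import networkx as nx

# A tiny, made-up document in the assumed OBO Graphs JSON layout.
doc = json.loads("""
{"graphs": [{
  "nodes": [{"id": "EX:1", "lbl": "cell"}, {"id": "EX:2", "lbl": "neuron"}],
  "edges": [{"sub": "EX:2", "pred": "is_a", "obj": "EX:1"}]
}]}
""")


def obograph_to_networkx(graph_doc):
    """Convert one graph dict into a MultiDiGraph (hypothetical helper)."""
    graph = nx.MultiDiGraph()
    for node in graph_doc.get("nodes", []):
        graph.add_node(node["id"], name=node.get("lbl"))
    for edge in graph_doc.get("edges", []):
        # The edge points from subject to object; the predicate is the edge key.
        graph.add_edge(edge["sub"], edge["obj"], key=edge["pred"])
    return graph


graph = obograph_to_networkx(doc["graphs"][0])
print(graph.nodes["EX:2"]["name"], list(graph.edges(keys=True)))
```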

I'd be interested in coordinating on any such library.

I don't have much to add; you summarized the situation perfectly. I don't think there is a need for either a fully complete OBO parser or an OBO writer.

dhimmel commented 7 years ago

I might tend towards oboformat, to distinguish the legacy format from the active community of Open Biological Ontologies, but I don't have strong opinions.

@cmungall I like the suggestion. The downside is I'll have to rename the repo and package to oboformat. But since there aren't a ton of users currently, it's not a bad time for a breaking change. I agree that an "Open Biological Ontologies" package may be worthy of the name in the future.

It perhaps depends if the goal is to expand into a more general library.

I've written a bunch of ontology reasoning functions atop networkx, which I may at some point migrate to this repo.

I'd be interested in coordinating on any such library.

Certainly, happy to coordinate formats, standards, or anything else between this repo and the JSON codebase.

In conclusion, oboformat is a good name for the current package, but could be a bit confining going forward. Since @lyschoening has the final say regarding pypi/obo, what's your opinion?

lyschoening commented 7 years ago

It perhaps depends if the goal is to expand into a more general library. As you point out the datamodel can be delegated to networkx for the majority of bioinformatics applications (though others may want dedicated objects for lexical elements).

I'd be one of those people who'd want dedicated objects for lexical elements. Like @dhimmel I couldn’t find a proper parser and began to write one. For reference, I have now published what I wrote so far over here: https://github.com/biosustain/obo As you can see, while it needs some more work, it is faithful to the specification and quite complete in what it parses and the data structures it parses into.

It’s very good news that the OBO format is on its way out. Keeping that in mind I do not want to hog the “obo” name for a soon-obsolete parser.

I’d still rather donate the name to a complete solution, whatever that might mean. My recommendation would be to name this package oboformat as suggested, or perhaps obo-networkx if the scope will remain more limited. I am also willing to rename my own parser to oboformat, but in that case I still think the “obo” name should go to a more complete solution.

Perhaps we can join forces somehow. What I am really interested in is taking an ontology term and checking if it is_a another ontology term. Knowing very little about networkx, I assume a graph library will be well suited to answering these questions. On the other hand, it's good to have a structured mapping to fall back on.

dhimmel commented 7 years ago

Okay, I think I'll go with obonet. I like short names. It also seems that hyphens or underscores in package names can cause issues -- I like when the name of a package is the same in all contexts.

What I am really interested in is taking an ontology term and checking if it is_a another ontology term.

@lyschoening, you're interested in transitive closure I believe. Is X a subtype of Y, according to is_a relationships? I'll try to add some examples on how to answer this sort of question.
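As a first sketch of the subtype question, the snippet below uses plain networkx reachability. It assumes edges run from child to parent and that every edge in the graph is an is_a edge; the `EX:` terms and the `is_subtype` helper are made up for illustration.

```python
import networkx as nx

# Made-up terms; every edge here is an is_a edge from child to parent.
graph = nx.MultiDiGraph()
graph.add_edge("EX:dog", "EX:mammal", key="is_a")
graph.add_edge("EX:mammal", "EX:animal", key="is_a")
graph.add_edge("EX:oak", "EX:plant", key="is_a")


def is_subtype(graph, term, other):
    """Return True if `other` is reachable from `term` along the edges."""
    # With child -> parent edges, the superterms of a term are its
    # networkx "descendants" (nodes reachable by following edges forward).
    return other in nx.descendants(graph, term)


print(is_subtype(graph, "EX:dog", "EX:animal"))
print(is_subtype(graph, "EX:animal", "EX:dog"))
```

Note the naming quirk: because edges point toward parents, `nx.descendants` yields superterms and `nx.ancestors` yields subterms under this assumption.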

dhimmel commented 7 years ago

Package on PyPI

@azneto the package is now on PyPI as obonet.

To install you can do:

pip install obonet

Then all you need to do is:

import obonet

graph = obonet.read(url_or_path)

I configured Travis CI for continuous deployment. When a new tag is added, Travis automatically deploys the new version to PyPI.

Tutorial

@lyschoening check out the new tutorial notebook. I think it could cover the operations you're interested in.

cmungall commented 7 years ago

you're interested in transitive closure I believe. Is X a subtype of Y, according to is_a relationships? I'll try to add some examples on how to answer this sort of question

For a great many use cases, this is as simple as making a networkx object with all edges for a given set of relations (edge labels) of interest, and then asking for ancestors. In fact, in many cases the 'basic' versions of ontologies are pre-filtered in this way. In other cases you might want to filter explicitly. For example, in the SO, if you want all transcript objects or all exon objects, you'd want to use only is_a (so you don't end up with parts). Or with an anatomy ontology you may want to restrict gene expression queries to is_a and part_of (and not propagate over develops_from, otherwise you end up with everything expressed in the embryo).
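The relation filtering described above might be sketched as follows. The snippet assumes the relation type is stored as the edge key of a MultiDiGraph and that edges point from child to parent; the anatomy-like `EX:` terms and the `superterms` helper are made up for illustration.

```python
import networkx as nx

# Made-up terms connected by three different relation types.
graph = nx.MultiDiGraph()
graph.add_edge("EX:neuron", "EX:cell", key="is_a")
graph.add_edge("EX:axon", "EX:neuron", key="part_of")
graph.add_edge("EX:neuron", "EX:neuroblast", key="develops_from")


def superterms(graph, term, relations):
    """Collect terms reachable from `term` using only the given relations."""
    kept = [(u, v, k) for u, v, k in graph.edges(keys=True) if k in relations]
    filtered = graph.edge_subgraph(kept)
    # A term with no edges of the chosen relations has no superterms.
    if term not in filtered:
        return set()
    return nx.descendants(filtered, term)


only_is_a = superterms(graph, "EX:axon", {"is_a"})
with_parts = superterms(graph, "EX:axon", {"is_a", "part_of"})
everything = superterms(graph, "EX:axon", {"is_a", "part_of", "develops_from"})
print(only_is_a, with_parts, everything)
```

Widening the relation set changes the answer: restricting to is_a leaves the part with no superterms, while including develops_from also pulls in the developmental precursor.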

For other more advanced cases you actually want to factor the relation semantics in. For example, using chaining rules such as negatively_regulates o negatively_regulates -> positively_regulates, or using a hierarchy of relationship types (see RO for examples). While this can be done in an ad-hoc way, at this point you want to look at using an OWL reasoner which will take into account the semantics of the relations. But to stress, this is beyond most typical needs as this kind of computation is typically done ahead of time as part of the ontology release process with useful high level edges cached in the release version.
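For illustration only, the ad-hoc route for a chaining rule could look like the sketch below. It applies a single pass of the one rule mentioned above to made-up triples; `CHAIN_RULES` and `apply_chain_rules` are hypothetical names, and this is no substitute for an OWL reasoner.

```python
# Chaining rules: (first relation, second relation) -> inferred relation.
CHAIN_RULES = {
    ("negatively_regulates", "negatively_regulates"): "positively_regulates",
}

# Made-up edges as (subject, relation, object) triples.
edges = {
    ("EX:a", "negatively_regulates", "EX:b"),
    ("EX:b", "negatively_regulates", "EX:c"),
}


def apply_chain_rules(edges, rules):
    """Return the edges plus those inferred by one pass of the chaining rules."""
    inferred = set(edges)
    for s1, r1, o1 in edges:
        for s2, r2, o2 in edges:
            # Two edges chain when the first edge's object is the second's subject.
            if o1 == s2 and (r1, r2) in rules:
                inferred.add((s1, rules[(r1, r2)], o2))
    return inferred


result = apply_chain_rules(edges, CHAIN_RULES)
print(sorted(result))
```

A single pass only combines edges that are already asserted. A fuller treatment would repeat until no new edges appear and would need rules for mixed chains, which is where a reasoner becomes the better tool.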