biolink / ontobio

python library for working with ontologies and ontology associations
https://ontobio.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
119 stars 30 forks source link

Use obonet obo parser as stopgap #179

Open cmungall opened 6 years ago

cmungall commented 6 years ago

Currently we convert obo to json using the java obographs library (via an owltools wrapper), which creates as odd dependency

The overall goal is stop consuming obo, but in the interim, let's swap in @dhimmel's obonet obo parser (which already maps to a networkx object - but check the directionality). Either via pypi or direct code lift... https://github.com/dhimmel/obonet/

dhimmel commented 6 years ago

Glad obonet will be able to help. Since this project appears to be Python 3 only, I don't envision any issues with installation from PyPI.

Let me know if anything comes up that I can help with. Happy to make small updates as necessary to obonet to meet your needs here.

dustine32 commented 6 years ago

I hooked in the obonet parser for handling .obo files. However, the result totals seem to differ from how ontobio was previously parsing the owltools JSON output.

Using go_1.0.obo file:

So (at least in total) 4180 nodes appear to not be getting parsed out with obonet.

I'll look into finding what exactly is differing between the two methods.

cmungall commented 6 years ago

probably treatment of alt_id?

Also can you do a test to make sure the results make semantic sense, the way things are represented within networkx may differ

dustine32 commented 6 years ago

@cmungall thanks for the pointer!

Would examples of testing for semantic sense be calling .search(), .traverse_nodes(), and .label() on the resulting ontology as in test_wd_ontol.py?

cmungall commented 6 years ago

try test_graph from test_local_json, but using https://github.com/geneontology/obographs/blob/master/src/test/resources/nucleus.obo as input

dustine32 commented 6 years ago

OK, lifting test_graph for nucleus.obo causes it to fail on the first assert checking for correct ID. Looks like the label checks will also fail since none of the labels are being grabbed.

I can try adding some more parsing logic inside ontobio to finish up the obonet-supplied graph with the goal to pass all of these asserts.

dhimmel commented 6 years ago

Looks like the label checks will also fail since none of the labels are being grabbed

Should obonet be parsing these labels? If so, I could see about adding that.

dustine32 commented 6 years ago

@dhimmel That would be awesome, thanks! Looking at the obonet code, it seems to be saving the term's properties in the node attributes, but I'm having trouble getting the attributes out in debugging.

dustine32 commented 6 years ago

OK @dhimmel , I'm able to get the labels/names out of the obonet-supplied nodes now. Simply an issue with my networkx knowledge. So I don't think you'll need to do anything for this.

dustine32 commented 6 years ago

So as I'm fighting through the semantic tests I realize that I'm essentially filling in owltools/obographs logic, some of it is making assumptions about the source obo file based on the networkx graph coming out of obonet. @cmungall Does this seem like an incorrect route for handling the obonet vs owltools result discrepancies?

If not, I'm thinking of two other options:

  1. Write a translation of owltools/obographs logic for python that will convert to JSON, bypassing obonet, then pass JSON into ontobio to create networkx.
  2. Do everything in obonet - insert both the owltools/obographs and networkx ontobio logic (primarily, the OboJsonMapper.add_obograph_digraph() stuff).
cmungall commented 6 years ago

I think I'm following but shall we just check with some examples? I think there is no guarantee that obonet will use the same choices about node properties (e.g. 'name' vs 'label') and edge properties (e.g. 'relation'). Is this what you mean?

dustine32 commented 6 years ago

Sorry, shoulda specified my latest issue. The 'name' vs 'label' thing is simple enough since obonet is kind enough to pass on all OBO node properties to the graph node, but the obonet graph is also failing this assert:

ancs = ont.ancestors(NUCLEUS)
assert CELL in ancs

Looking further, I think the graph directionality (as you heeded above) might be reversed (e.g. "cell part" is parent of "cell"):

>>> for n in ont.nodes():
...     print(n, ont.label(n), len(ont.ancestors(n)))
...
GO:0005575 cellular_component 9
GO:0005622 intracellular 4
GO:0005623 cell 6
GO:0005634 nucleus 0
GO:0043226 organelle 4
GO:0043227 membrane-bounded organelle 2
GO:0043229 intracellular organelle 2
GO:0043231 intracellular membrane-bounded organelle 1
GO:0044424 intracellular part 3
GO:0044464 cell part 5
BFO:0000050 part_of 0

>>> ont.ancestors('GO:0005623') # cell
['GO:0005634', 'GO:0044464', 'GO:0043231', 'GO:0005622', 'GO:0043229', 'GO:0044424']
>>> ont.children('GO:0005623') # cell
['GO:0005575']
>>> ont.parents('GO:0005623') # cell
['GO:0044464']

I guess this is an obonet issue itself unless there's something off about the nucleus.obo input that would cause this. I can play around more and suggest an obonet change if needed.

dhimmel commented 6 years ago

Looking further, I think the graph directionality (as you heeded above) might be reversed (e.g. "cell part" is parent of "cell")

The nomenclature with networkx gets confusing here. We encode directionality like "child --> parent", where the relationship type is something like is_a. I believe this is the standard approach for ontologies, but notice the directionality appears backwards.

ancestors is a networkx function. I think you actually want descendants to find the supernodes.

I'm open to any suggestions on how to improve this.

dustine32 commented 6 years ago

Hey again @dhimmel ! Sorry about the delay. I followed the route from NUCLEUS (GO:0005634) to CELL (GO:0005623) in the nucleus.obo file and it looks like it's a mix of is_a, part_of relationships: image Of course, the pic here is pulled from the entire GO ontology whereas the input for the obonet-parsed results above is the slimmed-down nucleus.obo file, which I just now committed. I'll take a look at how networkx's ancestors function traverses the graphs, it may only be following is_a relations...

Thanks again for your help!

cmungall commented 6 years ago

ontobio encodes subject-predicate-object triples via add_edge(object, subject, {relation=predicate})

This seems a bit backwards, but it means that nucleus is a descendant of cell, rather than vice versa

dustine32 commented 6 years ago

@cmungall Ahh, so that's what's going on. Thanks!

Then it looks like to get obonet to conform with what ontobio expects we'd have to change the add_edge() call within obonet.

Maybe we can add an option to obonet to reverse subject, object when adding edges? I think ontobio already has some option like this for something else. @dhimmel You think this sounds good? I can code the change in obonet and test it.

dustine32 commented 6 years ago

Found it: there's an invert_subject_object option in ontobio.golr_query. It looks like it's specifically for disease ontology stuff.

dhimmel commented 6 years ago

Maybe we can add an option to obonet to reverse subject, object when adding edges?

I agree an invert option for obonet.read_obo would be useful and happy to review a pull request for that... not sure if you still need obonet to have this functionality given the invert_subject_object function you mention above.

I'll take a look at how networkx's ancestors function traverses the graphs, it may only be following is_a relations...

networkx.acensectors is oblivious to relationship types, so it will follow all relationship types. We could write a smarter version that would only traverse the specified relationship types, if that would be useful.

cmungall commented 6 years ago

While we're just intending to use the parsing part of obonet for now I like to explore design patterns and see if there is opportunity for further reuse

in ontobio we have a wrapper object (ontol.py) onto networkx. You can pass options to the ontol.py to create a new graph with the filtered relations.

http://ontobio.readthedocs.io/en/latest/_modules/ontobio/ontol.html#Ontology.ancestors

I'm not super-happy with it. You incur a cost every time you make the filtered graph. Cacheing is an obvious approach but then for N relations you have 2^N potential cached graphs, so we'd need a swap in/out layer to avoid memory unpredictability...

We could do our own traversals but I assume networkx is utilizing nicely optimized code behind the scene, and at some point we end up rewriting neo4j or an owl reasoner...

dustine32 commented 6 years ago

For addressing directionality, we can probably take the networkx graph provided by obonet and just reverse the nodes on a copy.

cmungall commented 5 years ago

See also @althonos's https://pypi.org/project/fastobo/ -- this is complete implementation of the full obo spec, so we should switch to this