Collaboration on obo parser and data model

dhimmel / obonet

OBO-formatted ontologies → networkx (Python 3)

https://github.com/dhimmel/obonet/blob/main/examples/go-obonet.ipynb

Other

136 stars 28 forks source link

Collaboration on obo parser and data model #4

Closed lyschoening closed 3 years ago

lyschoening commented 7 years ago

@dhimmel

I took the liberty of seeing what it would take to implement obonet with my obo package:

import networkx
from obo import Ontology

def read_obo(path_or_file):
    #with open_read_file(path_or_file) as fp:#
    with open(path_or_file, 'r') as fp:
        ontology = Ontology.read(fp)

    graph = networkx.MultiDiGraph(
        name=ontology.ontology,
        typedefs=ontology.typedefs,
        instances=ontology.instances,
        **ontology.tags)

    edge_tuples = []
    for term in ontology.terms:
        if term.is_obsolete:
            continue

        graph.add_node(term)

        for target_term in term.is_a:
            edge_tuples.append((term, 'is_a', ontology.terms[target_term]))

        for relationship in term.relationships:
            edge_tuples.append((term, relationship.type, ontology.terms[relationship.target_term]))

    for term_a, type_, term_b in edge_tuples:
        graph.add_edge(term_a, term_b, key=type_)

    return ontology, graph

The lookup I wanted to do checking if X is_a "sequence_feature" in SO. Ignoring for a minute that there are other relationships than "is_a" (as @cmungall pointed out), the lookup would be something like:


ontology, graph = read_obo('tests/files/so-xp.obo')

exon = ontology.term_by_name('exon')
sequence_feature = ontology.term_by_name('sequence_feature')

print(networkx.descendants(graph, exon))  # {<Term SO:0000001 ! region>, <Term SO:0000704 ! gene>, ...}
print(sequence_feature in networkx.descendants(graph, exon))  # True

If it would be interesting to you to collaborate on the parser and data model, let me know. It would be good not to be the only maintainer of the parser and for there to be a package that makes use of it. On the other hand I also understand if you prefer to do your own thing.

dhimmel commented 7 years ago

@lyschoening nice implementation. I think it would make sense to add your read_obo function to biosustain/obo. If you open a pull request on biosustain/obo, I'll leave the few comments I have on your implementation. This would result in two implementations to go from OBO to networkx.

While I like avoiding duplication of efforts, I think there'd be a valuable testing motivation. I'd like to see whether both packages produce the same networks. I'm not sure about your experience, but it's been hard to know whether my parsing is working... since there is no ground truth to test against. The greatest difficultly is likely in parsing node attributes.

It would be good not to be the only maintainer of the parser and for there to be a package that makes use of it.

I'm hesitant to switch obonet to use biosustain/obo's parsing, since I think your method is more complex (at least in lines of code). But if we find out that my approach is failing in ways that are not easy to rectify, I'd consider making biosustain/obo a dependency.

As @cmungall pointed out, there's additional networkx logic that would be nice (specifying which relationship types you want to traverse). For this work, I agree that we should find a way not to duplicate efforts. If biosustain/obo and obonet produce the same networkx data structure and are tested against each other, that will be easy.

lyschoening commented 7 years ago

@dhimmel I would not want to add networkx to biosustain/obo. After all, this is your work. The code is in this issue and is straightforward to test.

While I like avoiding duplication of efforts, I think there'd be a valuable testing motivation. I'd like to see whether both packages produce the same networks. I'm not sure about your experience, but it's been hard to know whether my parsing is working... since there is no ground truth to test against. The greatest difficultly is likely in parsing node attributes.

It is indeed quite difficult to parse tag values, which is why I was hoping to work together on a single parser implementation. Some of the complexity of my parser implementation comes down to trying to parse relevant tags into more accessible formats. So if you start looking at other tags such as the remaining relationship types, xrefs or subsets, you might find the package quite useful!

I went ahead and implemented a writer for biosustain/obo. Using it on taxrank and SO helped me spot some of the remaining issues in the parser. The result is the identical bit-for-bit, which is a good enough baseline for me although there are certainly still some more issues left to iron out. I will attempt to publish a version of obo this week and will keep an eye on your progress with obonet.

dhimmel commented 7 years ago

The result is the identical bit-for-bit, which is a good enough baseline for me although there are certainly still some more issues left to iron out.

Awesome. The round-trip equivalence does provide a nice degree of confidence. You could even evaluate the round-trip for every OBO from OBOFoundry.

I will attempt to publish a version of obo this week and will keep an eye on your progress with obonet.

Cool. My stance is to give obonet and biosustain/obo some time to mature in the wild. If enough people are using obonet that it makes sense to continue developing it and biosustain/obo become the de facto OBO parser for python, I'd be happy to outsource parsing to biosustain/obo.

cmungall commented 7 years ago

re: ground truth

convert the obo to obographs json using the java here: https://github.com/geneontology/obographs this can be regarded as the reference translation

it should be very straightforward to define a reference translation from obograph-json to networkx

cmungall commented 7 years ago

FYI

We're getting ready to split out our python ontology access library from our flaskrest services layer: https://github.com/biolink/biolink-api/issues/49

We're thinking of a new repo and pypi module something like 'ontobio'. It's fairly different from obonet in that it doesn't even attempt obo parsing (it takes obojson, or it can connect to a SPARQL server and obtain the ontology from there; it can take obo as input but it calls a java program via shell which is a bit icky). It's aligned in that it defers most of its object model to networkx (although it does have some additional capabilities beyond what you would want to do with networkx, e.g. synonyms)

dhimmel commented 7 years ago

@cmungall great can't wait to try out biolink/ontobio.

Ideally, any networkx traversal functionality I write in obonet can also work on an ontobio graph (i.e. similar networkx structures). I'd like there to be an easy transition for migrating from reading OBO files to the JSON files.

Are you planning on updating the readme to document reading a JSON into networkx in Python? If so, I'll wait till then before playing around.

zachary822 commented 5 years ago

I don't know how much interest there is still in this, but I managed to implement a more complete OBO file parser for one of my own projects. It doesn't strictly check the cardinality of tags yet, but is able to parse all known tags down to its components using pyparsing. https://github.com/zachary822/oboparse

cmungall commented 5 years ago

Thanks for the update!

Since the original post, I have been switching code to use @althonos' fastobo parser: https://pypi.org/project/fastobo/

On Thu, Jul 25, 2019 at 9:21 PM Zachary Juang notifications@github.com wrote:

I don't know how much interest there is still in this, but I managed to implement a more complete OBO file parser for one of my own projects. It still doesn't yet strictly check the cardinality of tags, but is able to parse all known tags down to its components using pyparsing. https://github.com/zachary822/oboparse

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/dhimmel/obonet/issues/4?email_source=notifications&email_token=AAAMMOJMYBPUMDPWDI7BBWTQBJ3VZA5CNFSM4DFJXTTKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD23OOQQ#issuecomment-515303234, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAMMOJ7TPOEFN3MYRPQ6HLQBJ3VZANCNFSM4DFJXTTA .

dhimmel commented 3 years ago

https://github.com/dhimmel/obonet/commit/e22bf7b40763a2cc810d4caee68e6871f5f3820e mentions the pronto_to_multidigraph function in the nxontology package, which uses pronto, which in turn uses fastobo. This stack is nice since it adds support for reading OWL and OBO Graphs JSON files as well as OBO files.

Anyways, obonet seems to work well for reading OBO files and is still probably the simplest solution for generating a networkx.MultiDiGraph from a .obo file. I'm not adding any major new features, but performing light maintenance to keep it up-to-date with Python. Users seem to like it. So closing this issue since major changes to obonet are unlikely.