Ontology handling update required

jeromekelleher commented 8 years ago

The handling of the ontology term mapping in the sql reqo is currently quite brittle, and needs some attention. The most important issue that we need to understand is where the ontology term mapping file comes from, and how we version it. Ideally, I think it would be good if we could get rid of the intermediate text file, and create the term name -> ID mapping as a table in the DB.

Once this is done, we should set up a proper relational link between variant sets and ontologies using foreign key constraints, with a properly designed table.

Any thoughts or discussion on this would be much appreciated.

sarahhunt commented 8 years ago

@jeromekelleher - the SO is available here: https://github.com/The-Sequence-Ontology/SO-Ontologies/blob/master/so-xp.obo

There is a python OBO parser. There is a data-version date in the file, but I'm not sure how this relates to the version number on the browser: http://www.sequenceontology.org/browser/obob.cgi

@keilbeck manages the SO so would be able to advise on versioning.

jeromekelleher commented 8 years ago

Thanks a million @sarahhunt, this is hugely useful. I'm sure we can build solid ontology support based on this!

jeromekelleher commented 8 years ago

Here's a snipped using the ontology library above:

from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import orangecontrib.bio.ontology

filename = "so-xp.obo"
ontology = orangecontrib.bio.ontology.OBOOntology(filename)
for j, term in enumerate(ontology.terms()):
    print(term.id, term.name)
    if j == 10:
        break

print()
print("Looking up term names...")
name = "sequence_variant_affecting_polypeptide_function"
term = ontology.term_by_name(name)
print(term)

Gives

SO:0000000 Sequence_Ontology
SO:0000001 region
SO:0000002 sequence_secondary_structure
SO:0000003 G_quartet
SO:0000004 interior_coding_exon
SO:0000005 satellite_DNA
SO:0000006 PCR_product
SO:0000007 read_pair
SO:0000008 gene_sensu_your_favorite_organism
SO:0000009 gene_class
SO:0000010 protein_coding

Looking up term names...
[Term]
id: SO:1000117
name: sequence_variant_affecting_polypeptide_function
comment: OBSOLETE: This term was deleted as it conflated more than one term. The alteration is separate from the effect.
synonym: "mutation affecting polypeptide function" RELATED []
synonym: "sequence variant affecting polypeptide function" EXACT []
is_obsolete: true
replaced_by: SO:0001554

Any reason not to use this directly in the server? All we need really is term name to ID mapping, which the library does directly from a well known format.

jeromekelleher commented 8 years ago

My proposal here would be

Update the Ontology table in the DB so that it has an integer primary key, and add a sequenceOntology column to the SequenceAnnotationSet and FeatureSet tables with foreign key constraints.
Discard our current datamodel/ontotologies.py module, and use the OBOOntology class instead and use a standard .obo ontology file as the backing store.

That should make it possible to have an easily documented recipe for supporting ontologies when using SequenceAnnotationSets and FeatureSets. Any objections to this plan?

diekhans commented 8 years ago

would obo loaded into a table?

This sounds very reasonable; although we should take a look at what is takes to existing ontology package and get true ontology lookups. If we are important the ontologies, we are there data wise.

words of wisdom from @cmungall would be appreciated

Jerome Kelleher notifications@github.com writes:

My proposal here would be

Update the Ontology table in the DB so that it has an integer primary key, and add a sequenceOntology column to the SequenceAnnotationSet and FeatureSet tables with foreign key constraints.

Discard our current datamodel/ontotologies.py module, and use the OBOOntology class instead and use a standard .obo ontology file as the backing store.

That should make it possible to have an easily documented recipe for supporting ontologies when using SequenceAnnotationSets and FeatureSets. Any objections to this plan?

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub*

jeromekelleher commented 8 years ago

I don't think there's much point in loading obo into a table --- the parser reads in the full sequence ontology file I got in less than a second, and I'm going to assume that they have a sensible algorithm for looking up names/IDs. If we load the name/ID map into a table, we're basically going to replicate this, and then have to re-ingest the data when we realise that we've missed some important aspect of the ontology data.

cmungall commented 8 years ago

I'm lacking the full context but it sounds like parsing the obo into an in-memory lookup table is sufficient for your purposes. If your use case expands to include graph closure (e.g. 'RNA' and all subtypes) or search (e.g. matching based on synonyms as well as primarily labels) the orange ontology library should suffice. Of course, omitting it from the sql schema will hamper your ability to do joins and generally do certain kinds of operations in a uniform fashion, but YMMV here.

If you do ever need to model the ontology in a sql schema there are some basic patterns and anti-patterns to follow, a good place to start is http://gmod.org/wiki/Chado_CV_Module#Tables

jeromekelleher commented 8 years ago

Thanks @cmungall, this is very useful. I'm hoping the in-memory model the orange library gives us will keep us going until we have more requirements (which are quite simple at the moment). Thanks very much for the schema links, this will be very useful in the future I'd imagine!

@diekhans, what do you think? You have a better idea of where our ontology support is going; should we use the library directly for now (very quick to implement), or go straight for a full SQL ontology model?

diekhans commented 8 years ago

perfect; table in memory :-)

Jerome Kelleher notifications@github.com writes:

I don't think there's much point in loading obo into a table --- the parser reads in the full sequence ontology file I got in less than a second, and I'm going to assume that they have a sensible algorithm for looking up names/IDs. If we load the name/ID map into a table, we're basically going to replicate this, and then have to re-ingest the data when we realise that we've missed some important aspect of the ontology data.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub*

diekhans commented 8 years ago

@jeromekelleher it's a tactical implementation decision, of which you guys are the best judge.

We do need to move towards doing full ontology queries in the near future. It's kind of the point of representing data using ontologies.

If it was me, I would spend a small about of time looking at the library Chris recommended. If it looks like a small amount of additional work gear for ontology queries, then the inclination would be to do it. If it's a larger amount of work, I would go with the quick solution that is going to be thrown away.

Obvious problems is complete lack of metrics how much works is involved. However, writing code is the easy part, knowing what to do is much harder. Fine to do quick and dirty as long as one knows change will come.

Jerome Kelleher notifications@github.com writes:

Thanks @cmungall, this is very useful. I'm hoping the in-memory model the orange library gives us will keep us going until we have more requirements (which are quite simple at the moment). Thanks very much for the schema links, this will be very useful in the future I'd imagine!

@diekhans, what do you think? You have a better idea of where our ontology support is going; should we use the library directly for now (very quick to implement), or go straight for a full SQL ontology model?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

jeromekelleher commented 8 years ago

Tactically, I think we should use the orangecontrib.bio.ontology --- it's a huge improvement over what we have, and the changes required are trivial.

jeromekelleher commented 8 years ago

OK, here's a quick way forward for this:

Update the ontology table to have a integer primary key id, and another string field class (or something like this?), which contains things like "SO", "GO", following the usual convention within ontologies.
Update our existing OntologyTermMap class to use the orangecontrib.bio.ontology class to do all the actual lookups.
Change the add-ontology command to expect and OBO ontology file.
Create a field in the DB tables for VariantAnnotationSet and FeatureSet called sequenceOntologyId with the usual foreign key constraints. When we add a VariantAnnotationSet we automatically look for an ontology with class == "SO". If exactly one exists, we use this. If there is none, we error out and ask the user to add a sequence ontology. If more than one sequence ontology exists we ask the user to choose one by supplying the ontology name.

This should be reasonably robust I think.

diekhans commented 8 years ago

Nice! go for it or pass it to David.

Jerome Kelleher notifications@github.com writes:

Tactically, I think we should use the orangecontrib.bio.ontology --- it's a huge improvement over what we have, and the changes required are trivial.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub*

cmungall commented 8 years ago

Minor terminological point: class is used in the ontology and W3C world to mean what we're calling term here. So I would call your new field something like ontology or idspace or ontologyPrefix

jeromekelleher commented 8 years ago

Fantastic, thanks @cmungall, this is exactly the type of stupid mistake we want to avoid making. ontologyPrefix it is!

jeromekelleher commented 8 years ago

Ah, there's the catch. It turns out that the Orange Bioinformatics package here depends on Orange, which depends on scipy... These are pretty hefty dependencies, which it would be nice to avoid. Having said that, there doesn't really seem to be any other Python obo parsers out there, and big dependencies are becoming less of an issue with pip caching builds, and things like bioconda.

So, I think the options are:

Use this library, and accept the dependencies.
Write our own simple OBO parser. For what we actually want right now (name<->ID maps), this would be pretty easy I think.
Revert to the text file of (name, ID) tuples which the administrator needs to magic up from somewhere.

@diekhans, what do you think?

cmungall commented 8 years ago

There is also https://github.com/tanghaibao/goatools/blob/master/goatools/obo_parser.py

But I think for what you want, 2 is fine.

jeromekelleher commented 8 years ago

Thanks @cmungall --- how specific to Gene Ontology is this parser do you think? I see a lot of GO stuff in there.

cmungall commented 8 years ago

To be honest I haven't done a full evaluation, the class names appear to sound GO specific, but look reasonably generic. The id2int method has a dangerous assumption, but isn't called.

The lack of a really rigorous and generic python lib has been bothering me for a while. We do a lot of heavy lifting with ontologies using JVM languages and OWL, but of course the majority of the bioinformatics community is using languages like python and obo files. I had the beginnings of a python library intended to be future proof working off of the RDF serialization of OWL, but it turned out rdflib was unfortunately ridiculously slow. I am to return but using a standard JSON/YAML serialization that is both expressive enough for OWL but also caters to the 99% case of doing ID/label lookups etc... but not there yet.

jeromekelleher commented 8 years ago

Thanks @cmungall, I've tried it out and it works well enough to get us started. Let us know if you every get around to the heavy-duty Python ontology package --- it's badly needed!

kozbo commented 8 years ago

@jeromekelleher should we postpone merging this until after we create the official sql pypi build? Will these changes require regression testing the Sequence Annotations?

jeromekelleher commented 8 years ago

The proposed changes are in #1225 @kozbo, we can see what's involved there. The code we currently have is very brittle around ontologies, I would rather not release it.

kozbo commented 8 years ago

@jeromekelleher ok, I am convinced by both you and the two guys that sit next to me :-) @macieksmuga is doing a review

jeromekelleher commented 8 years ago

Closed by #1225.

ga4gh / ga4gh-server

Ontology handling update required #1147