althonos / pronto

A Python frontend to (Open Biomedical) Ontologies.
https://pronto.readthedocs.io
MIT License

What relationships are used to create subclasses/superclasses? #119

Open dhimmel opened 3 years ago

dhimmel commented 3 years ago

One nice aspect of pronto is that it seems able to read most ontologies using a consistent API, even though ontologies don't always follow the same standards. As seen in these docs, it is possible to get the parents and children for a term like:

# parents
term.superclasses(distance=1, with_self=False)
# children
term.subclasses(distance=1, with_self=False)

What relationships are traversed when detecting sub/super-classes? For example, would these OBO relationships contribute to sub/superclasses: is_a, has_subclass, has_part? What about the RDF relationship of rdfs:subClassOf? Does each ontology define what types of relationships (i.e. predicates) constitute sub/superclasses? Or does pronto use the same hardcoded relationships for every ontology?

Is there a way to configure which relationships are indicative of sub/superclasses?

I did see that is_a and has_subclass might be special as per:

https://github.com/althonos/pronto/blob/8e946ef746a7a2999d32e266ed1996e87659ad69/pronto/ontology.py#L450

dhimmel commented 3 years ago

A little more context here, we've created the nxontology package, which is a networkx-based representation of ontologies in Python. nxontology currently focuses on:

  1. computing intrinsic similarity scores between pairs of terms
  2. producing visualizations that show the ontology hierarchy

We're currently missing import functionality and would like to allow users to import any OWL/OBO/OBO Graphs JSON ontology file. We have private code to import EFO, MeSH, and GO, and are interested in generalizing this. Pronto seems like the best option for a tool that can read in a variety of ontologies with little-to-no custom code required. So I want to make sure I understand how Pronto is able to generalize the sub/super-class relationship across ontologies.

althonos commented 3 years ago

Hi @dhimmel ,

So basically, in older versions of pronto (pre v1), I was considering the is_a OBO clause as a relationship, and used it as well as part_of to build the subclasses, but that really wasn't properly following the semantics.

In v1 and v2, the interpretation of is_a clauses is much different, and follows the OBO 1.4 semantics, which maps it exactly to the OWL SubClassOf relationship, which in turn maps exactly to the rdfs:subClassOf. I should remove the documentation line you included, since it is outdated.

So, to answer your questions:

Does each ontology define what types of relationships (i.e. predicates) constitute sub/superclasses? Or does pronto use the same hardcoded relationships for every ontology? Is there a way to configure which relationships are indicative of sub/superclasses?

Only the is_a OBO clause is taken into account when building the sub/superclasses.

althonos commented 3 years ago

If you need some extra references to implement something semantically correct, you should have a look at:

dhimmel commented 3 years ago

Only the is_a OBO clause is taken into account when building the sub/superclasses.

Got it. Makes sense to adopt the semantically correct behavior.

references to implement something semantically correct

Thanks for the links. From the OBO syntax link:

An ontology is a DAG if the graph formed by its logical relationships does not contain any cycles. Formally: let V be a set of nodes corresponding to all Term frames in the ontology (OWL Classes). Let E be a set of edge pairs A,B where both A and B are in V and either

  • A is_a B (i.e. SubClassOf(A B)) or
  • relationship: R B (i.e. SubClassOf(A ObjectSomeValuesFrom(R B)))
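As a rough illustration of the definition quoted above, the sketch below builds the edge set E from both kinds of clauses and checks it for cycles with the standard library's graphlib. The term IDs and the `is_a` / `relationships` input dictionaries are made up for the example; this is not how pronto stores or validates ontologies.

```python
from graphlib import CycleError, TopologicalSorter


def build_edges(is_a, relationships):
    """Collect (A, B) pairs from `A is_a B` and `A relationship: R B` clauses."""
    edges = set()
    for child, parents in is_a.items():
        for parent in parents:
            edges.add((child, parent))
    for subject, clauses in relationships.items():
        for _relation, target in clauses:
            # The relation name is dropped: only the A -> B edge matters here.
            edges.add((subject, target))
    return edges


def is_dag(edges):
    """Return True if the directed graph formed by the edges has no cycle."""
    sorter = TopologicalSorter()
    for source, target in edges:
        sorter.add(source, target)
    try:
        sorter.prepare()  # raises CycleError when a cycle exists
    except CycleError:
        return False
    return True


# Hypothetical toy data, not taken from a real ontology.
is_a = {"X:2": ["X:1"], "X:3": ["X:2"]}
relationships = {"X:3": [("part_of", "X:1")]}

edges = build_edges(is_a, relationships)
print(is_dag(edges))
print(is_dag(edges | {("X:1", "X:3")}))
```

The second call adds an edge pointing back from the root to a leaf, which is the kind of cycle the definition rules out.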

Looking at the Gene Ontology relation docs, it seems like they are treating part_of along the lines of subClassOf.

A parent refers to the node closer to the root(s) of the graph, and a child to that closer to the leaf nodes; for the relations is_a and part_of, the parent would be a broader GO term, and the child would be a more specific term

It is safe to use part of to group annotations. For example if a gene product X is annotated as located in the inner mitochondrial membrane and the ontology records a part of relation between inner mitochondrial membrane and mitochondrion, we can safely conclude that X is located in a mitochondrion.

So I am thinking that some user applications might require treating additional clauses beyond is_a as subclass relations. Looks like users can access other relations besides is_a via Term.objects. So perhaps for nxontology, we might have special cases for some ontologies like GO to include additional relations as subclass-like.
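To make the grouping argument concrete, here is a minimal sketch of collecting a term's ancestors through a configurable set of relations. Two of the term names mirror the GO example quoted above, but the edge list, the `ancestors` helper, and its `follow` parameter are hypothetical and independent of both pronto and nxontology.

```python
from collections import deque

# Hypothetical toy edges as (child, relation, parent) triples.
EDGES = [
    ("inner mitochondrial membrane", "part_of", "mitochondrion"),
    ("inner mitochondrial membrane", "is_a", "membrane"),
    ("mitochondrion", "is_a", "organelle"),
]


def ancestors(term, edges, follow=("is_a",)):
    """Return every term reachable from `term` via the relations in `follow`."""
    parents = {}
    for child, relation, parent in edges:
        if relation in follow:
            parents.setdefault(child, set()).add(parent)

    seen = set()
    queue = deque([term])
    while queue:  # breadth-first walk towards the root(s)
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


strict = ancestors("inner mitochondrial membrane", EDGES)
relaxed = ancestors("inner mitochondrial membrane", EDGES, follow=("is_a", "part_of"))
print(sorted(strict))
print(sorted(relaxed))
```

With only is_a followed, an annotation on the membrane term never reaches "mitochondrion"; adding part_of to `follow` is what lets it group there.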

dhimmel commented 2 years ago

Based on user feedback in https://github.com/related-sciences/nxontology/issues/14, I'm looking to create networkx graphs with additional relationship types besides "is a", while still benefiting from pronto's readers. Here's some prototype code:

from pronto import Ontology
go = Ontology(handle="http://release.geneontology.org/2021-02-01/ontology/go-basic.json.gz")

# get example node
source = go.get_term("GO:0048518")
assert source.name == "positive regulation of biological process"

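# find all relationships for node, collecting (source, relation, target) edges
edges = []
template = "{} -- {} --> {}"
# "is a" edges come from the direct superclasses
for target in source.superclasses(distance=1, with_self=False):
    print(template.format(source.id, "is a", target.id))
    edges.append((source.id, "is a", target.id))
# all other edges come from the relationships mapping
for rel_type, targets in source.relationships.items():
    for target in sorted(targets):
        print(template.format(source.id, rel_type.name, target.id))
        edges.append((source.id, rel_type.name, target.id))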

Outputs:

GO:0048518 -- is a --> GO:0050789
GO:0048518 -- positively regulates --> GO:0008150

Some notes. This code is hardcoded to assume that pronto only parses "is a" relationships to create term superclasses.

There is some internal pronto code that creates a networkx MultiDiGraph, although it doesn't appear to include "is a" relationships:

https://github.com/althonos/pronto/blob/1909ee95fd9908be68bc0c5d15733a1f13f195e6/pronto/term.py#L217-L229

cmungall commented 2 years ago

Just came across this

a quick clarification:

OBO 1.4 semantics, which maps it exactly to the OWL SubClassOf relationship, which in turn maps exactly to the rdfs:subClassOf

this is true, but to be precise, obo format is_a maps to subClassOf between two named classes

  1. A is_a: B <=> A SubClassOf B, Class(A), Class(B)
  2. A relationship: R B <=> A SubClassOf R some B, Class(A), Class(B), ObjectProperty(R)

(this is a bit of a simplification and avoids GCIs or some relationships that are mapped to annotation triples)
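The two mappings can be sketched as a small translation function. This only illustrates the simplified rules listed above, rendered in OWL functional-style syntax; `clause_to_owl` is a hypothetical helper, not part of pronto, and it ignores GCIs and annotation-mapped relationships, as the simplification does.

```python
def clause_to_owl(subject, tag, value):
    """Translate one OBO term-frame clause into a simplified OWL axiom string."""
    if tag == "is_a":
        # Rule 1: a subclass axiom between two named classes.
        return f"SubClassOf({subject} {value})"
    if tag == "relationship":
        # Rule 2: the value holds a relation and a target, e.g. "part_of X:1".
        relation, target = value.split()
        return f"SubClassOf({subject} ObjectSomeValuesFrom({relation} {target}))"
    raise ValueError(f"unsupported clause tag: {tag!r}")


print(clause_to_owl("A", "is_a", "B"))
print(clause_to_owl("A", "relationship", "R B"))
```

Both clauses end up as SubClassOf axioms; the difference is whether the superclass is a named class or an existential restriction.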

It is nearly two decades since we came up with obo syntax and this mapping to OWL (then DAML). It was a bit of a compromise between a typical user's view and OWL/DL semantics. Dividing things up in this way had certain benefits for managing ontologies.

However, for a typical end user's view we often want to abstract over the differences between 1 and 2. A very common mistake people make is writing code for predominantly is-a-based ontologies (e.g. phenotype), applying it to GO or anatomy ontologies, and missing the crucial part-ofs. (this is much more common if people are wrestling directly with rdf/owl representations where crucial part-ofs are obfuscated in strange terminology like existential restrictions and blank nodes)

Obojson attempted to take the good abstractions people liked from obo, and it conflates 1 and 2 in its edge objects.

I have been trying to socialize this view more widely beyond oboformat users with https://github.com/cmungall/owlstar but have not had much uptake yet.

I think the decision to switch from an obograph view to an obof1.4 one in v2 is a good one for having a clear, transparent relationship between the datamodel and the format. But I think the conflation we see in obojson and nxontology is more like what some users expect.

perhaps a good compromise is to add some convenience methods or flags to pronto?

For example t.relationships(include_isa=True)

:param include_isa: [default false] if true, include isa in the list of relationships. This creates a Relationship object for rdfs:subClassOf

or a separate method, e.g. t.edges?

or perhaps just add an example to the README - it's not super obvious right now that to get all edges it's necessary to combine relationships plus superclasses(distance=1)
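A minimal sketch of that combination, written as a free function rather than a pronto method. It relies only on the two accessors already used in this thread, `superclasses(distance=1, with_self=False)` and `relationships`; the `iter_edges` name and the stub classes are hypothetical stand-ins so the snippet runs without loading an ontology, and a real pronto `Term` should be usable in their place.

```python
from dataclasses import dataclass, field


def iter_edges(term, is_a_label="is a"):
    """Yield (source_id, label, target_id) for direct parents and other relationships."""
    for parent in term.superclasses(distance=1, with_self=False):
        yield (term.id, is_a_label, parent.id)
    for relationship, targets in term.relationships.items():
        for target in targets:
            yield (term.id, relationship.name, target.id)


# --- Hypothetical stubs mimicking the small part of the Term interface used above ---
@dataclass(frozen=True)
class StubRelationship:
    name: str


@dataclass
class StubTerm:
    id: str
    parents: list = field(default_factory=list)
    relationships: dict = field(default_factory=dict)

    def superclasses(self, distance=None, with_self=True):
        # Only direct parents are modelled, which is enough for distance=1.
        return ([self] if with_self else []) + list(self.parents)


root = StubTerm("X:1")
child = StubTerm(
    "X:2",
    parents=[root],
    relationships={StubRelationship("part of"): [root]},
)
print(list(iter_edges(child)))
```

Like the prototype earlier in the thread, this assumes that superclasses are built from is_a clauses only.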

Another nuance here is that formally distance=1 isn't guaranteed to match asserted superclasses. This should be the case in all released versions of ontologies in OBO, but it might not be guaranteed for an edit version of an ontology. E.g. I may wish to assert redundant grandparents and annotate the axiom with a different audience. Here distance would be 2, but it would still be asserted. It would be useful if pronto could fetch the asserted superclasses too.

Massive thanks to you both for making awesome incredibly useful libraries and being patient with us ontology people who have made things more complicated than they need to be!