biopragmatics / obo-db-ingest

🗄️ Conversion of biomedical nomenclatures like HGNC to OBO
https://biopragmatics.github.io/obo-db-ingest/

Decouple annotations/associations from the main release obo/owl files #13

Open cmungall opened 9 months ago

cmungall commented 9 months ago

pyobo currently includes annotations (in the sense of GO annotations, not OWL annotations) modeled as relationships (i.e., S subClassOf R some O).

An example of this is ec.obo:

[Term]
id: eccode:1.1.1.1
name: alcohol dehydrogenase
is_a: eccode:1.1.1 ! With NAD(+) or NADP(+) as acceptor
relationship: RO:0002327 GO:0004022 ! enables alcohol dehydrogenase (NAD+) activity
relationship: RO:0002351 uniprot:A0A0H2URT2 ! has member ADHE_STRPN
relationship: RO:0002351 uniprot:A0A0H2ZM56 ! has member ADHE_STRP2
[many rows deleted]

This has a number of practical and semantic disadvantages:

  1. It bloats the size (ec.obo is 14x bigger with relationships)
  2. Danger of ontological errors (this is real: the composed products will simply not work in OWL environments unless everything is modeled just so)
  3. Lack of modularity; it is harder to recompose into application-specific products (e.g. what if I want EC + just human proteins?)
  4. The product becomes stale sooner
  5. Lack of separation of concerns
  6. For associations it's important to have evidence and provenance. While this can be done in ontology formats using axiom annotations, it can get bulky and awkward. A TSV is often simpler and better
  7. Directionality issues (are links to EC distributed with uniprot? links to uniprot distributed with EC? both?)
  8. Shoreline issues (ec.obo includes all swissprot annotations but not, say, an arguably more useful set like reference proteomes for core species. Why?)
  9. It's broadly understood that distributing annotations and "contingent knowledge" in the ontology and in models like OWL is not a good strategy, see e.g. https://doi.org/10.1016/j.yjbinx.2019.100002. See also slides 51 onwards

Instead, decouple the associations / annotations / contingent knowledge. Use TSVs without OWL semantics and all its pitfalls. KGX is a good choice. Some associations are better modeled as SSSOM. By all means distribute these as .obo/.owl as well, and by all means distribute merged products too. The key is to focus on the "conceptual coat hanger", as Rector calls it, and allow people to hang their coats as they please.
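To make the split concrete, here is a rough sketch in Python (standard library only) of what decoupling could look like for the ec.obo stanza above. It is not how pyobo does or should implement this; the function names `split_obo` and `edges_to_tsv` are hypothetical. The `subject` / `predicate` / `object` column names are chosen to resemble a KGX-style edge table, but the output has not been checked against the KGX specification, and a real export would also want evidence and provenance columns.

```python
import csv
import io

# Sample stanza copied from the ec.obo excerpt in this issue.
SAMPLE = """\
[Term]
id: eccode:1.1.1.1
name: alcohol dehydrogenase
is_a: eccode:1.1.1 ! With NAD(+) or NADP(+) as acceptor
relationship: RO:0002327 GO:0004022 ! enables alcohol dehydrogenase (NAD+) activity
relationship: RO:0002351 uniprot:A0A0H2URT2 ! has member ADHE_STRPN
relationship: RO:0002351 uniprot:A0A0H2ZM56 ! has member ADHE_STRP2
"""


def split_obo(obo_text):
    """Separate the core ("coat hanger") lines from the association lines.

    Returns (core_text, edges), where edges is a list of
    (subject, predicate, object) tuples taken from the ``relationship:``
    lines of [Term] stanzas. Hypothetical helper, not part of pyobo.
    """
    core_lines = []
    edges = []
    in_term = False
    subject = None
    for line in obo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new stanza starts; its id comes on a later line.
            in_term = stripped == "[Term]"
            subject = None
        elif stripped.startswith("id:"):
            subject = stripped[len("id:"):].strip()
        if in_term and subject and stripped.startswith("relationship:"):
            # Drop the trailing "! label" comment, keep the two CURIEs.
            body = stripped[len("relationship:"):].split("!", 1)[0]
            predicate, obj = body.split()[:2]
            edges.append((subject, predicate, obj))
        else:
            core_lines.append(line)
    return "\n".join(core_lines), edges


def edges_to_tsv(edges):
    """Serialize edges as a tab-separated table with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(edges)
    return buf.getvalue()


core, edges = split_obo(SAMPLE)
tsv = edges_to_tsv(edges)
```

Here `core` keeps only the id, name, and is_a lines, while `tsv` holds one row per association. Someone who wants "EC + just human proteins" could filter the rows of the TSV and merge only those back, without touching the core file.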

In practical terms something like this:

This is less work for pyobo/obo-db-ingest overall. Sometimes you can simply say "we are only providing the coat rack today; we may get to the associations later".

cmungall commented 2 weeks ago

This is still a major impediment to reusing the fantastic work in obo-db-ingest.

E.g. here is the latest rhea ingest

[image]