Document the governance strategy for rendering databases as OBO

Note: it is acceptable to close this and say this project is not intended to represent a consensus mapping of databases to OBO. I think that would be a wasted opportunity, and we should be moving this towards a community standard.

When OBO-izing a database, many decisions are made, either explicitly or implicitly. These have long-term consequences for us.

Identifiers: Do ECs have dashes in them?
Terminological: What gets the primary label: symbol or full name?
Metadata: What kinds of textual information are included, and what is excluded? Definitions?
Ontological: What is the relationship between a gene, an EC, and a RHEA? (see this diagram)
Isomorphism: To what extent do we retain isomorphism with the source vs. introducing additional information that is useful in an OBO context? (lots of examples here: https://github.com/obophenotype/ncbitaxon/issues)
Principles / Guidelines: What is best practice? When is OBO followed, and when is it not suitable?

Some of these decisions can be punted elsewhere; identifiers can largely be punted to bioregistry.
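To make the EC question concrete, here is a minimal sketch of the two spellings a partial EC class could take, with and without trailing dashes. The function names and the accepted pattern are assumptions for illustration only; this is not how bioregistry or pyobo handle ECs, and it ignores any other identifier variants the source may use.

```python
import re

# Assumption: an EC-like identifier is one to four dot-separated fields,
# each either a number or a dash placeholder.
_EC_FIELD = re.compile(r"^(\d+|-)$")


def strip_ec_dashes(ec: str) -> str:
    """Return the dash-free spelling, e.g. '1.1.-.-' becomes '1.1'."""
    fields = ec.split(".")
    if not 1 <= len(fields) <= 4 or not all(_EC_FIELD.match(f) for f in fields):
        raise ValueError(f"not an EC-like identifier: {ec!r}")
    # Drop trailing dash placeholders only.
    while fields and fields[-1] == "-":
        fields.pop()
    # A dash before a number (or nothing but dashes) is not meaningful here.
    if not fields or "-" in fields:
        raise ValueError(f"not an EC-like identifier: {ec!r}")
    return ".".join(fields)


def pad_ec_dashes(ec: str) -> str:
    """Return the four-field spelling, e.g. '1.1' becomes '1.1.-.-'."""
    fields = strip_ec_dashes(ec).split(".")
    return ".".join(fields + ["-"] * (4 - len(fields)))
```

Whichever spelling is chosen, the point is that the choice is written down once and applied consistently, rather than being made implicitly by each ingest script.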

As far as general principles go, I believe there is a general (if unstated) best effort here to conform to OBO principles. However, as per slide 43 from my databases as OBO deck, I don't think it makes sense to blindly apply OBO principles to OBOized databases. I think there are probably many lessons learned from existing efforts in the obo-db-ingest project that could be explicitly articulated as parallel guidelines.

I think there are some very practical OBO principles that should be adhered to. For example, entries should have labels as far as possible: https://github.com/biopragmatics/pyobo/issues/169
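A guideline like this is easy to check mechanically. The sketch below is one possible shape for such a check; the `Term` class and the `example:` CURIEs are hypothetical stand-ins and are not the pyobo data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Term:
    """Hypothetical stand-in for one entry of an OBO-ized database."""

    curie: str
    label: Optional[str] = None


def missing_labels(terms: Iterable[Term]) -> List[str]:
    """Return the CURIEs of entries whose label is absent or blank."""
    return [t.curie for t in terms if not (t.label and t.label.strip())]


terms = [
    Term("example:0001", "alcohol dehydrogenase"),
    Term("example:0002"),          # no label at all
    Term("example:0003", "   "),   # whitespace-only label
]
print(missing_labels(terms))
```

A report like this could be produced per resource, so that "as far as possible" becomes a visible number rather than an intention.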

Some modeling decisions are potentially very impactful because they constrain how the OBO products can be used. For example, using an RO relation to link two entities (e.g. https://github.com/pyobo/pyobo/pull/168#issuecomment-1918097635) basically injects a superclass into both sides of the relation. You are making a statement on behalf of the resource about what kind of thing its entries represent. This is a form of axiom injection.
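As a rough illustration of what I mean by axiom injection: if a relation declares a domain and a range, then every use of it implies a type for its subject and object. The relation, class names, and identifiers below are made up for the example and are not real RO or OBO terms.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumption: a hypothetical relation with a declared (domain, range).
RELATION_SIGNATURES: Dict[str, Tuple[str, str]] = {
    "ex:has_gene_product": ("ex:Gene", "ex:Protein"),
}


def injected_types(
    triples: Iterable[Tuple[str, str, str]],
    signatures: Dict[str, Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Collect the classes each entity is implied to belong to by the relations used on it."""
    inferred: Dict[str, Set[str]] = {}
    for subject, predicate, obj in triples:
        if predicate not in signatures:
            continue  # relation without a declared domain/range injects nothing
        domain, range_ = signatures[predicate]
        inferred.setdefault(subject, set()).add(domain)
        inferred.setdefault(obj, set()).add(range_)
    return inferred


triples = [("db:123", "ex:has_gene_product", "db:456")]
print(injected_types(triples, RELATION_SIGNATURES))
```

Neither `db:123` nor `db:456` was typed by the source database, yet after the ingest one is implied to be a gene and the other a protein. That is the kind of statement we would be making on the resource's behalf.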

Of course, this is necessary to some extent to make the resulting OBO usable, and we already do something similar in biolink. For example, these are the ID prefixes that biolink considers acceptable for a Gene: https://biolink.github.io/biolink-model/Gene/#valid-id-prefixes

These modeling decisions will be some of the harder ones (just look at the COB repo, and the dreaded D** discussion). But I think we should be very practical here and not get bogged down.

However, it would be good to have some decision process. It's good to move fast, but we don't want to build up technical debt.
