biopragmatics / obo-db-ingest

🗄️ Conversion of biomedical nomenclatures like HGNC to OBO
https://biopragmatics.github.io/obo-db-ingest/

Document the governance strategy for rendering databases as OBO #11

Open cmungall opened 7 months ago

cmungall commented 7 months ago

Note: it is acceptable to close this issue and say that this project is not intended to represent a consensus mapping of databases to OBO. However, I think that would be a wasted opportunity, and we should be moving this towards a community standard.

When OBO-izing a database, many decisions are made, either explicitly or implicitly. These have long-term consequences for us.

Some of these decisions can be punted elsewhere; identifiers can largely be punted to bioregistry.

As far as general principles go, I believe there is a general (if unstated) best effort here to conform to OBO principles. However, as per slide 43 of my databases-as-OBO deck, I don't think it makes sense to blindly apply OBO principles to OBOized databases. There are probably many lessons learned from existing efforts in the obo-db-ingest project that could be explicitly articulated as a parallel set of guidelines.

Some very practical OBO principles should be adhered to; for example, entries should have labels as far as possible: https://github.com/biopragmatics/pyobo/issues/169
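A label check of this kind is easy to automate. The following is a minimal sketch, assuming each entry is a plain dict with hypothetical `id` and `label` keys; it is not PyOBO's API, and the entries are made up for illustration.

```python
def find_unlabeled(entries):
    """Return the identifiers of entries whose label is missing or blank."""
    return [
        entry["id"]
        for entry in entries
        if not (entry.get("label") or "").strip()
    ]


# Hypothetical entries, for illustration only
entries = [
    {"id": "example:1", "label": "first entry"},
    {"id": "example:2", "label": ""},
    {"id": "example:3"},
]
print(find_unlabeled(entries))
```

A report like this could be produced per ingested resource, so that missing labels are visible rather than silently tolerated.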

Some modeling decisions are potentially very impactful in terms of constraining how the OBO products are used. For example, using an RO relation to link two entities (e.g. https://github.com/pyobo/pyobo/pull/168#issuecomment-1918097635) basically injects a superclass into both sides of the relation. You are making a statement on behalf of the resource about what kind of thing its entities represent. This is a form of axiom injection.
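To make the point concrete, here is a minimal sketch of how such injection can arise, assuming it happens through a relation's declared domain and range. The relation, class names, and identifiers are hypothetical placeholders, not real RO terms.

```python
from collections import defaultdict

# Hypothetical relation declarations: relation -> (domain, range).
# These names are placeholders, not real RO identifiers.
RELATIONS = {
    "example:has_gene_product": ("example:Gene", "example:Protein"),
}


def inferred_types(triples, relations=RELATIONS):
    """Collect the classes implied for each entity by domain/range constraints."""
    types = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate not in relations:
            continue  # no declared domain/range, so nothing is implied
        domain, range_ = relations[predicate]
        types[subject].add(domain)  # the subject is implied to be in the domain
        types[obj].add(range_)      # the object is implied to be in the range
    return dict(types)


# A single link between two database entries (hypothetical identifiers)
triples = [("dbA:123", "example:has_gene_product", "dbB:456")]
print(inferred_types(triples))
```

Neither source database asserted these types itself; they follow purely from the choice of relation.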

Of course, this is necessary to some extent to make the resulting OBO usable. And we already do this to some extent in biolink. E.g. these are what biolink considers acceptable ID prefixes for a Gene: https://biolink.github.io/biolink-model/Gene/#valid-id-prefixes
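A prefix constraint like this can be sketched as a simple allow-list check. The prefix set below is an illustrative assumption, not a copy of the biolink list, and `prefix_is_valid` is a hypothetical helper rather than a biolink function.

```python
# Hypothetical allow-list; the actual list is on the biolink page linked above.
VALID_PREFIXES = {"Gene": {"HGNC", "NCBIGene"}}


def prefix_is_valid(curie, category, valid_prefixes=VALID_PREFIXES):
    """Check whether a CURIE's prefix is acceptable for the given category."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix in valid_prefixes.get(category, set())


print(prefix_is_valid("HGNC:5", "Gene"))
print(prefix_is_valid("CHEBI:15377", "Gene"))
```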

These modeling decisions will be some of the harder ones to settle (just look at the COB repo, and the dreaded D** discussion). But I think we should be very practical here and not get bogged down.

However, it would be good to have a decision process. It's good to move fast, but we don't want to build up technical debt.