biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

ncbitaxon has the wrong contact (more generally: policy for ontology transformations) #66

Open cmungall opened 3 years ago

cmungall commented 3 years ago

I think @fbastian would be surprised to find himself as the representative for the NCBI Taxonomy

http://bioregistry.io/registry/ncbitaxon

This is likely caused by the fact that the ontology transformation and the source database are being treated as the same thing

On the OBO page about the translation: http://obofoundry.org/ontology/ncbitaxon

We take great pains to say that we don't do the awesome work of the content in the OBO rendition, we just do a mechanical transform of what the wonderful curators at NCBI do. And our PURLs are distinct from NCBI URLs. However, outside of semweb/OWL people, most people use IDs, not URIs, and an ID NCBITaxon:nnnn is neutral w.r.t. whether we are talking about the original vs a translation.

And in fact we want to use the same ID and ID space, as we are talking about the identical concept. Homo sapiens is Homo sapiens whether you want to get information from NCBI or from the ontological representation.

(To complicate matters, there is also the UMLS 'ontological' rendering, which is what is in BioPortal.)

I would say we do not in general want a situation like the one we have in PRO, where there are two different IDs for the same concept (PR:P12345 and uniprot:P12345). Some ontologists may want to make the case that these denote different things, one a class of material entity, the other a database record, but really, it's the same concept.

This situation could get worse if there are more database translations into ontologies.

I think there should be a single canonical entry for the ncbi taxonomy concepts. E.g. http://bioregistry.io/registry/ncbitaxon

The policy should be that it is always the upstream curated source that gets listed as contact/authority (here NCBI)

Then under metaregistry

We have slightly more granular representations there, to say for example that OBO and OLS are resolvers for the same translation of ncbitaxon (the OBO version): even though they resolve to different URLs, it is the same information artefact. We also say that identifiers.org and n2t.net provide the canonical original source version. And if we do want to credit people for OBO translations (which I am not sure we do), it happens at this level rather than by forcing the minting of new prefixes or conflating things.
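To make the proposal concrete, here is a minimal sketch of what such an entry could look like. It is not the Bioregistry's actual data model: the class names, the `artifact` labels (`"source"`, `"obo-translation"`) and the registry keys are all hypothetical, chosen only to illustrate one prefix, one upstream authority, and several resolvers grouped by the artefact they serve.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resolver:
    """One external registry that can resolve the prefix (hypothetical model)."""

    registry: str                  # e.g. "obofoundry", "ols", "identifiers.org"
    artifact: str                  # which information artefact it serves
    contact: Optional[str] = None  # optional credit for a translation


@dataclass
class RegistryEntry:
    """A single canonical entry; the authority is the upstream curated source."""

    prefix: str
    authority: str
    resolvers: List[Resolver] = field(default_factory=list)

    def by_artifact(self) -> Dict[str, List[str]]:
        """Group registry names by the artefact they resolve to."""
        grouped: Dict[str, List[str]] = {}
        for resolver in self.resolvers:
            grouped.setdefault(resolver.artifact, []).append(resolver.registry)
        return grouped


# Illustrative values only, following the grouping described above.
ncbitaxon = RegistryEntry(
    prefix="ncbitaxon",
    authority="NCBI",
    resolvers=[
        Resolver("obofoundry", "obo-translation"),
        Resolver("ols", "obo-translation"),
        Resolver("identifiers.org", "source"),
        Resolver("n2t", "source"),
    ],
)
```

With this shape, credit for a translation lives on the `Resolver`, so no new prefix has to be minted and the entry-level authority stays with the upstream source.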

By the way, the NCBI taxonomy is one of my favorite examples of ID chaos.

Ecology and systematics informatics people often just use the "NCBI" prefix, e.g. NCBI:7955. And why wouldn't they? NCBI doesn't have anything of interest to them outside taxonomy. While we may wonder if this is a gene ID or something else, there is no ambiguity if your world doesn't include genes

Meanwhile, eukaryote sequence-focused folks often just use 'taxon:7955'. What other taxonomy could there possibly be other than the NCBI one? But in fact, if you are in the ecology or systematics space, there are many other taxonomy databases, so 'taxon:7955' is ambiguous. For those of us whose work overlaps both communities this can be frustrating!

And even for us bioinformatics people who are not solely eukaryote focused, it's problematic to use 'taxon' as a synonym for NCBI taxon, as for microbiomes GTDB is commonly used (which we should register...)

For better or worse, we smooshed the name when making the ontology transformation: NCBITaxon. Some people took the decision to split this into NCBI_Taxon, which is not inconsistent with using `_` as a subnamespace partition... but this causes big problems for OBO PURLs, which made the poor decision to use `_` in the PURLs...

And more recently, NCBI themselves recommend people use IDs such as NCBI:txid7955 (see https://github.com/obophenotype/ncbitaxon/issues/40), which AFAIK won't resolve on any system, e.g. http://bioregistry.io/registry/ncbi
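The spellings listed above can be collapsed onto one canonical form with a small synonym table. The following is only a sketch under stated assumptions: the table, the regular expression and the function name `normalize_taxon_curie` are hypothetical, the canonical output `NCBITaxon:nnnn` is one possible choice, and mapping bare `NCBI` or `taxon` to the NCBI Taxonomy is only safe in a context where genes and other taxonomies are out of scope.

```python
import re
from typing import Optional

# Hypothetical synonym table: lowercased spellings seen in the wild -> canonical prefix.
PREFIX_SYNONYMS = {
    "ncbitaxon": "NCBITaxon",
    "ncbi_taxon": "NCBITaxon",
    "ncbi": "NCBITaxon",   # assumption: no NCBI gene IDs in scope
    "taxon": "NCBITaxon",  # assumption: NCBI is the only taxonomy in scope
}

# Prefix, then ":" or "_" as delimiter, then a numeric ID with an optional "txid" tag.
_CURIE = re.compile(r"^(?P<prefix>[A-Za-z_]+)[:_](?P<local>(?:txid)?\d+)$")


def normalize_taxon_curie(raw: str) -> Optional[str]:
    """Return a canonical 'NCBITaxon:<id>' string, or None if not recognised."""
    match = _CURIE.match(raw.strip())
    if match is None:
        return None
    prefix = PREFIX_SYNONYMS.get(match["prefix"].lower())
    if prefix is None:
        return None
    local = match["local"]
    if local.startswith("txid"):
        local = local[len("txid"):]
    return f"{prefix}:{local}"


for spelling in ["NCBI:7955", "taxon:7955", "NCBITaxon_7955", "NCBI:txid7955"]:
    print(spelling, "->", normalize_taxon_curie(spelling))
```

Anything outside the table (for example a GTDB identifier) returns `None` rather than being guessed at, which keeps the ambiguity visible instead of hiding it.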

cthoyt commented 3 years ago

All really good points. You were right to guess - the metadata in the Bioregistry is a mixture of curated content and imported/normalized content from other databases. The priority order is something like Bioregistry > OBO Foundry > OLS > Identifiers.org > N2T > Prefix Commons > ...
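As a rough illustration of how a priority order like this produces the contact in question, here is a sketch of a first-match lookup. It is not the Bioregistry's actual code: the source keys, the `resolve_field` helper and the record contents are hypothetical, and the list stops at Prefix Commons because the order above is only approximate.

```python
from typing import Any, Dict, Optional

# Assumed keys for the sources named above, highest priority first.
PRIORITY = ["bioregistry", "obofoundry", "ols", "identifiers.org", "n2t", "prefixcommons"]


def resolve_field(records: Dict[str, Dict[str, Any]], name: str) -> Optional[Any]:
    """Return the first non-empty value for `name`, walking sources by priority."""
    for source in PRIORITY:
        value = records.get(source, {}).get(name)
        if value:
            return value
    return None


# Illustrative records: no curated contact exists, so the OBO Foundry one wins.
imported = {
    "bioregistry": {},
    "obofoundry": {"contact": "fbastian"},
}
print(resolve_field(imported, "contact"))

# Adding a curated value at the top of the order overrides the imported one.
curated = dict(imported, bioregistry={"contact": "NCBI"})
print(resolve_field(curated, "contact"))
```

Under this kind of scheme, fixing the displayed contact does not require touching the OBO Foundry record; a curated value at the Bioregistry level would take precedence.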

I'm not really sure what the metadata model should be to credit both the person responsible for the content and, in a different way, the person responsible for the OBO transformation. For now, I think it makes sense to leave @fbastian in the OBO Foundry, since he could solve any problems for people coming at the ontology version of the data, but the Bioregistry isn't specific to semantic web/OWL people, so this might be misleading. Do you know a better contact? I get the feeling that most NCBI databases are very opaque with respect to who works on them and who could be contacted.

If you want to accelerate the discussion about ontology transformations, I'd be glad to submit several high profile databases converted to OBO / OWL to the OBO Foundry and create immense ensuing chaos 😄

cthoyt commented 3 years ago

See https://github.com/pyobo/examples for that pile of database to ontology conversions :)