biopragmatics / bioregistry

📮 An integrative registry of biological databases, ontologies, and nomenclatures.
https://bioregistry.io
MIT License

Resolve ncbitaxon on primary provider website #1044

Open bgyori opened 5 months ago

bgyori commented 5 months ago

In most applications it would be useful to resolve ncbitaxon IDs to the NCBI website, e.g., https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9606, since NCBI is the primary provider of these IDs. Currently, https://bioregistry.io/ncbitaxon:9606 first resolves to http://purl.obolibrary.org/obo/NCBITaxon_9606 and then to https://ontobee.org/ontology/NCBITaxon?iri=http://purl.obolibrary.org/obo/NCBITaxon_9606, a third-party provider. I suspect that the choice of the OBO PURL here is motivated by URI-based identification rather than web-based resolution concerns. Still, could we make the NCBI website the default resolver?

cthoyt commented 2 months ago

This has come up many times and I have just spent some more time thinking about this. There are a few possible solutions:

  1. Simply change the uri_format in the NCBI Taxonomy record. This will get the job done, but has the drawback that the default exported Bioregistry prefix map will then have a non-OBO PURL in it. In the past, not having OBO PURLs show up in all places has been a point of friction for adoption by the OBO community, and changing this would probably erode trust.
  2. Update the configuration of the OBO PURL system. That's external to Bioregistry, and I'm not sure what the consequences would be.
  3. Hack a field for URL resolution, similar to uri_format, into the Bioregistry data model that is only considered during resolution. This might also motivate having a dichotomy between functions for getting IRIs and functions for getting URLs, each baking in some assumptions about what qualities the results have. This will make the data model harder for both curators and maintainers to understand, and harder to decide where this value should be considered.
  4. More carefully extend the provider data model to incorporate annotations on whether a URI format string is meant for RDF, for resolution, or both. This will be quite a bit of effort, as it appears 803/1,768 (45.4%) records in the Bioregistry have explicit URI format string annotations.
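
To make option 4 more concrete, here is a minimal sketch of what an annotated URI format string could look like. It does not use the Bioregistry package or reflect its actual data model: the `Purpose`, `UriFormat`, and `pick` names are hypothetical and exist only for illustration. The two format strings are built from the URLs quoted earlier in this thread.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Purpose(Flag):
    """Hypothetical annotation: what a URI format string is meant for."""

    RDF = auto()
    RESOLUTION = auto()


@dataclass(frozen=True)
class UriFormat:
    """Hypothetical annotated format string; "$1" stands for the local identifier."""

    template: str
    purpose: Purpose


def pick(formats, purpose):
    """Return the first template annotated with the given purpose, or None."""
    for fmt in formats:
        if purpose in fmt.purpose:
            return fmt.template
    return None


# Illustrative record for ncbitaxon, not the actual Bioregistry entry
NCBITAXON = [
    UriFormat("http://purl.obolibrary.org/obo/NCBITaxon_$1", Purpose.RDF),
    UriFormat(
        "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1",
        Purpose.RESOLUTION,
    ),
]

# The same identifier yields an IRI for RDF and a URL for web resolution
rdf_iri = pick(NCBITAXON, Purpose.RDF).replace("$1", "9606")
web_url = pick(NCBITAXON, Purpose.RESOLUTION).replace("$1", "9606")
print(rdf_iri)
print(web_url)
```

A format string that serves both purposes would be annotated with `Purpose.RDF | Purpose.RESOLUTION`, so records that do not need the distinction would only carry a single entry.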

Code that counts the number of URI format string annotations:

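import bioregistry

# Fetch the records once and reuse the list for both counts
resources = bioregistry.resources()
total = len(resources)
# Count records that have an explicitly curated URI format string
count = sum(r.uri_format is not None for r in resources)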
print(f"There are {count}/{total} ({count/total:.1%}) records with explicit URI format strings")

matentzn commented 2 months ago

If NCBI could weigh in, we could probably change the resolver of the OBO PURL to point to the NCBI resource. It's a bit awkward, as some people might expect information about the ontology when looking this up, but it's probably OK.

What is the concern with doing the same as was done for NCIT?

"uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
"uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1"

Is it that tooling (curies) does not respect the uri_format_rdf slot?
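
For reference, a minimal sketch of how the two-slot layout could behave if tooling did respect it: the exported prefix map prefers uri_format_rdf and falls back to uri_format, while web resolution only looks at uri_format. This is an assumption about the desired behavior, not how Bioregistry or curies are actually implemented. The `prefix_map` and `resolve` helpers and the plain-dict records are hypothetical, and the ncbitaxon entry is a proposed configuration, not the current record.

```python
def prefix_map(records):
    """Build a prefix map, preferring uri_format_rdf over uri_format."""
    result = {}
    for prefix, record in records.items():
        template = record.get("uri_format_rdf") or record.get("uri_format")
        # Only templates that end in "$1" can be written as a plain URI prefix
        if template and template.endswith("$1"):
            result[prefix] = template[: -len("$1")]
    return result


def resolve(records, curie):
    """Expand a CURIE to a web URL using the uri_format slot only."""
    prefix, identifier = curie.split(":", 1)
    return records[prefix]["uri_format"].replace("$1", identifier)


RECORDS = {
    # Copied from the NCIT configuration quoted above
    "ncit": {
        "uri_format": "https://ncit.nci.nih.gov/ncitbrowser/ConceptReport.jsp?dictionary=NCI%20Thesaurus&code=$1",
        "uri_format_rdf": "http://purl.obolibrary.org/obo/NCIT_$1",
    },
    # Hypothetical: what ncbitaxon could look like with the same layout
    "ncbitaxon": {
        "uri_format": "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=$1",
        "uri_format_rdf": "http://purl.obolibrary.org/obo/NCBITaxon_$1",
    },
}

print(prefix_map(RECORDS)["ncbitaxon"])
print(resolve(RECORDS, "ncbitaxon:9606"))
```

Under these assumptions the prefix map would keep the OBO PURL for ncbitaxon, while resolving ncbitaxon:9606 would land on the NCBI website.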