CatalogueOfLife / general

The Catalogue of Life

TaxonID use changes regularly between COL releases? #91

Closed cboettig closed 3 years ago

cboettig commented 3 years ago

Hi team,

First off, just wanted to say thanks for creating and curating this amazing resource. It's clear an immense amount of resources, time and dedication go into COL each year, and I'm amazed by how well you've continued to provide, maintain, extend and improve this invaluable community resource. If there's anything I can do to help, please let me know.

I'm writing today to seek some clarification on COL's use of dwc:taxonID in the DWCA archive snapshots. (Again, thanks for providing stable archive snapshots in plain-text files and in DWCA format; this is easily the most convenient, flexible format and a huge win for the user community!) It appears that the dwc:taxonID used to identify any given dwc:scientificName changes between releases. For instance, the scientific name "Homo sapiens" was identified in the COL DWCA as taxonID 3048432 in 2020 but is now identified as 6MB3T. In previous years it has taken different integer values: e.g., a 2017 monthly snapshot has 27703827 and the 2013 annual release has 6850099 (see https://gist.github.com/cboettig/fc19fdb69069c44c2eaedf6089433518 for a scripted example).
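The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration, not the script from the gist: the in-memory tables stand in for the taxa file of each snapshot, the two-column layout is an assumption, and the only values used are the IDs quoted in this comment.

```python
import csv
import io

# Hypothetical stand-ins for the taxa table of each snapshot. A real archive
# would be read from disk, and its column layout may differ.
SNAPSHOTS = {
    "2013": "taxonID\tscientificName\n6850099\tHomo sapiens\n",
    "2017": "taxonID\tscientificName\n27703827\tHomo sapiens\n",
    "2020": "taxonID\tscientificName\n3048432\tHomo sapiens\n",
}


def taxon_ids_by_release(name, snapshots):
    """Return {release: [taxonID, ...]} for rows whose scientificName equals name."""
    found = {}
    for release, text in snapshots.items():
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        found[release] = [
            row["taxonID"] for row in reader if row["scientificName"] == name
        ]
    return found


ids = taxon_ids_by_release("Homo sapiens", SNAPSHOTS)
for release, values in ids.items():
    print(release, values)
```

If the identifiers were stable, every release would report the same value for the same name; here each release reports a different one.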

I recognize that this isn't necessarily "wrong", i.e. the standard at https://dwc.tdwg.org/terms/#dwc:taxonID suggests the ID need not be "global":

> An identifier for the set of taxon information (data associated with the Taxon class). May be a global unique identifier or an identifier specific to the data set.

That said, a user might expect "specific to the data set" to mean specific to COL rather than specific to a particular COL release. For example, most users are, I think, familiar with the idea that NCBI:9606 isn't the same as 9606 in ITIS (i.e. 9606 isn't a global ID either), but they are also used to it being stable within "the NCBI dataset". This raises the risk that users report taxonIDs from COL in publications or published datasets, which are later misclassified when resolved against the wrong version of COL.

I understand that precisely defining taxonomic species is difficult, as already discussed in #6. However, most serious ontologies already recognize that taxonID is distinct from [dwc:taxonConceptID](https://dwc.tdwg.org/terms/#dwc:taxonConceptID), and I believe the thorny issues raised in #6 apply mostly to taxonConceptID and not to the more mundane taxonID. In practical terms, I think users would be greatly helped by having stable identifiers for the taxonomic names used in COL, so that we can easily resolve, in the current version of COL, the taxonomic information of a species that was identified as, say, COL:6850099 in the widely used BIOTime data archives (which I believe are based on 2013 COL). This is already common practice with upstream sources like NCBI and ITIS, and with derivative sources like OpenTreeTaxonomy. (OpenTree has a nice practice of providing 'alias' ids when an identifier from a previous release is deprecated, which usually happens when it is merged into another identifier.)

At the very least, it would be great to have a clear statement as to whether taxonIDs will be stable going forward, or whether they should never be used outside the context of the data package in which they were originally released. (We get some hint of this from the fact that the COL web interface does not let us display, link, or search for the COL taxonID, but instead links us directly to the upstream provider; e.g. https://www.catalogueoflife.org/data/taxon/6MB3T points us to https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180092. But it would be great to know for sure whether 6MB3T is intended to be stable going forward or not.)

Thanks again!

mdoering commented 3 years ago

Hi @cboettig. Thanks for raising this. We have put out a blog post about the state of stable ids in COL that is worth reading: https://www.catalogueoflife.org/2021/04/14/stable-ids

In short, we moved to a completely new infrastructure last year and with that assigned a new set of identifiers (the short, non-integer ones like 6MB3T). We do provide a mapping from old IDs to new ones; see the post above.

Those new identifiers are stable across releases and never go away. They are largely name-based, but not entirely; see the post. We also still resolve deleted identifiers (e.g. because the species was wrong, duplicated, etc.), so you can be sure that https://www.catalogueoflife.org/data/taxon/6MB3T will point to Homo sapiens in the future too.
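As a rough sketch of how a consumer might use such a mapping: the table below is hypothetical and is not the layout of the real mapping files. It is keyed by release as well as by ID on the assumption that the old integer IDs changed from release to release, as described earlier in the thread, and the single entry is just the Homo sapiens example from this issue.

```python
# Hypothetical mapping from (release, legacy integer ID) to a new-style ID.
# The real mapping files may be organised differently.
LEGACY_TO_CURRENT = {("2020", "3048432"): "6MB3T"}

# Hypothetical set of identifiers known to the current release.
CURRENT_IDS = {"6MB3T"}


def resolve(taxon_id, release=None):
    """Return the current ID for taxon_id, or None if it cannot be resolved.

    New-style IDs resolve to themselves. Legacy IDs are looked up together
    with the release they were taken from, since the same integer may not
    mean the same thing in two different releases.
    """
    if taxon_id in CURRENT_IDS:
        return taxon_id
    return LEGACY_TO_CURRENT.get((release, taxon_id))


print(resolve("3048432", release="2020"))
print(resolve("6MB3T"))
```

A legacy ID given without its release cannot be resolved in this sketch, which is the practical reason for recording the release alongside any old ID that appears in a publication or dataset.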

I am not the biggest fan of long, global identifiers as the core identifier that you have to carry around in your API, query parameters and elsewhere. URI-based identifiers tend to be less stable than plain local ones because of URI rot. If you know the context is biodiversity and that these are species identifiers, prefixing them with a short acronym is my preferred solution: it provides a unique namespace while still giving short, stable IDs that you can mix with identifiers from other namespaces, e.g. COL:6MB3T and NCBI:9606. But the true identifier is still the local one. This is obviously an endless debate that our community has been having for decades already.
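To make the prefix convention concrete, here is a minimal sketch of splitting such an identifier into its namespace and local part. The `PrefixedId` type and `parse_prefixed_id` function are hypothetical names for illustration, not part of any COL API.

```python
from typing import NamedTuple


class PrefixedId(NamedTuple):
    """A namespace acronym plus the local identifier it qualifies."""

    namespace: str
    local_id: str


def parse_prefixed_id(text):
    """Split 'COL:6MB3T' into PrefixedId('COL', '6MB3T').

    Raises ValueError when the colon, the prefix, or the local part is
    missing, because a bare local ID is ambiguous without its namespace.
    """
    namespace, sep, local_id = text.partition(":")
    if not sep or not namespace or not local_id:
        raise ValueError(f"expected 'NAMESPACE:local_id', got {text!r}")
    return PrefixedId(namespace, local_id)


for raw in ("COL:6MB3T", "NCBI:9606"):
    print(parse_prefixed_id(raw))
```

The local part stays exactly as issued by its source; the prefix only says which dataset to resolve it against.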

cboettig commented 3 years ago

Thanks for this, totally made my day to realize the new identifiers are already stable. The mapping and info files are a nice touch as well. And I completely agree about the use of short, stable identifiers with prefixes -- besides, who can resist the elegance of COL:P for Plantae or COL:R for Archaea :smile_cat: