Closed dhoogest closed 4 years ago
Works with taxdmp.zip files from the NCBI archive up to 2020-06-14.
Examining the unique ranks now present in nodes.dmp, I get this:
['no rank', 'superkingdom', 'genus', 'species', 'order', 'family',
'subspecies', 'subfamily', 'strain', 'serogroup', 'biotype',
'tribe', 'phylum', 'class', 'species group', 'forma', 'clade',
'suborder', 'subclass', 'varietas', 'kingdom', 'subphylum',
'forma specialis', 'isolate', 'infraorder', 'superfamily',
'infraclass', 'superorder', 'subgenus', 'superclass', 'parvorder',
'serotype', 'species subgroup', 'subcohort', 'cohort', 'genotype',
'subtribe', 'section', 'series', 'subvariety', 'morph',
'subkingdom', 'superphylum', 'subsection', 'pathogroup']
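For anyone wanting to reproduce a list like the one above, here is a minimal sketch (not the code used in this thread). It assumes the standard taxdump layout, where nodes.dmp fields are separated by tab-pipe-tab and the rank is the third field. The function name `unique_ranks` and the sample rows are illustrative only, and the rows are truncated to three fields.

```python
import io


def unique_ranks(lines):
    """Return the distinct ranks found in nodes.dmp rows, in order of first appearance."""
    seen = {}
    for line in lines:
        # Rows look like "tax_id\t|\tparent_id\t|\trank\t|\t...\t|\n".
        fields = line.rstrip("\t|\n").split("\t|\t")
        seen.setdefault(fields[2], None)
    return list(seen)


# Illustrative rows, truncated to the first three fields.
SAMPLE = (
    "1\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
    "562\t|\t561\t|\tspecies\t|\n"
    "561\t|\t543\t|\tgenus\t|\n"
    "83333\t|\t562\t|\tstrain\t|\n"
    "1423\t|\t653685\t|\tspecies\t|\n"
)

print(unique_ranks(io.StringIO(SAMPLE)))
```

With a real dump, the same function could be given an open file handle on nodes.dmp extracted from taxdmp.zip.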
Comparing to the 20200614 dump, the following ranks are added:
['strain', 'serogroup', 'biotype', 'clade', 'forma specialis', 'isolate', 'serotype', 'genotype', 'subvariety', 'morph', 'pathogroup']
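The comparison between two dumps amounts to an order-preserving set difference. A small sketch, with a hypothetical helper name `added_ranks` and abbreviated rank lists rather than the full ones:

```python
def added_ranks(old_ranks, new_ranks):
    """Return ranks present in new_ranks but absent from old_ranks, keeping new_ranks order."""
    old = set(old_ranks)
    return [rank for rank in new_ranks if rank not in old]


# Abbreviated, illustrative inputs (not the complete rank lists).
old = ["no rank", "superkingdom", "genus", "species", "subspecies"]
new = ["no rank", "superkingdom", "genus", "species", "subspecies", "strain", "clade"]

print(added_ranks(old, new))
```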
I've emailed NCBI about these, which all look to me likely to be below species in the hierarchy but of unclear relationship to each other. Currently, tax_ids assigned one of the new ranks display as no rank on NCBI's tax browser, so this is presumably a new enough change that their systems are also not yet up to date.
Response from NCBI. May as well see if they make an effort on supplying the ordered ranks, as that could determine how to proceed with modifications to ncbi.py
From: NLM Support <nlm-support@nlm.nih.gov>
Sent: Wednesday, June 17, 2020 12:08 PM
To: Dan Hoogestraat <dhoogest@uw.edu>; Noah G. Hoffman <ngh2@uw.edu>; Christopher Rosenthal <crosenth@uw.edu>
Subject: FW: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206
I will pass your message on to Taxonomy and write again if they have additional comments to share.
------------------- Original Message -------------------
From: Noah Hoffman;
Received: Wed Jun 17 2020 14:44:10 GMT-0400 (Eastern Daylight Time)
To: nlm-support@nlm.nih.gov; Dan Hoogestraat; Inbound - NLM Support; Triage Team;
Cc: Chris Rosenthal;
Subject: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206
Hi Bonnie,
I definitely think that a hierarchical list of rank names would be helpful – as Dan said we infer the ordering, but an explicit ordering would be preferable. If it's easy to include this as a standard component of the tables downloaded with the taxonomy we'd appreciate it.
Thanks a lot,
Noah
Noah G. Hoffman, MD, PhD
Associate Professor, University of Washington
Director, Informatics Division
Department of Laboratory Medicine
Box 357110
Seattle, WA 98195-7110
ngh2@uw.edu
From: Dan Hoogestraat <dhoogest@uw.edu>
Date: Wednesday, June 17, 2020 at 11:29 AM
To: NLM Support <nlm-support@nlm.nih.gov>
Cc: "Noah G. Hoffman" <ngh2@uw.edu>, Christopher Rosenthal <crosenth@uw.edu>
Subject: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206
Thank you for the information Bonnie, we’ll look forward to the publication. Our current process constructs a ‘ranks’ table from the set of assigned ranks in the nodes table, so not per-se necessary from your side to provide the ranks table. However I could see where that might be a useful addition to the other *dmp tables for us and other users going forward.
Best,
Dan
___________________________________________
Daniel Hoogestraat, MB (ASCP)
Department of Laboratory Medicine | Molecular Microbiology
University of Washington Medical Center Room NW177
1959 NE Pacific St Seattle, WA 98195
Phone: (206)-598-5735 | Box: 357110
http://depts.washington.edu/molmicdx
On Jun 17, 2020, 11:16 AM -0700, NLM Support <nlm-support@nlm.nih.gov>, wrote:
Hi Dan,
The following is information provided by NCBI Taxonomy staff:
"The NCBI Taxonomy is grounded in phylogenetic systematics but also uses traditional hierarchical ranks first proposed by Linneaus in the 18th century and subsequent ones recognised in the Codes of Nomenclature. A second group of ranks comprise those that are included in NCBI classification out of practical necessity. If the rank matches no existing formally defined rank, then it was previously assigned a "no rank" value in NCBI Taxonomy. We have now made public a number of these in order to provide additional information. The first set is listed in a forthcoming publication (in review), but we can provide a table of all current ranks if it is needed. All new rank labels are hierachical, below species, except for “clade” which can be assigned at any level in the hierarchy. The rank labels should become public as part of the NCBI Taxonomy Browser in due course as well."
I hope that this answers your question.
regards,
Bonnie L. Maidak, Ph.D.
NCBI Help Desk
DHHS/NIH/NLM/NCBI
****
* PLEASE DO NOT MODIFY THE SUBJECT LINE OF THIS EMAIL WHEN RESPONDING TO ENSURE CORRECT TRACKING *
Case Information:
Case #: CAS-555782-M7V0S6
Customer Name: Dan Hoogestraat
Customer Email: dhoogest@uw.edu
Case Created: 2020-06-16T20:11:01Z
Summary: New taxonomic ranks in taxdmp
Details:
Hello ncbi,
As of yesterday (6/15) we began to notice the following new taxonomic ranks appear in the nodes.dmp file provided via the taxonomy ftp site:
['strain', 'serogroup', 'biotype', 'clade', 'forma specialis', 'isolate', 'serotype', 'genotype', 'subvariety', 'morph', 'pathogroup']
I don't see these reflected in the Tax browser yet (nodes there seem to receive 'no rank' when labelled with the new ranks), so I am interested in whether these are permanent additions to the representation of ranks in the taxonomy database and, if so, what their hierarchical relationship is to one another and to the previously defined ranks. I believe these are likely all below-species ranks.
Thanks in advance!
Dan
Try adding the new ranks to https://github.com/fhcrc/taxtastic/blob/master/taxtastic/ncbi.py#L45
I have an algo that derives an approximate rank order from the data:
superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, serogroup, genotype, pathogroup, biotype, morph, subsection, series, subvariety, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, section, species group, species subgroup, species, clade, subspecies, varietas, forma specialis, forma, serotype, strain, isolate
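The algorithm behind the order above is not shown in this thread. As one possible approach (an assumption, not the author's method), each ranked node can vote that its nearest ranked ancestor sits above it, and the winning votes can then be sorted topologically. The sketch below uses `graphlib` (Python 3.9+). The name `approximate_rank_order`, the `UNORDERED` set and the tax_ids in the sample are all hypothetical.

```python
from collections import Counter
from graphlib import TopologicalSorter

# Ranks that can occur anywhere in a lineage and so carry no ordering information.
UNORDERED = {"no rank", "clade"}


def approximate_rank_order(nodes):
    """nodes maps tax_id -> (parent_id, rank); returns ranks from highest to lowest."""

    def ranked_ancestor(tax_id):
        # Walk towards the root, skipping unordered ranks.
        parent_id = nodes[tax_id][0]
        while parent_id != tax_id:
            rank = nodes[parent_id][1]
            if rank not in UNORDERED:
                return rank
            tax_id = parent_id
            parent_id = nodes[tax_id][0]
        return None  # reached the root without finding an ordered rank

    votes = Counter()
    for tax_id, (_, rank) in nodes.items():
        if rank in UNORDERED:
            continue
        above = ranked_ancestor(tax_id)
        if above is not None and above != rank:
            votes[(above, rank)] += 1

    # Keep an edge only if it outvotes the opposite direction.
    graph = {}
    for (above, below), count in votes.items():
        graph.setdefault(above, set())
        graph.setdefault(below, set())
        if count > votes[(below, above)]:
            graph[below].add(above)  # `above` must precede `below`
    return list(TopologicalSorter(graph).static_order())


# Made-up tax_ids forming a single lineage.
NODES = {
    1: (1, "no rank"),
    2: (1, "superkingdom"),
    10: (2, "phylum"),
    20: (10, "clade"),
    30: (20, "genus"),
    40: (30, "species"),
    50: (40, "strain"),
}

print(approximate_rank_order(NODES))
```

On real data the votes only define a partial order, so the result is one compatible linearization among possibly many, and ranks that rarely co-occur in a lineage may land in surprising positions. Conflicting votes that form a cycle would raise `graphlib.CycleError` and would need separate handling.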
"section" is interesting because its position in the rank order differs between botanical and zoological usage. We should probably wait until they publish their paper, but if we need something approximate to get new_database to run, we can use this order.
In case we get tired of waiting for a response from NCBI, https://github.com/fhcrc/taxtastic/pull/135 incorporates the new ranks and, in testing, works for me with the latest NCBI taxdmp.
PR Validation from @nhoffman - "Generate a taxonomy using the previous taxtastic release from a dump file just before the introduction of the error, and one using the RC using the first dumpfile showing the error, and produce a delta of the two taxonomies."
Thank you @dhoogest, @marykstewart, @mwohl and @nhoffman for all their hard work on this Issue and PR.
Ordered rank: rank that follows a defined and predictable hierarchy and appears only once in a lineage
Example: Tree of life: https://en.wikipedia.org/wiki/Taxonomic_rank
Unordered rank: A rank that appears throughout a lineage and can appear 1 or more times in a lineage
Example: no rank
New ordered ranks: biotype, genotype, morph, pathogroup, serogroup and subvariety
New unordered ranks: clade, isolate, forma specialis, serotype and strain
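The "appears only once in a lineage" criterion can be checked mechanically. A minimal sketch, assuming lineages are available as root-to-leaf lists of ranks; the helper name `repeated_ranks` and the sample lineages are illustrative:

```python
from collections import Counter


def repeated_ranks(lineages):
    """Return the ranks that occur more than once within at least one lineage."""
    repeated = set()
    for lineage in lineages:
        counts = Counter(lineage)
        repeated.update(rank for rank, n in counts.items() if n > 1)
    return repeated


# Illustrative lineages, listed from root to leaf.
lineages = [
    ["no rank", "superkingdom", "clade", "phylum", "clade", "species", "strain"],
    ["no rank", "no rank", "superkingdom", "phylum", "species", "subspecies"],
]

print(sorted(repeated_ranks(lineages)))
```

A rank that repeats within a lineage cannot be treated as ordered. The converse does not hold: a rank that never repeats in a sample may still lack a fixed position in the hierarchy.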
Using taxtastic v0.8.11 and a June 14, 2020 NCBI taxonomy dump, a taxtable was constructed and compared to a taxtable constructed using the taxtastic 133-new-ranks PR and an NCBI data dump from June 15, 2020. An inner join of taxonomy ids from the Bacteria lineage was used to compare new and old ranks.
Note: Genotype, morph and subvariety are not included in Bacteria lineages and not part of this validation.
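A comparison of this kind could be sketched as follows. It is not the script used for the validation. It assumes both taxtables are CSV files with `tax_id` and `rank` columns, and the function name `rank_changes` and the sample rows (including the `no_rank` value) are illustrative. The sketch does not restrict the join to the Bacteria lineage.

```python
import csv
import io


def rank_changes(old_csv, new_csv):
    """Inner-join two taxtables on tax_id and list the rows whose rank differs."""
    old = {row["tax_id"]: row["rank"] for row in csv.DictReader(old_csv)}
    changes = []
    for row in csv.DictReader(new_csv):
        tax_id = row["tax_id"]
        # tax_ids present in only one table are ignored (inner join).
        if tax_id in old and old[tax_id] != row["rank"]:
            changes.append((tax_id, old[tax_id], row["rank"]))
    return changes


# Illustrative rows only; real taxtables carry more columns.
OLD = (
    "tax_id,parent_id,rank,tax_name\n"
    "562,561,species,Escherichia coli\n"
    "83333,562,no_rank,Escherichia coli K-12\n"
)
NEW = (
    "tax_id,parent_id,rank,tax_name\n"
    "562,561,species,Escherichia coli\n"
    "83333,562,strain,Escherichia coli K-12\n"
    "999999,562,strain,hypothetical new node\n"
)

print(rank_changes(io.StringIO(OLD), io.StringIO(NEW)))
```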
The full taxtable comparison is here: combined_diffs.xlsx
root, superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, section, subsection, series, subseries, species_group, species_subgroup, species, subspecies, biotype, genotype, morph, pathogroup, serogroup, varietas, subvariety, forma
Diffs seem reasonable to me. I'm okay with merge and release of 133-new-ranks
Closed with 0.9.0 release
I haven't traced back the exact point when something changed on the NCBI side; however, attempts to build new dbs using taxit new_database from the latest taxdmp.zip currently fail during node update as follows: