fhcrc / taxtastic

Create and maintain phylogenetic "reference packages" of biological sequences.
GNU General Public License v3.0

taxit new_database fails during node update for newer NCBI db dumps #133

Closed dhoogest closed 4 years ago

dhoogest commented 4 years ago

I haven't traced exactly when something changed on the NCBI side, but attempts to build new databases with taxit new_database from the latest taxdmp.zip currently fail during the node update as follows:

NFO ncbi 394 loading nodes
Traceback (most recent call last):
  File "/Users/dhoogest/environments/p3-env/bin/taxit", line 11, in <module>
    load_entry_point('taxtastic==0.8.11', 'console_scripts', 'taxit')()
  File "/Users/dhoogest/environments/p3-env/lib/python3.7/site-packages/taxtastic/scripts/taxit.py", line 51, in main
    return action(arguments)
  File "/Users/dhoogest/environments/p3-env/lib/python3.7/site-packages/taxtastic/subcommands/new_database.py", line 86, in action
    ncbi_loader.load_archive(zfile)
  File "/Users/dhoogest/environments/p3-env/lib/python3.7/site-packages/taxtastic/ncbi.py", line 397, in load_archive
    self.load_table('nodes', rows=nodes_rows)
  File "/Users/dhoogest/environments/p3-env/lib/python3.7/site-packages/taxtastic/ncbi.py", line 364, in load_table
    cur.executemany(cmd, itertools.islice(rows, limit))
psycopg2.errors.ForeignKeyViolation: insert or update on table "nodes" violates foreign key constraint "nodes_rank_fkey"
DETAIL:  Key (rank)=(strain) is not present in table "ranks".

dhoogest commented 4 years ago

Works with taxdmp.zip files from the ncbi archive up to 2020-06-14

dhoogest commented 4 years ago

Examining the unique ranks now present in nodes.dmp, I get this:

['no rank', 'superkingdom', 'genus', 'species', 'order', 'family',
       'subspecies', 'subfamily', 'strain', 'serogroup', 'biotype',
       'tribe', 'phylum', 'class', 'species group', 'forma', 'clade',
       'suborder', 'subclass', 'varietas', 'kingdom', 'subphylum',
       'forma specialis', 'isolate', 'infraorder', 'superfamily',
       'infraclass', 'superorder', 'subgenus', 'superclass', 'parvorder',
       'serotype', 'species subgroup', 'subcohort', 'cohort', 'genotype',
       'subtribe', 'section', 'series', 'subvariety', 'morph',
       'subkingdom', 'superphylum', 'subsection', 'pathogroup']

Comparing to the 20200614 dump, the following ranks are added:

['strain', 'serogroup', 'biotype', 'clade', 'forma specialis', 'isolate', 'serotype', 'genotype', 'subvariety', 'morph', 'pathogroup']
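
For anyone wanting to repeat this comparison, here is a minimal sketch. It assumes the standard nodes.dmp layout, in which fields are delimited by a tab, a pipe and a tab, and the third field is the rank. The function name `ranks_in_nodes` and the sample rows are illustrative only (the strain row is made up), not taken from taxtastic or from a real dump.

```python
def ranks_in_nodes(lines):
    """Return the set of rank names found in nodes.dmp-style lines."""
    ranks = set()
    for line in lines:
        # Split on the pipe delimiter and trim the surrounding tabs/newline.
        fields = [field.strip() for field in line.split("|")]
        if len(fields) > 2:
            # Field order assumed: tax_id, parent tax_id, rank, ...
            ranks.add(fields[2])
    return ranks


# Illustrative rows only; a real dump has millions of lines and more columns.
old_dump = [
    "1\t|\t1\t|\tno rank\t|\t\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\t\t|\n",
]
new_dump = old_dump + ["1234\t|\t562\t|\tstrain\t|\t\t|\n"]  # hypothetical row

added = sorted(ranks_in_nodes(new_dump) - ranks_in_nodes(old_dump))
print(added)
```

With real files, an open file handle for each nodes.dmp can be passed in place of the lists, since the function only iterates over lines.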

dhoogest commented 4 years ago

I've emailed NCBI about these. To me they all look likely to sit below species in the hierarchy, but their relationship to each other is unclear. Currently, tax_ids assigned one of the new ranks display as no rank in NCBI's tax browser, so presumably the change is new enough that their own systems are not yet up to date.

dhoogest commented 4 years ago

Response from NCBI. May as well see if they make an effort on supplying the ordered ranks, as that could determine how to proceed with modifications to ncbi.py

From: NLM Support <nlm-support@nlm.nih.gov>
Sent: Wednesday, June 17, 2020 12:08 PM
To: Dan Hoogestraat <dhoogest@uw.edu>; Noah G. Hoffman <ngh2@uw.edu>; Christopher Rosenthal <crosenth@uw.edu>
Subject: FW: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206

I will pass your message on to Taxonomy and write again if they have additional comments to share.

------------------- Original Message -------------------
From: Noah Hoffman;
Received: Wed Jun 17 2020 14:44:10 GMT-0400 (Eastern Daylight Time)
To: nlm-support@nlm.nih.gov; Dan Hoogestraat; Inbound - NLM Support; Triage Team;
Cc: Chris Rosenthal;
Subject: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206

Hi Bonnie,

I definitely think that a hierarchical list of rank names would be helpful – as Dan said we infer the ordering, but an explicit ordering would be preferable. If it's easy to include this as a standard component of the tables downloaded with the taxonomy we'd appreciate it.

Thanks a lot,

Noah

Noah G. Hoffman, MD, PhD

Associate Professor, University of Washington

Director, Informatics Division

Department of Laboratory Medicine

Box 357110

Seattle, WA 98195-7110

ngh2@uw.edu

From: Dan Hoogestraat <dhoogest@uw.edu>
Date: Wednesday, June 17, 2020 at 11:29 AM
To: NLM Support <nlm-support@nlm.nih.gov>
Cc: "Noah G. Hoffman" <ngh2@uw.edu>, Christopher Rosenthal <crosenth@uw.edu>
Subject: Re: case #CAS-555782-M7V0S6: New taxonomic ranks in taxdmp TRACKING:000271000002206

Thank you for the information, Bonnie; we’ll look forward to the publication. Our current process constructs a ‘ranks’ table from the set of ranks assigned in the nodes table, so it is not strictly necessary for you to provide a ranks table. However, I can see how it might be a useful addition to the other *dmp tables for us and other users going forward.
Best,
Dan

___________________________________________

Daniel Hoogestraat, MB (ASCP)

Department of Laboratory Medicine | Molecular Microbiology

University of Washington Medical Center Room NW177

1959 NE Pacific St Seattle, WA 98195

Phone: (206)-598-5735 | Box: 357110

http://depts.washington.edu/molmicdx

On Jun 17, 2020, 11:16 AM -0700, NLM Support <nlm-support@nlm.nih.gov>, wrote:

Hi Dan,

The following is information provided by NCBI Taxonomy staff:

"The NCBI Taxonomy is grounded in phylogenetic systematics but also uses traditional hierarchical ranks first proposed by Linnaeus in the 18th century and subsequent ones recognised in the Codes of Nomenclature. A second group of ranks comprise those that are included in NCBI classification out of practical necessity. If the rank matches no existing formally defined rank, then it was previously assigned a "no rank" value in NCBI Taxonomy. We have now made public a number of these in order to provide additional information. The first set is listed in a forthcoming publication (in review), but we can provide a table of all current ranks if it is needed. All new rank labels are hierarchical, below species, except for “clade” which can be assigned at any level in the hierarchy. The rank labels should become public as part of the NCBI Taxonomy Browser in due course as well."

I hope that this answers your question.

regards,
Bonnie L. Maidak, Ph.D.
NCBI Help Desk
DHHS/NIH/NLM/NCBI


Case Information:
Case #: CAS-555782-M7V0S6
Customer Name: Dan Hoogestraat
Customer Email: dhoogest@uw.edu
Case Created: 2020-06-16T20:11:01Z

Summary: New taxonomic ranks in taxdmp

Details:

Hello ncbi,

As of yesterday (6/15) we began to notice the following new taxonomic ranks appear in the nodes.dmp file provided via the taxonomy ftp site:

['strain', 'serogroup', 'biotype', 'clade', 'forma specialis', 'isolate', 'serotype', 'genotype', 'subvariety', 'morph', 'pathogroup']

I don't see these reflected in the Tax browser yet (nodes labelled with the new ranks seem to receive 'no rank' there), so I am interested in whether these are permanent additions to the representation of ranks in the taxonomy database and, if so, what their hierarchical relationship is to one another and to the previously defined ranks. I believe these are all likely below-species ranks.

Thanks in advance!

Dan


crosenth commented 4 years ago

Try adding the new ranks to https://github.com/fhcrc/taxtastic/blob/master/taxtastic/ncbi.py#L45

crosenth commented 4 years ago

I have an algo that derives an approximate rank order from the data:

superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, serogroup, genotype, pathogroup, biotype, morph, subsection, series, subvariety, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, section, species group, species subgroup, species, clade, subspecies, varietas, forma specialis, forma, serotype, strain, isolate
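
The algorithm itself isn't shown in this thread, so the following is only a sketch of one way such an approximate order could be derived, not the method actually used here. It assumes each lineage is available as a list of ranks from root to tip, counts how often one rank appears above another, and sorts ranks by how many others they usually sit above minus how many they usually sit below. The function name `approximate_rank_order` and the toy lineages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def approximate_rank_order(lineages, unordered=("no rank",)):
    """Order ranks by pairwise 'appears above' counts (root first)."""
    above = Counter()  # (upper, lower) -> number of lineages showing that order
    ranks = set()
    for lineage in lineages:
        ordered = [rank for rank in lineage if rank not in unordered]
        ranks.update(ordered)
        for upper, lower in combinations(ordered, 2):
            if upper != lower:
                above[(upper, lower)] += 1

    def net_score(rank):
        # +1 for each rank it is usually above, -1 for each it is usually below.
        score = 0
        for other in ranks:
            if other == rank:
                continue
            diff = above[(rank, other)] - above[(other, rank)]
            score += (diff > 0) - (diff < 0)
        return score

    # Highest score first; ties are broken alphabetically to stay deterministic.
    return sorted(ranks, key=lambda rank: (-net_score(rank), rank))


# Toy lineages, root to tip (hypothetical data).
lineages = [
    ["superkingdom", "phylum", "genus", "species"],
    ["superkingdom", "no rank", "genus", "species", "strain"],
    ["phylum", "class", "genus"],
]
order = approximate_rank_order(lineages)
print(order)
```

Scoring by pairwise majorities rather than a strict topological sort means conflicting evidence (for example a rank that appears both above and below another) yields a best guess instead of an error, which matches the "approximate" framing above.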

crosenth commented 4 years ago

"section" is interesting because its position in the rank order differs between botanical and zoological usage. We should probably wait until they publish their paper, but if we need something approximate to get new_database to run, we can use this order.

dhoogest commented 4 years ago

In case we get tired of waiting for a response from NCBI, https://github.com/fhcrc/taxtastic/pull/135 incorporates the new ranks and, in testing, works for me with the latest NCBI taxdmp.

crosenth commented 4 years ago

New PR https://github.com/fhcrc/taxtastic/pull/137

crosenth commented 4 years ago

PR Validation from @nhoffman - "Generate a taxonomy using the previous taxtastic release from a dump file just before the introduction of the error, and one using the RC using the first dumpfile showing the error, and produce a delta of the two taxonomies."

crosenth commented 4 years ago

Validation

Thank you @dhoogest, @marykstewart, @mwohl and @nhoffman for all your hard work on this Issue and PR.

Definitions

Ordered rank: rank that follows a defined and predictable hierarchy and appears only once in a lineage

Example: Tree of life: https://en.wikipedia.org/wiki/Taxonomic_rank

Unordered rank: A rank that appears throughout a lineage and can appear 1 or more times in a lineage

Example: no rank

New ordered ranks from the June 15 NCBI taxonomy dump:

biotype, genotype, morph, pathogroup, serogroup and subvariety

New unordered ranks:

clade, isolate, forma specialis, serotype and strain
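
As a rough illustration of the definitions above, the sketch below flags any rank that occurs more than once within a single lineage, which by the definition of an ordered rank cannot be ordered. It assumes lineages are given as lists of ranks from root to tip; the function name `repeated_ranks` and the sample lineages are hypothetical. Repetition is only one signal: a rank may also be treated as unordered because its position varies between lineages, which this check does not detect.

```python
from collections import Counter


def repeated_ranks(lineages):
    """Return ranks that appear more than once in at least one lineage."""
    repeated = set()
    for lineage in lineages:
        counts = Counter(lineage)
        repeated.update(rank for rank, n in counts.items() if n > 1)
    return repeated


# Hypothetical lineages, root to tip.
lineages = [
    ["superkingdom", "clade", "phylum", "clade", "species"],
    ["superkingdom", "no rank", "genus", "no rank", "species", "strain"],
]
unordered = repeated_ranks(lineages)
print(sorted(unordered))
```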

Methods

Using taxtastic v0.8.11 and a June 14, 2020 NCBI taxonomy dump, a taxtable was constructed and compared to a taxtable constructed using the taxtastic 133-new-ranks PR and an NCBI data dump from June 15, 2020. An inner join on taxonomy ids from the Bacteria lineage was used to compare new and old ranks.

Note: Genotype, morph and subvariety are not included in Bacteria lineages and not part of this validation.

The full taxtable comparison is here: combined_diffs.xlsx
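
The comparison step can be sketched as follows. This is not the script used for the validation; it assumes each taxtable has been read into rows with `tax_id` and `rank` keys (for example with `csv.DictReader`), and the function name `rank_changes` and the sample rows are illustrative only.

```python
def rank_changes(old_rows, new_rows):
    """Inner-join two taxtables on tax_id and report ranks that differ."""
    old = {row["tax_id"]: row["rank"] for row in old_rows}
    new = {row["tax_id"]: row["rank"] for row in new_rows}
    shared = old.keys() & new.keys()  # inner join: ids present in both tables
    return {
        tax_id: (old[tax_id], new[tax_id])
        for tax_id in sorted(shared)
        if old[tax_id] != new[tax_id]
    }


# Illustrative rows only, not taken from the actual taxtables.
old_rows = [
    {"tax_id": "562", "rank": "species"},
    {"tax_id": "83333", "rank": "no rank"},
    {"tax_id": "999999", "rank": "genus"},
]
new_rows = [
    {"tax_id": "562", "rank": "species"},
    {"tax_id": "83333", "rank": "strain"},
    {"tax_id": "1000001", "rank": "isolate"},
]
changes = rank_changes(old_rows, new_rows)
print(changes)
```

Ids present in only one of the two tables are dropped by the join, so taxa added or removed between the two dumps do not show up as rank changes.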

New rank order

root, superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, section, subsection, series, subseries, species_group, species_subgroup, species, subspecies, biotype, genotype, morph, pathogroup, serogroup, varietas, subvariety, forma

dhoogest commented 4 years ago

Diffs seem reasonable to me. I'm okay with merge and release of 133-new-ranks

dhoogest commented 4 years ago

Closed with 0.9.0 release