DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Warning: taxomony id doesn't exists for NZ_AJTB01000092.1! #19

Closed: sridhar0605 closed this issue 7 years ago

sridhar0605 commented 8 years ago

Has anyone come across this error so far? Both my input_sequence file and my seqid2taxa.map file contain this id, but centrifuge-build still spits out this error.

fbreitwieser commented 8 years ago

Can you show the relevant parts of your input files? Does the taxon exist in the taxonomy tree, too?

sridhar0605 commented 8 years ago

I downloaded the latest taxonomy and extracted nodes.dmp and names.dmp from it using tar -zxvf taxdump.tar.gz nodes.dmp names.dmp. First, some background on my database: from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/ I have all the genomic.fna.gz files for bacteria and viruses. I tried using kraken to build a database, but it failed badly due to memory issues, even on an Amazon EC2 instance with 2TB of RAM, so I wanted to use centrifuge, which suits my analysis perfectly. So far I have concatenated everything into a single fna file and used the seqid2taxa.map file from kraken as the initial inputs to centrifuge, since centrifuge does not download all the bacterial files. I modified the fasta headers as centrifuge requires, keeping just the sequence id and description. However, I end up getting this error:

Warning: taxomony id doesn't exists for NZ_AJTB01000101.1!

and then this one too: Warning: Taxonomy ID 1527292 is not in the provided taxonomy tree (taxonomy/nodes.dmp)! I then tried the nodes.dmp and names.dmp files from the kraken output, still with no success.

fbreitwieser commented 8 years ago

This record has been removed from the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore/NZ_AJTB01000092.1). Usually we detect these cases by missing entries in the taxonomy dump - which I think is the case here. Note that the assembly_summary and taxonomy are not always in sync.

sridhar0605 commented 8 years ago

That is the issue: I am not using assembly_summary as my backbone. I am trying to build the index with all available sequences (plasmids, contigs, scaffolds), in all around 42080 species for bacteria and 5654 for viruses.

sridhar0605 commented 8 years ago

Is there any solution for this?

mourisl commented 8 years ago

Can you show us the line for NZ_AJTB01000101.1 in the seqid2taxa.map file and lines around it? Is the corresponding tax id (1527292) in the nodes.dmp and names.dmp?
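The check asked for above can be sketched in a few lines of Python. This is only a minimal illustration, not part of Centrifuge: it assumes the map file has whitespace-separated `seqid taxid` columns and that nodes.dmp uses the NCBI dump layout in which the tax id is the first `|`-separated field. The function names and the `SEQ_A` id are hypothetical.

```python
def load_node_taxids(nodes_lines):
    """Collect tax ids from nodes.dmp lines (first '|'-separated field)."""
    taxids = set()
    for line in nodes_lines:
        line = line.strip()
        if line:
            taxids.add(line.split("|", 1)[0].strip())
    return taxids


def find_unmapped(map_lines, node_taxids):
    """Return (seqid, taxid) pairs whose taxid is absent from the taxonomy."""
    missing = []
    for line in map_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        seqid, taxid = fields[0], fields[1]
        if taxid not in node_taxids:
            missing.append((seqid, taxid))
    return missing


# Tiny in-memory example; with real files, pass open("nodes.dmp") and
# open("seqid2taxa.map") instead of these lists.
nodes = ["1\t|\t1\t|\tno rank\t|\n", "2\t|\t131567\t|\tsuperkingdom\t|\n"]
mapping = ["SEQ_A\t2\n", "NZ_AJTB01000101.1\t1527292\n"]  # SEQ_A is made up
print(find_unmapped(mapping, load_node_taxids(nodes)))
```

Any pair it reports is a sequence whose tax id would trigger the "not in the provided taxonomy tree" warning.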

sridhar0605 commented 8 years ago

Since I did not follow your online manual, I wrote my own script to build the seqid2taxa.map (I took every accession id from the fasta headers and got the tax id from NCBI). And yes, @fbreitwieser was right: the record has been removed from the database, and hence it is not in nodes.dmp. So the next question is: why is it still in the fasta files on the RefSeq site, and how do I work around this issue to build the centrifuge index?

fbreitwieser commented 7 years ago

The thing is that RefSeq and the taxonomy database are not always in the same state. In Centrifuge, sequences with no mapping get added to the database with taxonomy ID 0, though maybe we should just skip them. But the database should build without problems, even if a mapping is missing.
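If you would rather skip such sequences yourself before running centrifuge-build, one option is to filter the fasta file first. The sketch below is a user-side preprocessing step, not something Centrifuge does. It assumes the sequence id is the first whitespace-delimited token after `>` in each header, and that `keep_ids` holds the ids known to have a valid tax id. The function name and the example ids are hypothetical.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Yield only the fasta records whose sequence id is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # A record is kept only if its header id is in the allowed set.
            keep = bool(tokens) and tokens[0] in keep_ids
        if keep:
            yield line


# Tiny in-memory example; with real data, iterate over an open file handle
# and write the yielded lines to a new fasta file.
fasta = [
    ">SEQ_A some description\n",
    "ACGT\n",
    ">NZ_AJTB01000101.1 removed record\n",
    "GGCC\n",
    "TTAA\n",
]
kept = list(filter_fasta(fasta, {"SEQ_A"}))
print("".join(kept), end="")
```

The filtered file then contains only sequences that have a tax id in the taxonomy tree, so nothing ends up under taxonomy ID 0.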