bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

addTaxonNames after merging kaiju outputs: which nodes and names files to use? #124

Closed olivierrue closed 5 years ago

olivierrue commented 5 years ago

I ran kaiju with two different reference indexes (nr_euk and rvdb) and then merged the two outputs. Finally, I want to add taxon names to my output with addTaxonNames. However, this tool takes only a single names.dmp and a single nodes.dmp. Consequently, some taxa are not found if I use the nr_euk files, and other taxa are missing if I use the rvdb files. How can I merge two different nodes.dmp and names.dmp files to avoid these problems?

pmenzel commented 5 years ago

Merging the dmp files sounds troublesome. If only a few taxa are affected, you can add them manually to the other names.dmp file (and make sure their taxon IDs are also contained in nodes.dmp).

I guess that some taxon IDs used in the rvdb database have since been removed from the current taxonomy files?
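
A minimal sketch for finding which taxa would need to be added by hand: it lists the taxon IDs that are assigned in a Kaiju output file but have no entry in a given nodes.dmp. The function name `missing_taxids` and the file names are placeholders, not part of Kaiju. It assumes the default tab-separated Kaiju output (C/U flag, read name, taxon ID) and the NCBI .dmp layout, where the taxon ID is the first tab-separated field.

```shell
# Print each taxon ID assigned in a Kaiju output file that has no entry in nodes.dmp.
# Assumption: Kaiju output columns are C/U flag, read name, taxon ID (tab-separated).
# Usage: missing_taxids KAIJU_OUT NODES_DMP   (hypothetical helper, not a Kaiju tool)
missing_taxids() {
    awk -F'\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # nodes.dmp: collect known taxon IDs
        $1 == "C" && !($3 in known) && !reported[$3]++ {
            print $3                                  # report each missing ID once
        }
    ' "$2" "$1"
}

# Example call (file names are placeholders):
# missing_taxids kaiju_merged.out ./kaijudb_nr_euk/nodes.dmp
```

If the list is short, adding those IDs manually may be feasible; a long list suggests the taxonomy files and the index come from different releases.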

olivierrue commented 5 years ago

Not exactly. For example, the taxon ID 2559587 is present in all nodes.dmp files downloaded from your web server (mar, rvdb, nr, plasmids, progenomes, refseq, viruses) except nr_euk. Is it expected that this taxon ID is missing from nr_euk but present in nr?

pmenzel commented 5 years ago

The nodes.dmp and names.dmp should be the same for all downloadable kaiju index files dated 2019-06-25. I just checked the nodes.dmp files in kaiju_db_nr_euk.tgz and kaiju_db_nr.tgz, and they are identical.

Did you download the older file for the indexes from 2017-05-16 (called kaiju_index_nr_euk.tgz) instead?
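
One way to check whether two taxonomy files really differ is to compare the sets of taxon IDs in their nodes.dmp files. The sketch below is only an illustration: `taxids_only_in_first` is a made-up helper name, the paths are placeholders, and it assumes the NCBI .dmp layout with the taxon ID as the first tab-separated field.

```shell
# List taxon IDs that occur in the first nodes.dmp but not in the second.
# Usage: taxids_only_in_first NODES_A NODES_B   (hypothetical helper name)
taxids_only_in_first() {
    awk -F'\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file read: IDs of NODES_B
        !($1 in seen) { print $1 }                   # second file read: IDs missing from NODES_B
    ' "$2" "$1"
}

# Example call (paths are placeholders):
# taxids_only_in_first ./kaijudb_nr/nodes.dmp ./kaijudb_nr_euk/nodes.dmp
```

Empty output in both directions means the two files contain the same taxon IDs; it does not prove the files are byte-identical.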

olivierrue commented 5 years ago

I just checked, and you're right. I used an old version with different nodes files. Now it works perfectly! Thank you for your help.

BrGAl commented 4 years ago

Dear Peter, sorry to reopen a closed question, but I have a somewhat related issue. I've created a merged database from proGenomes v2 and RefSeq Fungi, and I had no problems creating the .fmi index file. I was planning to merge the nodes.dmp and names.dmp files of the two databases, but then I ran into this thread and, based on your suggestion, I just used the nodes.dmp and names.dmp from kaiju_db_nr_euk.tgz (2019-06-25 version). The warning I get for many taxa while running kaiju is:

Warning: Taxon ID 20845 in database is not contained in taxonomic tree.

But when I check for the taxon in the three respective files, it is present as expected, e.g.:

cat ./kaijudb_nr_euk/names.dmp | grep -w "20845" - Downloaded from the webserver
498103 | Xylariaceae sp. BCC 20845 | | scientific name |

cat ./kaijudb_refseq_fungi/names.dmp | grep -w "20845" - Downloaded through Kaiju
498103 | Xylariaceae sp. BCC 20845 | | scientific name |

cat ./kaijudb_proGenomes/names.dmp | grep -w "20845" - Downloaded from Zenodo: https://zenodo.org/record/3357977#.XjqzHxNKiw4
498103 | Xylariaceae sp. BCC 20845 | | scientific name |

Can you explain the meaning of the warning and do you think my strategy is correct? Thanks a lot for your help!

pmenzel commented 4 years ago

Your grep found the number 20845 in the name of the species, but not in the first column, which is the taxon id. Looking at the current NCBI taxonomy (from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz), we can see that taxon id 20845 is found in delnodes.dmp, i.e. it was deleted.
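
To avoid matching a number inside a species name, the lookup can be restricted to the first column. The following is a rough sketch, assuming the NCBI .dmp layout in which the taxon ID is the first tab-separated field; `taxid_rows` is a made-up helper name and the taxdump path is a placeholder.

```shell
# Print the rows of an NCBI-style .dmp file whose first column (the taxon ID)
# equals the given ID, ignoring numbers that only appear in later columns.
# Usage: taxid_rows FILE TAXON_ID   (hypothetical helper name)
taxid_rows() {
    awk -F'\t' -v want="$2" '$1 == want' "$1"
}

# Example calls (the taxdump directory is a placeholder):
# taxid_rows ./kaijudb_nr_euk/names.dmp 20845   # match on the ID column only
# taxid_rows ./taxdump/delnodes.dmp 20845       # has the ID been deleted?
```

With the names.dmp row shown earlier in the thread, a lookup for 20845 prints nothing, because that row's taxon ID is 498103.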

BrGAl commented 4 years ago

I see, thanks a lot for your reply. Would you rather run the analysis twice with the respective nodes.dmp and names.dmp and merge the outputs afterwards?

pmenzel commented 4 years ago

yeah, that's probably the most painless way