bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0

addTaxonNames – Uneven taxonomic hierarchies due to missing high-level taxonomic information #11

Closed glajoie1 closed 7 years ago

glajoie1 commented 7 years ago

When using the updated code that allows choosing specific taxonomic levels in addTaxonNames (e.g. addTaxonNames -r phylum,class,order,family,genus,species), I obtain the correct set of taxonomic levels. However, hierarchies are often not comparable across taxa, because some higher taxonomic levels (e.g. order, family) are missing or unresolved in many taxa that still have proper genus and species names.

Example of kaiju output:

C M02360:6:000000000-AC61R:1:1101:23278:4875 59803 12 59803, GDAPLFPFGYGL, Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas echinoides;

C M02360:6:000000000-AC61R:1:1101:8983:4922 360054 11 360054, KILVHGHRGAR, Acidobacteria; Solibacteres; Solibacterales; Bryobacter; Bryobacter aggregatus;

Here Sphingomonas echinoides is fully resolved, while Bryobacter aggregatus lacks a family. (In place of a family, the GenBank taxon file specifies "unclassified Solibacterales" with the label "no rank".) As a result, a given position in the list of taxon names may correspond to different taxonomic levels in different sequences, so positions are not comparable.
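To make the misalignment concrete, here is a minimal Python sketch that splits the two taxon paths from the example output above into columns. The helper name `split_path` is hypothetical and only serves this illustration; it is not part of kaiju.

```python
# Taxon paths copied from the two example output lines above.
paths = [
    "Proteobacteria; Alphaproteobacteria; Sphingomonadales; "
    "Sphingomonadaceae; Sphingomonas; Sphingomonas echinoides;",
    "Acidobacteria; Solibacteres; Solibacterales; "
    "Bryobacter; Bryobacter aggregatus;",
]

def split_path(path):
    """Split a ';'-separated taxon path into names, dropping the empty tail."""
    return [name.strip() for name in path.split(";") if name.strip()]

columns = [split_path(p) for p in paths]

# The fourth column (index 3) should hold the family for every read.
for names in columns:
    print(len(names), names[3])
# prints:
# 6 Sphingomonadaceae
# 5 Bryobacter
```

The fourth column holds a family for the first read but a genus for the second, which is exactly the inconsistency described above.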

Following an initial suggestion by skembel (issue #4), would it be possible to output an NA value for a taxonomic level when that level is missing, so that annotations made down to the same taxonomic level remain comparable when parsed into columns?
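The requested behaviour could look roughly like the following Python sketch. It is only an illustration of the idea, not kaiju's implementation: `format_path`, `RANKS`, and the `(rank, name)` lineage structure are hypothetical, and the ranks assigned to the Bryobacter lineage are inferred from the example above.

```python
# Requested ranks, in output order (mirrors the -r option in the example).
RANKS = ["phylum", "class", "order", "family", "genus", "species"]

def format_path(lineage, ranks=RANKS, missing="NA"):
    """Return one name per requested rank, using `missing` where a rank is absent.

    `lineage` is a list of (rank, name) pairs from root to leaf (assumed format).
    """
    by_rank = {rank: name for rank, name in lineage if rank in ranks}
    return "; ".join(by_rank.get(rank, missing) for rank in ranks) + ";"

# Assumed lineage for Bryobacter aggregatus: the "no rank" node sits where
# a family would normally be, so no family is available.
bryobacter = [
    ("phylum", "Acidobacteria"),
    ("class", "Solibacteres"),
    ("order", "Solibacterales"),
    ("no rank", "unclassified Solibacterales"),
    ("genus", "Bryobacter"),
    ("species", "Bryobacter aggregatus"),
]

print(format_path(bryobacter))
# prints: Acidobacteria; Solibacteres; Solibacterales; NA; Bryobacter; Bryobacter aggregatus;
```

Because every path then has exactly one entry per requested rank, splitting on ";" yields the same number of columns for every read.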

pmenzel commented 7 years ago

Hello, you are right that all taxon paths should be the same length. I will try to fix it and add NA for missing levels.

pmenzel commented 7 years ago

Hi, please test if the issue is now resolved for your data with the new commit. Thanks!

glajoie1 commented 7 years ago

It works perfectly - thank you for your software!