Closed glajoie1 closed 7 years ago
Hello, you are right that all taxon paths should be the same length. I will try to fix it and add NA for missing levels.
Hi, please test if the issue is now resolved for your data with the new commit. Thanks!
It works perfectly - thank you for your software!
When using the updated code that allows choosing specific taxonomic levels in addTaxonNames (e.g. addTaxonNames -r phylum,class,order,family,genus,species), I obtain the correct set of taxonomic levels, but the hierarchies are often not comparable across taxa, because some higher taxonomic levels (e.g. order, family) are missing or unresolved in many taxa that still have proper genus and species names.
Example of kaiju output:
C M02360:6:000000000-AC61R:1:1101:23278:4875 59803 12 59803, GDAPLFPFGYGL, Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sphingomonas; Sphingomonas echinoides;
C M02360:6:000000000-AC61R:1:1101:8983:4922 360054 11 360054, KILVHGHRGAR, Acidobacteria; Solibacteres; Solibacterales; Bryobacter; Bryobacter aggregatus;
Here Sphingomonas echinoides is fully resolved, while Bryobacter aggregatus lacks a family. (The GenBank taxon file specifies "unclassified Solibacterales" in place of a family, with the rank label "no rank".) As a result, a given position in the taxon name vectors may correspond to different taxonomic levels in different sequences.
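To make the misalignment concrete, here is a small Python sketch that splits the two taxon paths from the example on ";" and compares the same position in each. The helper name split_path is mine, used for illustration only; it is not part of kaiju.

```python
# Taxon paths copied from the two kaiju records in the example.
sphingomonas = ("Proteobacteria; Alphaproteobacteria; Sphingomonadales; "
                "Sphingomonadaceae; Sphingomonas; Sphingomonas echinoides;")
bryobacter = ("Acidobacteria; Solibacteres; Solibacterales; "
              "Bryobacter; Bryobacter aggregatus;")


def split_path(path):
    """Split a ';'-separated taxon path into a list of names (hypothetical helper)."""
    return [name.strip() for name in path.split(";") if name.strip()]


a = split_path(sphingomonas)
b = split_path(bryobacter)

# The two paths have different lengths (6 vs 5 names).
print(len(a), len(b))

# Position 3 holds a family in one record and a genus in the other.
print(a[3], "|", b[3])
```

When these lists are parsed into columns, the fourth column therefore mixes families and genera.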
Following on an initial suggestion by skembel (issue #4), would it be possible to output an NA value when a taxonomic level is missing, so that annotations made to the same taxonomic level remain comparable when parsed into columns?
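A rough Python sketch of the behaviour I have in mind, not a claim about how kaiju works internally. It assumes the lineage is available as (rank, name) pairs, which is my own representation; the ranks assigned to the Bryobacter names are inferred from the requested -r list, and pad_lineage is a hypothetical helper.

```python
# Requested ranks, in the order given to -r in the example.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def pad_lineage(lineage, ranks=RANKS, missing="NA"):
    """Return one name per requested rank, using `missing` where a rank is absent."""
    by_rank = {rank: name for rank, name in lineage}
    return [by_rank.get(rank, missing) for rank in ranks]


# Assumed (rank, name) pairs for Bryobacter aggregatus; the "no rank" node
# stands where a family would be, so no "family" entry exists.
bryobacter = [
    ("phylum", "Acidobacteria"),
    ("class", "Solibacteres"),
    ("order", "Solibacterales"),
    ("no rank", "unclassified Solibacterales"),
    ("genus", "Bryobacter"),
    ("species", "Bryobacter aggregatus"),
]

padded = pad_lineage(bryobacter)
print("; ".join(padded) + ";")
```

With this padding every taxon path has six fields, so the family column would contain NA for Bryobacter aggregatus and the genus and species would stay in their own columns.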