bioinformatics-centre / kaiju

Fast taxonomic classification of metagenomic sequencing reads using a protein reference database
http://kaiju.binf.ku.dk
GNU General Public License v3.0
260 stars 68 forks source link

addTaxonNames - naming hierarchy inconsistent among sequences #4

Closed skembel closed 8 years ago

skembel commented 8 years ago

When using the addTaxonNames helper function to add the full taxon path to each sequence, the number of elements in the taxon path is inconsistent, making it very difficult to consistently identify a given taxon level when working with the data in other software (e.g. R). For example, to extract all sequences annotated as belonging to a certain family, I would parse the full taxon path into substrings based on the semicolon divider - but the position of 'family' in the full taxon path ranges from the 5th to 10th element depending on the sequence. It would be possible to extract family names by looking for the -aceae ending, but this does not work for genera or for taxonomic levels that do not have a consistent naming convention. Thus it is currently impossible to do this type of analysis where simultaneous information on multiple taxonomic levels is needed.

Similarly, the most precise taxonomic identification is always added as the final element of the full taxon path, so there is always a duplicate element added that makes it difficult to interpret whether a given taxon name is the name assigned at a given rank, or if that is simply the most precise rank available, or is in fact some other type of annotation (for example "environmental samples" is given as a taxonomic rank in some cases).

Would it be possible to add an option to force the full taxon path output from addTaxonNames to include only a given subset of taxonomic levels (e.g. phylum/class/order/family/genus/species), or to force consistency so that a given element within the taxon path always represents a given taxonomic level (e.g. if family is always the 10th element it can be extracted).

And would it be possible to add an option to suppress the duplicate reporting of the most precise taxonomic identification possible as part of the full taxon path? This could still be reported in a separate field but as it stands it is placed at the end of the full taxon path.

Here is an example of the output of addTaxonNames for a dataset analyzed using kaiju and the nr+euk database, the same issue is present when using the complete genomes and nr databases.

C M02360:6:000000000-AC61R:1:1101:22111:5800 1033 341 1033, SLQPNSGAQGEYAGLLAIRAYHRSRGEAHRKVCLIPSSAHGTNPASASMVGMDVVVVACDARGDVDVEDLR, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Rhizobiales; Bradyrhizobiaceae; Afipia; Afipia; C M02360:6:000000000-AC61R:1:1101:23444:5841 418856 72 418856, PAVLPDDQTYTHR, cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Nevskiales; Sinobacteraceae; Nevskia; Nevskia soli; Nevskia soli; C M02360:6:000000000-AC61R:1:1101:14687:5896 1248916 258 1248916, IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK,IEKESSVNEAIDRMRHSATRALLERDDVIIVASVSCLYGIGSVETYSAMTFALK, cellular organisms; Bacteria; Proteobacteria; Alphaproteobacteria; Sphingomonadales; Sphingomonadaceae; Sandarakinorhabdus; Sandarakinorhabdus sp. AAP62; Sandarakinorhabdus sp. AAP62; C M02360:6:000000000-AC61R:1:1101:25907:5899 1262833 73 1262833, DILIIGGGITGLSSAYF,DILIIGGGITGLSSAYF, cellular organisms; Bacteria; Terrabacteria group; Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; environmental samples; Clostridium sp. CAG:710; Clostridium sp. CAG:710; C M02360:6:000000000-AC61R:1:1101:17581:5925 1302620 246 1302620, DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG,DHAAETGAAIPTEPVVFMKDPSTVVGPFDEVLVPRGSTKTDWEVELGVVIG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Micrococcales; Cellulomonadaceae; Cellulomonas; Cellulomonas sp. URHD0024; Cellulomonas sp. URHD0024; C M02360:6:000000000-AC61R:1:1101:17429:5938 1016849 244 1016849, YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL,YNPETDGKELVRAFRNIPGVETSSVFALNLLQLAPGGHLGRFIIWTSSAFSAL, cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; Ascomycota; saccharomyceta; Pezizomycotina; leotiomyceta; Eurotiomycetes; Chaetothyriomycetidae; Chaetothyriales; Herpotrichiellaceae; Exophiala; Exophiala sideris; Exophiala sideris; C M02360:6:000000000-AC61R:1:1101:23427:5942 1760 122 1760,1883,1888,1892,1907,1930,1938,1963,29303,35619,39478,40318,44060,47758,49185,54571,55952,58344,66378,66874,66875,67257,67298,67373,68213,68231,68249,68268,73044,76728,78355,83656,89050,100226,105425,114687,131568,146537,159449,193462,227882,249567,285535,352211,455632,457427,498367,500153,645465,661399,1055352,1078086,1136432,1156844,1157634,1157635,1157637,1172179,1203592,1288080,1288083,1380346,1380770,1428628,1463841,1463850,1463853,1463856,1463857,1463858,1463877,1463888,1463901,1463917,1463920,1463926,1463932,1576605,1580539,1592326,1592327,1592330,1592727,1650571,1736503, GRWKAVVVGSYERGDRAVTVQRLAELADFY, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Actinobacteria; C M02360:6:000000000-AC61R:1:1101:20712:8466 290425 165 302406,310575, IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,IVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL,TIVVLTPGMYNSAYFEHTFLAQQMGVELVEGKDL, cellular organisms; Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Alcaligenaceae; Advenella; Advenella; C M02360:6:000000000-AC61R:1:1101:13179:8502 1391654 91 1391654, RQMIQAHIAKKATATVAAIPVPR,RQMIQAHIAKKATATVAAIPVPR, cellular organisms; Bacteria; Proteobacteria; delta/epsilon subdivisions; Deltaproteobacteria; Myxococcales; Sorangiineae; Labilitrichaceae; Labilithrix; Labilithrix luteola; Labilithrix luteola; C M02360:6:000000000-AC61R:1:1101:16385:8539 1120523 173 1120523, LAPNTSVTCTANYTVTQADVDSGKVTNTATATGTPPTG, cellular organisms; Bacteria; Terrabacteria group; Actinobacteria; Actinobacteria; Streptomycetales; Streptomycetaceae; Streptomyces; Streptomyces sp. MJM8645; Streptomyces sp. MJM8645;

pmenzel commented 8 years ago

When using the addTaxonNames helper function to add the full taxon path to each sequence, the number of elements in the taxon path is inconsistent

Yes, that is due to the various number of intermediate ranks used by the NCBI taxonomy, especially for Eukaryotes. Did you try to use kaijuReport -r family ... on your output? That should give summarized counts for each family.

Otherwise it would be possible to add the name of the rank to its value in addTaxonnames, e.g. in brackets after the rank, e.g. ...; Sphingomonadaceae [family]; ...

Similarly, the most precise taxonomic identification is always added as the final element of the full taxon path, so there is always a duplicate element added

That is certainly a bug and will be changed.

"environmental samples" is given as a taxonomic rank in some cases

that is correct, because some database sequences are assigned to artificial taxa called "environmental samples" in the NCBI taxonomy: example

skembel commented 8 years ago

Yes, that is due to the various number of intermediate ranks used by the NCBI taxonomy, especially for Eukaryotes. Did you try to use kaijuReport -r family ... on your output? That should give summarized counts for each family.

Yes the kaijuReport -r family option works great but does not include information about the taxonomic hierarchy for each family, so doing something like taking the subset of families that are fungi is not straightforward.

Otherwise it would be possible to add the name of the rank to its value in addTaxonnames, e.g. in brackets after the rank, e.g. ...; Sphingomonadaceae [family]; ...

This would be useful. Even better would be to offer an option to output only a given set of named taxonomic ranks, for example phylum/class/order/family/genus/species for each sequence. This is the format used for taxonomic classification output from many amplicon analysis pipelines (e.g. QIIME, mothur) and would make it simple to work with the output of kaiju using tools adapted for those pipelines.

Thanks for writing the useful software and for the response.

pmenzel commented 8 years ago

Even better would be to offer an option to output only a given set of named taxonomic ranks, for example phylum/class/order/family/genus/species for each sequence.

that's also possible, I will try to implement it soon!

pmenzel commented 8 years ago

I add the option -r to addTaxonNames, which takes a comma-separated set of rank names, so that only the named ranks are printed.

For example:

addTaxonNames -r phylum,class,order,species,genus,family

will print for example:

... Bacteroidetes; Sphingobacteriia; Sphingobacteriales; Chitinophagaceae; Niastella; Niastella koreensis;
... Firmicutes; Clostridia; Clostridiales; Clostridiaceae; Clostridium; Clostridium botulinum; 
... Proteobacteria; Gammaproteobacteria; Alteromonadales; Alteromonadaceae; Alishewanella; Alishewanella sp. WH16-1;