The pipeline currently standardises taxonomy using NCBI's rankedlineage.dmp taxdump file, which only has the ranks kingdom, phylum, class, order, family, genus and species. However, the full taxonomic lineages of most NCBI sequences contain ranks outside these 7, so some taxonomic information is being lost.
Additionally, BOLD's taxonomy system includes subfamily, tribe and subspecies as ranks for some sequences, and for some of those sequences the lowest assignment is at subfamily or tribe. Fitting these into the 7 ranks would mean a sequence identified down to tribe could only be assigned to family level, discarding the finer identification.
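To make the loss concrete, here is a minimal sketch of what happens when a tribe-level record is squeezed into the 7 ranks. The function names and the example record are hypothetical illustrations, not part of the pipeline.

```python
from typing import Dict, List, Optional

# The 7 ranks described above.
SEVEN_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

Lineage = Dict[str, Optional[str]]


def collapse_to_seven(lineage: Lineage) -> Lineage:
    """Keep only the 7 standard ranks; any other rank is dropped."""
    return {rank: lineage.get(rank) for rank in SEVEN_RANKS}


def lowest_assigned(lineage: Lineage, ranks: List[str]) -> Optional[str]:
    """Return the lowest rank in `ranks` that has a name, or None."""
    for rank in reversed(ranks):
        if lineage.get(rank) is not None:
            return rank
    return None


# Illustrative record identified down to tribe only (no genus or species).
record = {
    "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
    "order": "Diptera", "family": "Drosophilidae",
    "subfamily": "Drosophilinae", "tribe": "Drosophilini",
}

collapsed = collapse_to_seven(record)
print(lowest_assigned(collapsed, SEVEN_RANKS))  # lowest rank left after collapsing
```

After collapsing, the subfamily and tribe names are gone and the record's lowest remaining assignment is family.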
In theory you could just increase the number of ranks in the database, but many sequences are unassigned at, for example, tribe. To fill those gaps you could either:
- create a dummy taxon named after the highest assigned level below the missing rank: e.g. tribe = NA, genus = Drosophila becomes tribe = tribe_Drosophila, genus = Drosophila
- create an 'unassigned' label for every sequence missing that rank: e.g. all sequences without a tribe assignment get tribe = unassigned, regardless of their higher taxonomy
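The two options can be sketched as follows. This is only an illustration under assumptions: the extended rank list, the function names and the use of None for a missing rank are hypothetical, and ranks with nothing assigned below them are left empty in option 1.

```python
from typing import Dict, Optional

# Assumed extended rank order (the 7 ranks plus subfamily and tribe).
RANKS = ["kingdom", "phylum", "class", "order", "family",
         "subfamily", "tribe", "genus", "species"]

Lineage = Dict[str, Optional[str]]


def fill_dummy(lineage: Lineage) -> Lineage:
    """Option 1: name a missing rank after the nearest assigned rank below it."""
    filled: Lineage = {}
    below: Optional[str] = None  # nearest real (non-dummy) name seen so far
    for rank in reversed(RANKS):  # walk from species up to kingdom
        name = lineage.get(rank)
        if name is not None:
            filled[rank] = name
            below = name
        elif below is not None:
            filled[rank] = f"{rank}_{below}"
        else:
            filled[rank] = None  # nothing assigned below: leave the gap
    # Rebuild in top-down rank order for readability.
    return {rank: filled[rank] for rank in RANKS}


def fill_unassigned(lineage: Lineage) -> Lineage:
    """Option 2: give every missing rank the same 'unassigned' label."""
    return {rank: lineage.get(rank) or "unassigned" for rank in RANKS}


example = {
    "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
    "order": "Diptera", "family": "Drosophilidae",
    "genus": "Drosophila", "species": "Drosophila melanogaster",
}
print(fill_dummy(example)["tribe"])
print(fill_unassigned(example)["tribe"])
```

In this sketch, option 1 produces a distinct label per child taxon, while option 2 gives unrelated sequences the same label at the missing rank, which may matter for how a classifier builds its taxonomy tree.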
Both solutions raise their own issues for taxonomic assignment. Can IDTAXA or other assignment methods handle missing ranks or dummy taxa? This is unclear, although it has been discussed in passing: https://github.com/benjjneb/dada2/issues/853