benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Assigning taxonomy with custom prokaryotic/eukaryotic database #1682

Closed bweiler89 closed 4 months ago

bweiler89 commented 1 year ago

I am trying to assign taxonomy using a custom reference database with 8 taxonomic levels, but many of the assignments come back wrong or missing. The dataset is fairly large, and a run on our supercomputer takes roughly 2-3 days.

The code:

library(dada2)

ref_fasta <- "/customDBs/customreferenceDB.fa"
taxa <- assignTaxonomy(stnochimera, taxLevels = c("Kingdom", "Phylum", "Sub-Phylum", "Class", "Order", "Family", "Genus", "Species"), refFasta = ref_fasta, multithread = TRUE)

However, the exported ASVs.fa and ASVs_taxonomy.tsv show very poor taxonomic assignment (255k NAs of ~700k ASVs). In some cases eukaryotes are being assigned prokaryotic taxonomy, and in others ASVs left as NA are easily matched by BLAST to Endozoicomonas sp.

Here's the format of the reference database (where all sequences are one line after header, headers include a subphylum for eukaryotic formatting):

Bacteria;Proteobacteria;Proteobacteria_Z;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_amygdali;
AACTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGC...
Bacteria;Proteobacteria;Proteobacteria_Z;Gammaproteobacteria;Enterobacterales;Pectobacteriaceae;Dickeya;Dickeya_sp.;
AGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGCAGC...
Bacteria;Actinobacteria;Actinobacteria_Z;Actinomycetota;Actinomycetales;Actinomycetaceae;Actinomycetaceae_X;unidentified;
GACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGAGTGGCGAACGGGTGAGTAATACGT...
Bacteria;Firmicutes;Firmicutes_Z;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus_equi;
GCCTAATACATGCAAGTTGACGACAGATGATACGTAGCTTGCTACAATTATCTGTAGTCGAACGGGTG...
Bacteria;Firmicutes;Firmicutes_Z;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus_porcinus;
TCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCAGAGGACAGGTGC...
Bacteria;Actinobacteria;Actinobacteria_Z;Actinomycetia;Pseudonocardiales;Pseudonocardiaceae;Saccharomonospora;Saccharomonospora_sp.;
GCGTTGTTTCCATCGCTCTACCATGCAGTCGACGCTGAGCTCAGCTTGCTGGGTGGATGAGTGGCG...
Eukaryota;Alveolata;Ciliophora;Spirotrichea;Hypotrichia;Oxytrichidae;Oxytricha;Oxytricha_granulifera;
AGTCATATGCTTGTCTCAAAGACTAAGCCATGCATGTCTAAGTATAAATGTTATACAGTGAAACTGCGA...
Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Sordariomycetes;Aschersonia;Aschersonia_placenta;
GCTTGTCTCAAAGATTAAGCCATGCATGTCTGAGTATAAGCAATTATACAGCGAAACTGCGAATGGCT...
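One cheap sanity check on a custom database like this is to confirm that every header carries the same number of semicolon-separated levels as the taxLevels vector (8 here). The following is only a minimal sketch in base R. The helper name count_levels is hypothetical, and it assumes headers are semicolon-delimited with an optional leading ">" and an optional trailing ";".

```r
# Hypothetical helper (not part of dada2): count the taxonomic levels
# in each reference header string.
count_levels <- function(headers) {
  headers <- sub("^>", "", headers)   # drop a leading FASTA ">" if present
  headers <- sub(";$", "", headers)   # drop the trailing semicolon if present
  lengths(strsplit(headers, ";", fixed = TRUE))
}

# Example with two headers taken from the post above.
example_headers <- c(
  "Bacteria;Proteobacteria;Proteobacteria_Z;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_amygdali;",
  "Eukaryota;Opisthokonta;Fungi;Ascomycota;Pezizomycotina;Sordariomycetes;Aschersonia;Aschersonia_placenta;"
)
level_counts <- count_levels(example_headers)
print(table(level_counts))
```

To run this over the whole file, something like count_levels(grep("^>", readLines(ref_fasta), value = TRUE)) should work, assuming the header lines start with ">". Any header whose count differs from 8 would be worth inspecting.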

I am looking for any suggestions as to why I cannot get solid assignments, especially for the eukaryotic sequences that are being assigned to bacteria.

benjjneb commented 1 year ago

The first thing I would check, perhaps with a subset of sequences you know are being misassigned, is whether this is a sequence orientation issue. assignTaxonomy(..., tryRC=TRUE) will also check the reverse-complement orientation of all query sequences. I have seen other cases where bacterial sequences that are reverse-complemented relative to the reference orientation sometimes get assigned to Eukaryota.
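A rough sketch of that check might look like the following. The function names compare_orientation and count_na_by_rank, and the variable tax_levels, are hypothetical and only for illustration. The sketch assumes ref_fasta and the taxLevels vector are the same as in the original call, and that seqs is a character vector holding a small subset of the ASV sequences.

```r
# Hypothetical wrapper: assign taxonomy to the same subset twice,
# once as-is and once with reverse-complement checking (tryRC = TRUE).
compare_orientation <- function(seqs, ref_fasta, tax_levels, multithread = TRUE) {
  fwd <- dada2::assignTaxonomy(seqs, refFasta = ref_fasta,
                               taxLevels = tax_levels,
                               multithread = multithread)
  rc <- dada2::assignTaxonomy(seqs, refFasta = ref_fasta,
                              taxLevels = tax_levels,
                              tryRC = TRUE, multithread = multithread)
  list(fwd = fwd, rc = rc)
}

# Hypothetical helper: number of unassigned (NA) sequences at each rank
# of a taxonomy matrix (rows = sequences, columns = ranks).
count_na_by_rank <- function(taxa) {
  colSums(is.na(taxa))
}

# Usage, assuming `stnochimera`, `ref_fasta`, and `tax_levels` already exist:
# subset_seqs <- head(dada2::getSequences(stnochimera), 1000)
# res <- compare_orientation(subset_seqs, ref_fasta, tax_levels)
# rbind(fwd = count_na_by_rank(res$fwd), rc = count_na_by_rank(res$rc))
```

If the NA counts drop sharply with tryRC = TRUE, or Kingdom-level calls change between the two runs, orientation is a likely culprit. Taking the first 1000 ASVs is an arbitrary choice here. A subset of sequences already known to be misassigned would be more informative.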