benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

AssignTaxonomy with sequences less than 50 nt #601

Closed Biancabrown closed 5 years ago

Biancabrown commented 5 years ago

Hello,

I am doing a plant metabarcoding project, and I am interested in using the assignTaxonomy function to assign taxa from a personal plant reference database to sequences ranging from 10 to 150 nt. The 10 nt sequences are characteristic of plant datasets, so they are important to the final analysis.

The problem is that the assignTaxonomy function doesn't work on sequences shorter than 50 nt. Is there a way to relax this limit to accommodate shorter sequences, or a way to work around it within DADA2? If there is no workaround in DADA2, could you please explain the reason behind this strict cutoff and suggest an alternative? I know I can use other programs such as BLAST, but is there one you would recommend if I cannot do this in DADA2?

I've been using DADA2 for my microbial sequences and wanted to see whether it would work with these plant sequences. I'm assuming the principles of metabarcoding are universal, whether the sequences are microbial, fungal, or plant?

benjjneb commented 5 years ago

> The problem is that the assignTaxonomy function doesn't work on sequences shorter than 50 nt. Is there a way to relax this function to accommodate shorter sequences, or is there a way to work around this using DADA2? If there is no workaround in DADA2 can you please explain the reason behind this strict cutoff? Also, suggest an alternative. I know I can use other programs such as BLAST, but is there one you would suggest if I cannot perform this function in DADA2?

The issue here isn't with the DADA2 method per se; it's with the naive Bayesian classifier method. In short, the classifier shreds each read into 8-mers and assigns confidence to a taxonomic assignment by repeatedly drawing random subsamples of 1/8th of that pool of 8-mers. That scheme isn't effective for sequences that are too short: the best hits are questionable and the confidence values misleading. Hence the minimum length threshold enforced in our implementation (i.e. assignTaxonomy).
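To make the arithmetic concrete, here is a minimal Python sketch of the shredding and 1/8th subsampling described above. It is not the assignTaxonomy code: the function names are hypothetical, and rounding the subsample size down is an assumption made for illustration.

```python
import random

K = 8  # word size used by the classifier, as described above


def shred(seq, k=K):
    """Return every overlapping k-mer in seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def bootstrap_sample_size(seq, k=K):
    """k-mers drawn per bootstrap replicate: 1/8th of the pool.

    Rounding down is an assumption of this sketch.
    """
    return len(shred(seq, k)) // 8


def draw_replicate(seq, rng, k=K):
    """Draw one random subsample of k-mers, as one bootstrap replicate would."""
    return rng.sample(shred(seq, k), bootstrap_sample_size(seq, k))


rng = random.Random(0)
for length in (10, 50, 150):
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    print(length, "nt:", len(shred(seq)), "8-mers,",
          bootstrap_sample_size(seq), "per replicate")
```

Under these assumptions a 10 nt read yields only 3 overlapping 8-mers, so a 1/8th subsample rounds down to nothing, while a 50 nt read yields 43 8-mers and 5 per replicate, and a 150 nt read yields 143 and 17. With so few words to draw from, the bootstrap has essentially nothing to resample for the shortest reads.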

The reality is that this method isn't the right one when such short sequences are involved. BLAST of some flavor, with its E-value approach, would probably be an improvement.
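As one possible starting point, BLAST+ ships a `blastn-short` task intended for short queries. The sketch below only assembles such a command in Python. The file names, database name, and E-value cutoff are hypothetical placeholders, and it assumes a nucleotide database was already built from the reference FASTA with `makeblastdb -dbtype nucl`.

```python
import shutil
import subprocess


def blastn_short_cmd(query_fasta, db_name, evalue=1e-3, out_tsv="hits.tsv"):
    """Build a blastn command line using the short-query task.

    The default E-value cutoff is a placeholder to tune, not a recommendation.
    """
    return [
        "blastn",
        "-task", "blastn-short",  # task preset for short query sequences
        "-query", query_fasta,
        "-db", db_name,
        "-evalue", str(evalue),
        "-outfmt", "6",           # tabular output, one hit per line
        "-out", out_tsv,
    ]


def run_if_available(cmd):
    """Run the command only if blastn is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical inputs: short_asvs.fasta (queries) and plant_refs (BLAST database).
cmd = blastn_short_cmd("short_asvs.fasta", "plant_refs")
print(" ".join(cmd))
```

Very short queries can only reach modest E-values even for perfect matches, so the cutoff likely needs to be chosen with the reference database size in mind.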