benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 stand alone output and LotuS2 - DADA2 produces different outcomes #1818

Closed Balaveer closed 5 months ago

Balaveer commented 1 year ago

Hi @benjjneb,

For my amplicon sequencing analyses I usually use standalone DADA2 or QIIME 2-based DADA2. I recently came across the LotuS2 bioinformatics pipeline (https://lotus2.earlham.ac.uk), which also uses the DADA2 denoising algorithm. Reading through their manuscript raised a conceptual question. The authors mentioned that the 'not high quality reads' (midQual reads) after the quality filtering will be used for 'Backmapping onto ASVs'. My question is, doesn't it inflate the abundances of the ASVs which might be PCR or sequencing errors? Two of our collaborators ran DADA2 denoising on the same data through different pipelines, one with standalone DADA2 and the other with LotuS2. The results differ in the number of sequences, the number of ASVs, and ultimately the diversity estimates. Some of us are unsure which pipeline to use and which result is actually correct. I would appreciate your reply and suggestions. Thank you!

benjjneb commented 1 year ago

I'm not familiar with LotuS2, but from skimming the website it seems to implement a complex start-to-end pipeline with a lot of pieces outside of DADA2. That makes it difficult to identify which step might be causing the differences between stand-alone DADA2 and the final outputs of that pipeline.

> The authors mentioned that the 'not high quality reads' (midQual reads) after the quality filtering will be used for 'Backmapping onto ASVs'. My question is, doesn't it inflate the abundances of the ASVs which might be PCR or sequencing errors?

At first glance, I find this concept unappealing. My guess is that they extract the extra-high-quality reads to put through DADA2 (or the various other methods) in order to speed up those methods by reducing the amount of data they are given. The cost will be a loss of sensitivity to rarer variants, and the back-mapping approach is unlikely to help. It is often suggested that the number of reads pushed through a pipeline is a valuable metric. It is not. The quantities we are trying to measure are the relative abundances of the ASVs/OTUs/taxa. In any meaningful comparison, the number of reads is divided out at the end, either by converting to proportions or by some other, perhaps compositional, approach.
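To illustrate the point about read counts being divided out, here is a minimal Python sketch. The sample counts and ASV names are made up for illustration and do not come from DADA2 or LotuS2 output. It shows that tripling the number of retained reads leaves the proportions unchanged as long as the extra reads are distributed in the same ratios, while an uneven redistribution (one hypothetical outcome of back-mapping) does change them.

```python
def relative_abundances(counts):
    """Convert raw per-ASV read counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {asv: n / total for asv, n in counts.items()}


# Hypothetical sample: the second run retains 3x as many reads,
# assigned to the ASVs in exactly the same ratios as the first run.
run_a = {"ASV1": 600, "ASV2": 300, "ASV3": 100}
run_b = {asv: 3 * n for asv, n in run_a.items()}

# Hypothetical run with the same total as run_b, but where the extra
# reads were assigned unevenly across the ASVs.
run_c = {"ASV1": 1500, "ASV2": 1200, "ASV3": 300}

props_a = relative_abundances(run_a)
props_b = relative_abundances(run_b)
props_c = relative_abundances(run_c)

for name, props in (("run_a", props_a), ("run_b", props_b), ("run_c", props_c)):
    print(name, {asv: round(p, 3) for asv, p in props.items()})
```

Under these assumptions, `run_a` and `run_b` give the same proportions despite the threefold difference in read count, so the larger read total by itself carries no extra information about composition. Only a change in the ratios, as in `run_c`, affects the quantities being compared.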

That said, I have not looked at any of the LotuS2 steps in detail, so take everything I am saying here with the appropriate grain of salt!

Balaveer commented 1 year ago

Thank you very much for your quick reply, @benjjneb. I very much appreciate it! I'll probably test both pipelines myself on a subset of the data and see where the differences are coming from. I'll get back to you if I have any further questions.

Thanks, Balaveer
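One possible way to run the side-by-side comparison described above is sketched below in Python. It assumes each pipeline's result for a given sample has already been exported as a mapping from ASV sequence to read count; the function names and the toy sequences are hypothetical and not part of DADA2 or LotuS2. ASVs are matched by exact sequence, which only makes sense if both pipelines trim reads to the same region and length.

```python
def to_proportions(counts):
    """Convert {sequence: read count} into {sequence: proportion}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("table has no reads")
    return {seq: n / total for seq, n in counts.items()}


def compare_asv_tables(table_a, table_b):
    """Compare two {sequence: count} tables from the same sample."""
    seqs_a, seqs_b = set(table_a), set(table_b)
    props_a, props_b = to_proportions(table_a), to_proportions(table_b)
    # Bray-Curtis dissimilarity on proportions reduces to
    # 1 - sum(min(p_a, p_b)) because each side sums to 1.
    overlap = sum(
        min(props_a.get(s, 0.0), props_b.get(s, 0.0)) for s in seqs_a | seqs_b
    )
    return {
        "shared": sorted(seqs_a & seqs_b),
        "only_a": sorted(seqs_a - seqs_b),
        "only_b": sorted(seqs_b - seqs_a),
        "bray_curtis": 1.0 - overlap,
    }


# Toy tables for one sample (made-up sequences and counts).
standalone = {"ACGT": 80, "ACGA": 20}
lotus2 = {"ACGT": 150, "ACGA": 30, "ACGC": 20}

result = compare_asv_tables(standalone, lotus2)
print(result["only_b"], round(result["bray_curtis"], 3))
```

Comparing proportions rather than raw counts keeps the difference in total reads between the two pipelines from dominating the result, and the `only_a` and `only_b` lists point to the ASVs that one pipeline reports and the other does not.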