benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Merging two different runs may reflect in a samples separation in a downstream ordination? #1830

Closed MatS792 closed 1 month ago

MatS792 commented 9 months ago

Hi all, I am using DADA2 to analyze 13 biofilm samples sequenced for the 16S rRNA V4 region.
The samples were sequenced at two different times: 12 samples were sequenced together and 1 sample was sequenced one month later. Primers, sequencing company, Illumina seq., etc are the same.

I used two different approaches to analyze them:
1° approach: Running dada2 on the two groups separately (12 together in one dada2 run and 1 alone in another dada2 run) and merging the two seqtabs before removing the chimeras.
2° approach: Running dada2 on all the samples together.

In both approaches, the settings for filtering and trimming, error learning, dada2, and taxonomic assignment were the same. In terms of ASVs and reads, I got:

         1° approach    2° approach
ASVs     6853           7115
Reads    2791892        2766477

Once I move to phyloseq, after normalizing the reads, I run an nMDS, and this is what I find:

The sample that was sequenced separately is within the red square

From the 1° approach

[Image: nMDS plot from the 1° approach]

From the 2° approach

[Image: nMDS plot from the 2° approach]

In summary: when I follow the 1° approach, the sample that was sequenced later separates completely from the others. When I follow the 2° approach, this doesn't happen and the sample mixes well with other similar samples. The general idea is that when you have samples from different sequencing runs, you should process them in separate dada2 runs and merge everything before chimera removal.

How set in stone is this rule? Have you ever seen something like this? Is there any reason behind it?

Thank you for your help.

benjjneb commented 9 months ago

Our general recommendation is to do as you did in the 2nd approach: run different amplicon sequencing batches through dada2 separately, and then merge them before chimera removal. This allows the appropriate error model for each batch to be learned and applied. If two batches have different error models for some reason, processing them all together leads to errors in inference.

It's a bit surprising to see such a large effect given that "Primers, sequencing company, Illumina seq., etc are the same", but it's often hard to know exactly how similar some steps are when they are performed months apart.
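
To picture the "merge before chimera removal" step: each batch yields its own sample-by-ASV count table, and merging takes the union of the ASV columns and fills in zeros where a sample lacks an ASV. The following is only a toy Python sketch of that idea, not dada2's implementation (in dada2 itself this step is done with `mergeSequenceTables`). The function name, the sample names, and the four-base "sequences" are made up for illustration.

```python
def merge_sequence_tables(*tables):
    """Merge sample-by-ASV count tables given as {sample: {sequence: count}}.

    Toy illustration only: column union plus zero fill.
    """
    merged = {}
    for table in tables:
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample name: {sample}")
            merged[sample] = dict(counts)
    # Union of all ASV sequences seen in any batch, in a stable order
    all_seqs = sorted({seq for counts in merged.values() for seq in counts})
    # Give every sample the same columns, with 0 for ASVs it does not contain
    return {
        sample: {seq: counts.get(seq, 0) for seq in all_seqs}
        for sample, counts in merged.items()
    }


# Hypothetical toy batches: 2 samples from the first run, 1 from the later run
batch_a = {"S01": {"ACGT": 120, "ACGA": 30}, "S02": {"ACGT": 90, "TTGA": 10}}
batch_b = {"S13": {"ACGT": 100, "ACGC": 25}}

merged = merge_sequence_tables(batch_a, batch_b)
for sample, counts in merged.items():
    print(sample, counts)
```

Note that an ASV inferred in only one batch (here `ACGC`) becomes a column that is zero for every sample in the other batch, so any batch-specific inference differences carry straight through into the merged table.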

MatS792 commented 8 months ago

Do you suggest going ahead with the first approach even if it is less recommended?

benjjneb commented 8 months ago

Given your NMDS results, and the recommended way to treat data with DADA2, I would suggest the 2nd approach.

MatS792 commented 8 months ago

I am sorry, but I am getting confused. In your first answer, when you referred to the 2nd approach, did you mean the 1st approach I described (running 2 different batches and then merging them before chimera removal)?

benjjneb commented 8 months ago

Yes, I was confused.

Given your NMDS results, I would suggest the 2nd approach even though it doesn't follow the standard recommendation. Clearly the first approach is leading to a large post-processing batch effect, and given that it is just one sample, that may be due to a poorly fit error model from that single sample. So here, given the NMDS and the same methods used for each batch, putting everything together for processing is appropriate.
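
As a rough illustration of how inference differences in a single sample could show up as separation in an ordination, here is a toy Python sketch. It assumes a Bray-Curtis dissimilarity (the thread does not say which distance the nMDS used), and all counts and sequences are invented. The hypothetical `s_separate` sample has part of its reads assigned to a variant ASV that no other sample contains, which is one way a poorly fit error model might manifest.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {sequence: count} dicts."""
    seqs = set(a) | set(b)
    shared = sum(min(a.get(s, 0), b.get(s, 0)) for s in seqs)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total


# Reference sample from the main batch (made-up counts)
s_ref = {"ACGT": 80, "TTGA": 20}
# The late sample as it might look when denoised together with the others
s_joint = {"ACGT": 75, "TTGA": 25}
# The same sample with some reads assigned to a batch-specific variant (hypothetical)
s_separate = {"ACGT": 40, "ACGA": 35, "TTGA": 25}

d_joint = bray_curtis(s_ref, s_joint)
d_separate = bray_curtis(s_ref, s_separate)
print(f"joint processing:    {d_joint:.2f}")
print(f"separate processing: {d_separate:.2f}")
```

In this toy case the dissimilarity to the reference sample rises from 0.05 to 0.40 purely because of how reads were partitioned into ASVs, not because the underlying community changed.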