Differences in chimera detection based on dataset structure

seoldh commented 5 months ago

I have samples sequenced targeting the same 16S partial region from two different institutions, with about 120 and 300 samples respectively. I'm unsure how to correct for batch effects, so the first thing I did was try the following three commands to see what difference pooling makes:

(1) qiime dada2 denoise-paired --i-demultiplexed-seqs A_institution.qza --p-trunc-len-f N --p-trunc-len-r M --p-trim-left-f L --p-trim-left-r O --p-pooling-method pseudo --p-chimera-method pooled   (120 samples)
(2) qiime dada2 denoise-paired --i-demultiplexed-seqs B_institution.qza --p-trunc-len-f N --p-trunc-len-r M --p-trim-left-f L --p-trim-left-r O --p-pooling-method pseudo --p-chimera-method pooled   (300 samples)
(3) qiime dada2 denoise-paired --i-demultiplexed-seqs A+B_institution.qza --p-trunc-len-f N --p-trunc-len-r M --p-trim-left-f L --p-trim-left-r O --p-pooling-method pseudo --p-chimera-method pooled   (420 samples)

In (1), chimeras were detected and filtered out, but in (2) and (3) no chimeras were detected in any sample. That is, for every sample from A_institution (120 samples): the count after merging in (1) ≈ the count after merging in (3) = the count after chimera removal in (3) > the count after chimera removal in (1).

Are there any parameters I need to adjust for chimera detection when using a large dataset? Or could there be other causes? The sequencing quality plots for the raw data from the two institutions are similar, but institution A has an average read count of 40,000 per sample while institution B has an average of 170,000, a difference of about 4x.

benjjneb commented 5 months ago

The pooled chimera detection method should only be used if using pooled denoising. It should not be used with the default denoising (independent) or with pseudo-pooling. So I would recommend as a first step adjusting the pooling modality.

(it would be good to have clearer documentation on that, or maybe a warning message when pooling method and denoising method are misaligned)
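The pairing rule above, and the kind of warning suggested, can be sketched as a small shell guard placed in front of the QIIME 2 command. This is only an illustration under stated assumptions: `check_methods` is a hypothetical helper, not part of QIIME 2 or DADA2; the label `full` stands in for fully pooled denoising (`pool=TRUE` in the DADA2 R package); the switch to `consensus` (QIIME 2's default chimera method) for pseudo-pooling is an assumed alternative, not something stated in this thread; and the output file names are made up. The command is printed rather than executed.

```shell
#!/bin/sh
# Hypothetical guard, not part of QIIME 2 or DADA2.
# Returns 0 if the chimera method suits the pooling mode, 1 otherwise.
# "full" stands for fully pooled denoising (pool=TRUE in the DADA2 R package).
check_methods() {
    pooling="$1"   # independent | pseudo | full
    chimera="$2"   # consensus | pooled | none
    if [ "$chimera" = "pooled" ] && [ "$pooling" != "full" ]; then
        echo "warning: chimera method 'pooled' expects fully pooled denoising, got '$pooling'" >&2
        return 1
    fi
    return 0
}

POOLING=pseudo
CHIMERA=consensus   # assumption: consensus (the QIIME 2 default) instead of pooled

if check_methods "$POOLING" "$CHIMERA"; then
    # Dry run: print the command rather than executing it.
    # N, M, L, O are the placeholders from the original post; output names are made up.
    echo qiime dada2 denoise-paired \
        --i-demultiplexed-seqs A_institution.qza \
        --p-trunc-len-f N --p-trunc-len-r M \
        --p-trim-left-f L --p-trim-left-r O \
        --p-pooling-method "$POOLING" \
        --p-chimera-method "$CHIMERA" \
        --o-table table.qza \
        --o-representative-sequences rep-seqs.qza \
        --o-denoising-stats stats.qza
fi
```

With `POOLING=pseudo` and `CHIMERA=pooled` (the combination used in the original three commands), the guard prints the warning to stderr and the command is not emitted.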

seoldh commented 5 months ago

In all three cases, I used the parameters --p-pooling-method pseudo --p-chimera-method pooled. Since qiime only provides two pooling methods, independent and pseudo, I'm going to try using R with pool=TRUE instead of pseudo-pooling.

benjjneb / dada2

Differences in chimera detection based on dataset structure #1942