Open hjarnek opened 1 month ago
What is the idea behind pooling samples for chimera removal? As chimeras are formed during PCR, when the samples are separated, shouldn't they be independent of the sequences in other samples? I understand the logic behind pooling samples for denoising, where the goal is to remove sequencing errors, as those are expected to occur independently, and thus pooling simply increases the statistical population. But for chimera removal I don't think it's quite as clear. I've read your comment at https://github.com/benjjneb/dada2/issues/1042#issuecomment-648400339, where you state that chimeras often show up in multiple samples with very low abundances (~1), just like rare ASVs often do. However, as chimeras are formed independently within each sample tube during PCR, isn't that more of a coincidental observation? As compared to rare ASVs, which can be expected in multiple samples either through cross-contamination, or index hopping, or just naturally. In other words: is there a theoretical argument for pooling samples for chimera detection, or just an empirical one?
At its core, pooling rests on the idea that all samples come from a common community composition. If samples were completely distinct (no shared sequences), then pooling wouldn't matter.
Let us assume an idealized starting point: All samples drawn from an identical community.
Let us consider a chimera X that is produced from parent A and parent B at a rate* of one-fifth.
Expected read numbers: (A, B, X) = (5, 5, 1)
Realized read numbers: (2, 8, 0); (3, 1, 1); (10, 4, 0); (11, 2, 1); (3, 7, 2); (1, 9, 2); (5, 8, 0)
Per sample, the realized read counts for parent/chimera pairs include situations where the chimera has more reads than expected relative to its parents. Pooling averages out that per-sample read-count variability.
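To make the arithmetic concrete, here is a minimal Python sketch using the seven realized count triples from the example. The helper `parents_dominate` is hypothetical and only checks abundances: it asks whether both parents have more than `min_fold` times the reads of the candidate chimera. The value 2 is borrowed from the minFoldParentOverAbundance default quoted in this thread, and the strict comparison is an assumption. Real chimera detection also has to check that the sequence is actually a two-parent combination, which this sketch ignores.

```python
# Realized (A, B, X) read counts per sample, copied from the example above.
samples = [(2, 8, 0), (3, 1, 1), (10, 4, 0), (11, 2, 1), (3, 7, 2), (1, 9, 2), (5, 8, 0)]


def parents_dominate(a, b, x, min_fold=2.0):
    """Hypothetical abundance-only check: True if both parents have more
    than min_fold times the reads of the candidate chimera."""
    return a > min_fold * x and b > min_fold * x


# Per-sample view: only samples in which X was actually observed.
per_sample = [parents_dominate(a, b, x) for a, b, x in samples if x > 0]

# Pooled view: sum each column across all samples.
pooled = tuple(sum(column) for column in zip(*samples))

print("per-sample checks:", per_sample)
print("pooled counts (A, B, X):", pooled)
print("pooled check:", parents_dominate(*pooled))
```

In the four samples where X is observed, at least one parent fails the fold threshold every time, so an abundance check run sample by sample would not support calling X a chimera. The pooled counts are (35, 39, 6), close to the expected 5:5:1 ratio, and both parents clear the threshold comfortably.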
But how appropriate is that initial idea that all samples come from a common distribution? It isn't completely true of course, but when does it work better?
Hi Ben, thanks for your response. Does this imply that one should only pool samples for chimera detection if they come from the same environment (e.g. sampling replicates)? And should you still always pool all samples from the same sequencing run for denoising, for maximum accuracy?
I'm still curious about the difference in default settings between isBimeraDenovo and isBimeraDenovoTable, and whether you agree that correcting PCR point errors constitutes a problem for chimera detection (I can't see how it would).
Does this imply that one should only pool samples for chimera detection if they come from the same environment (e.g. sampling replicates)? And should you still always pool all samples from the same sequencing run for denoising, for maximum accuracy?
Pooling samples only makes sense if there is commonality amongst the samples. The idea of pooling is that for any one sample there is helpful information from the other samples being pooled with it that could increase the accuracy of resolving true ASVs from errors/chimeras.
I'm still curious about the difference in default settings between isBimeraDenovo and isBimeraDenovoTable, and whether you agree that correcting PCR point errors constitutes a problem for chimera detection (I can't see how it would).
The isBimeraDenovoTable settings are more "aggressive", because the voting across samples is an additional factor that prevents false positive chimera identification. Pooling and then isBimeraDenovo does not consider per-sample information and thus is more prone to false-positive chimera ID, hence the more stringent default parameters for identifying chimeric ASVs.
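The cross-sample voting idea can be sketched in Python as follows. This is not the DADA2 source: the function name, the two thresholds (`min_sample_fraction`, `ignore_n_negatives`), their values, and the exact decision rule are all assumptions made for illustration, and the per-sample flags are made-up inputs. The point is only that a sequence gets removed when it looks chimeric in (nearly) all samples where it occurs, so a single unlucky sample cannot condemn it on its own.

```python
def consensus_vote(flags, min_sample_fraction=0.9, ignore_n_negatives=1):
    """Hypothetical consensus rule over per-sample chimera flags.

    flags holds one boolean per sample in which the sequence is present:
    True means the sequence was flagged as chimeric in that sample.
    Thresholds and rule are illustrative assumptions, not DADA2's code.
    """
    n_present = len(flags)
    n_flagged = sum(flags)
    if n_present == 0:
        return False  # sequence not observed anywhere, nothing to vote on
    if n_flagged == n_present:
        return True  # flagged in every sample where it occurs
    # Tolerate a few dissenting samples, then require a large majority.
    needed = (n_present - ignore_n_negatives) * min_sample_fraction
    return n_flagged > 0 and n_flagged >= needed


# Flagged in 3 of 4 samples: the vote removes it.
print(consensus_vote([True, True, True, False]))
# Flagged in only 1 of 4 samples: the vote keeps it.
print(consensus_vote([True, False, False, False]))
```

Because the vote itself guards against false positives, the per-sample test feeding it can afford looser abundance thresholds than a single test on pooled counts.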
Hi Ben,
I have a few questions regarding de-novo chimera removal.
1) What is the idea behind pooling samples for chimera removal? As chimeras are formed during PCR, when the samples are separated, shouldn't they be independent of the sequences in other samples? I understand the logic behind pooling samples for denoising, where the goal is to remove sequencing errors, as those are expected to occur independently, and thus pooling simply increases the statistical population. But for chimera removal I don't think it's quite as clear. I've read your post here, where you state that chimeras often show up in multiple samples with very low abundances (~1), just like rare ASVs often do. However, as chimeras are formed independently within each sample tube during PCR, isn't that more of a coincidental observation? As compared to rare ASVs, which can be expected in multiple samples either through cross-contamination, or index hopping, or just naturally. In other words; is there a theoretical argument for pooling samples for chimera detection, or just empirical?
2) Why are the default values minFoldParentOverAbundance = 2 and minParentAbundance = 8 for isBimeraDenovo, but minFoldParentOverAbundance = 1.5 and minParentAbundance = 2 for isBimeraDenovoTable? The latter seems very stringent. As you mentioned in the linked thread, Robert Edgar (2016) recommended increasing minFoldParentOverAbundance (for pooled samples). You mentioned 8 as a possible target, but I can't find that recommendation in the paper, rather around 4. Is there any particular reason you recommended minFoldParentOverAbundance = 8 for pooled samples, and do you still recommend that?
3) In the same paper, under the heading "What is an amplicon?", Edgar argues that as denoising will remove PCR point errors just as sequencing point errors, it will degrade chimera filtering. He then hypothesizes that this is why DADA2 allows one point error/indel in chimera detection by default, to compensate for corrected PCR errors. But he considers this approach inferior, stating the expected higher FP rate, assuming PCR errors are rare and that most of them are created in later cycles. I would love to hear your thoughts on this. I can't quite make sense of why correcting PCR point errors would degrade the chimera detection. Wouldn't the problem rather be if they are not corrected? And why wouldn't they be corrected, by the way? If we assume PCR errors are indeed rare and created mainly in the later cycles, shouldn't chimeric sequences carrying PCR errors have such low abundances that they can never form their own centroids during denoising? And why do we actually expect PCR errors mainly in the later cycles?
By the way, I noticed a minor documentation error for isBimera. It says in the description that
But the argument description for allowOneOff says that
I assume the latter is correct. Latest DADA2, 1.32.0.
Cheers