benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Questions about chimera filtering #2030

Open hjarnek opened 1 month ago

hjarnek commented 1 month ago

Hi Ben,

I have a few questions regarding de-novo chimera removal.

1) What is the idea behind pooling samples for chimera removal? As chimeras are formed during PCR, when the samples are separated, shouldn't they be independent of the sequences in other samples? I understand the logic behind pooling samples for denoising, where the goal is to remove sequencing errors: those are expected to occur independently, so pooling simply increases the statistical population. But for chimera removal I don't think it's quite as clear. I've read your post here (https://github.com/benjjneb/dada2/issues/1042#issuecomment-648400339), where you state that chimeras often show up in multiple samples with very low abundances (~1), just like rare ASVs often do. However, as chimeras are formed independently within each sample tube during PCR, isn't that more of a coincidental observation? Rare ASVs, by contrast, can be expected in multiple samples through cross-contamination, index hopping, or just naturally. In other words, is there a theoretical argument for pooling samples for chimera detection, or just an empirical one?

2) Why are the default values minFoldParentOverAbundance = 2 and minParentAbundance = 8 for isBimeraDenovo, but minFoldParentOverAbundance = 1.5 and minParentAbundance = 2 for isBimeraDenovoTable? The latter seems very stringent. As you mentioned in the linked thread, Robert Edgar (2016) recommended increasing minFoldParentOverAbundance (for pooled samples). You mentioned 8 as a possible target, but I can't find that recommendation in the paper; it seems to suggest around 4. Is there any particular reason you recommended minFoldParentOverAbundance = 8 for pooled samples, and do you still recommend that?

3) In the same paper, under the heading "What is an amplicon?", Edgar argues that because denoising removes PCR point errors just as it removes sequencing point errors, it will degrade chimera filtering. He then hypothesizes that this is why DADA2 allows one point error/indel in chimera detection by default, to compensate for corrected PCR errors. But he considers this approach inferior, citing an expected higher false-positive rate, on the assumption that PCR errors are rare and that most of them arise in later cycles. I would love to hear your thoughts on this. I can't quite make sense of why correcting PCR point errors would degrade chimera detection. Wouldn't the problem rather be if they are not corrected? And why wouldn't they be corrected, by the way? If we assume PCR errors are indeed rare and arise mainly in the later cycles, shouldn't chimeric sequences carrying PCR errors have such low abundances that they can never form their own centroids during denoising? And why do we actually expect PCR errors mainly in the later cycles?

By the way, I noticed a minor documentation error for isBimera. It says in the description that

> Bimeras that are one-off from exact are also identified if the allowOneOff argument is TRUE.

But the argument description for allowOneOff says that

> If FALSE, sequences that have one mismatch or indel to an exact bimera are also flagged as bimeric.

I assume the latter is correct. Latest DADA2, 1.32.0.

Cheers

benjjneb commented 1 month ago

> What is the idea behind pooling samples for chimera removal? As chimeras are formed during PCR, when the samples are separated, shouldn't they be independent of the sequences in other samples? I understand the logic behind pooling samples for denoising, where the goal is to remove sequencing errors, as those are expected to occur independently, and thus pooling simply increases the statistical population. But for chimera removal I don't think it's quite as clear. I've read your post (https://github.com/benjjneb/dada2/issues/1042#issuecomment-648400339), where you state that chimeras often show up in multiple samples with very low abundances (~1), just like rare ASVs often do. However, as chimeras are formed independently within each sample tube during PCR, isn't that more of a coincidental observation? As compared to rare ASVs, which can be expected in multiple samples either through cross-contamination, or index hopping, or just naturally. In other words, is there a theoretical argument for pooling samples for chimera detection, or just empirical?

At its core, pooling rests on the idea that all samples share a common community composition. If samples were completely distinct (no shared sequences), then pooling wouldn't matter.

Let us assume an idealized starting point: all samples are drawn from an identical community.

Let us consider a chimera X that is produced from parent A and parent B at a rate* of one-fifth.

Expected read numbers: (A, B, X) = (5, 5, 1)

Realized read numbers: (2, 8, 0); (3, 1, 1); (10, 4, 0); (11, 2, 1); (3, 7, 2); (1, 9, 2); (5, 8, 0)

Per-sample, the realized read counts for parent/chimera pairs include situations where the chimera has more reads than expected relative to its parents. Pooling averages out that per-sample read count variability.
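To make the arithmetic concrete, here is a minimal Python sketch that applies a simplified parent-abundance rule to the realized counts listed above, first per sample and then to the pooled totals. The `parents_qualify` helper is hypothetical and only a stand-in for the idea behind minFoldParentOverAbundance and minParentAbundance; it is not DADA2's implementation, and it ignores sequence alignment entirely. The thresholds (2 and 8) are the isBimeraDenovo defaults quoted earlier in this thread.

```python
# Realized (A, B, X) read counts from the example above, one tuple per sample.
samples = [(2, 8, 0), (3, 1, 1), (10, 4, 0), (11, 2, 1),
           (3, 7, 2), (1, 9, 2), (5, 8, 0)]


def parents_qualify(a, b, x, min_fold=2.0, min_abund=8):
    """Simplified, hypothetical rule (an assumption, not DADA2's code):
    both parents must have more than min_fold * x reads and at least
    min_abund reads to count as possible parents of the chimera X."""
    return all(p > min_fold * x and p >= min_abund for p in (a, b))


# Per-sample: only samples in which the chimera was observed are evaluated.
per_sample = [parents_qualify(a, b, x) for a, b, x in samples if x > 0]

# Pooled: sum each column across samples, then apply the same rule once.
pooled = tuple(sum(col) for col in zip(*samples))

print(per_sample)                        # one True/False per sample with X > 0
print(pooled, parents_qualify(*pooled))  # pooled (A, B, X) and its verdict
```

Under this toy rule, none of the four samples containing X has both parents sufficiently abundant, while the pooled totals (35, 39, 6) sit close to the expected 5:5:1 ratio and both parents qualify.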

But how appropriate is that initial idea that all samples come from a common distribution? It isn't completely true of course, but when does it work better?

hjarnek commented 1 week ago

Hi Ben, thanks for your response. Does this imply that one should only pool samples for chimera detection if they come from the same environment (e.g. sampling replicates)? And should you still always pool all samples from the same sequencing run for denoising, for maximum accuracy?

I'm still curious about the difference in default settings between isBimeraDenovo and isBimeraDenovoTable, and whether you agree that correcting PCR point errors constitutes a problem for chimera detection (I can't see how it would).

benjjneb commented 2 days ago

> Does this imply that one should only pool samples for chimera detection if they come from the same environment (e.g. sampling replicates)? And should you still always pool all samples from the same sequencing run for denoising, for maximum accuracy?

Pooling samples only makes sense if there is commonality amongst the samples. The idea of pooling is that for any one sample there is helpful information from the other samples being pooled with it that could increase the accuracy of resolving true ASVs from errors/chimeras.

> I'm still curious about the difference in default settings between isBimeraDenovo and isBimeraDenovoTable, and whether you agree that correcting PCR point errors constitutes a problem for chimera detection (I can't see how it would).

The isBimeraDenovoTable settings (minFoldParentOverAbundance = 1.5, minParentAbundance = 2) are more "aggressive", because the voting across samples is an additional factor that prevents false-positive chimera identification. Pooling and then running isBimeraDenovo does not consider per-sample information and is thus more prone to false-positive chimera identification, hence its more stringent default parameters (minFoldParentOverAbundance = 2, minParentAbundance = 8) for identifying chimeric ASVs.
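As a rough illustration of why voting allows looser per-sample thresholds, the Python sketch below treats each sample's chimera call as a vote and flags a sequence only when a large share of the samples containing it agree. The `consensus_flag` function, the `min_sample_fraction` parameter name, and the 0.9 threshold are all assumptions made for illustration; the actual voting rule in isBimeraDenovoTable is not described in this thread.

```python
def consensus_flag(votes, min_sample_fraction=0.9):
    """Hypothetical consensus rule (an assumption, not DADA2's code).

    votes: one boolean per sample in which the sequence was observed,
    True if that sample's per-sample check called it chimeric.
    The sequence is flagged only if the fraction of True votes reaches
    min_sample_fraction.
    """
    if not votes:
        return False  # never observed, so nothing to flag
    return sum(votes) / len(votes) >= min_sample_fraction


# A sequence called chimeric in every sample where it appears is flagged.
unanimous = consensus_flag([True, True, True, True])

# A lone aggressive call in one of four samples is outvoted.
outvoted = consensus_flag([True, False, False, False])

print(unanimous, outvoted)
```

A single sample where the parents happen to look abundant enough cannot flag the sequence on its own, which is the kind of protection against false positives that the pooled isBimeraDenovo route lacks.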