large percentage of merged reads flagged as chimeric, but not due to primers

ympiceno commented 5 years ago

Hello,

I'm working with NextSeq 16S data from stool samples. I would like to track sequences from one group of samples through other samples (donor to recipient), so it is important to recover as many true variants as possible, even rare ones, while obtaining the 'cleanest' dataset possible. Our sequencing primers were designed so that the primer sequence itself is not sequenced, so my understanding is that we do not need to trim primers or other non-target 'bits' from the reads. We have paired-end reads of roughly 150 bp in each direction. The samples were sequenced across multiple NextSeq runs, so I'm processing each run through denoising and merging separately before combining the full set of samples. To test the process, though, I've run some smaller sets through the chimera checking step to see which parameters might be best for the rest of the sets. After merging forward and reverse reads, I use the in silico trimming option to keep only merged reads of 252-254 bp. I am seeing nearly 90% of the merged sequence variants and 20-30% of the merged reads removed as chimeric. As the tutorial notes, "Here chimeras make up about 21% of the merged sequence variants, but when we account for the abundances of those variants we see they account for only about 4% of the merged sequence reads", so it would appear there are parameters I should change to improve my result.
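
For readers comparing their own numbers with the tutorial quote above, the two percentages measure different things: one counts distinct variants, the other weights each variant by its read count. The following is a minimal Python sketch of that arithmetic using made-up toy counts (the function name `chimera_fractions` and all numbers are hypothetical, not taken from this dataset). In the dada2 tutorial the read-level figure comes from comparing the sequence table totals before and after chimera removal.

```python
def chimera_fractions(variant_counts, is_chimera):
    """Return (fraction of variants flagged, fraction of reads flagged)."""
    total_variants = len(variant_counts)
    total_reads = sum(variant_counts)
    flagged_variants = sum(1 for flag in is_chimera if flag)
    flagged_reads = sum(n for n, flag in zip(variant_counts, is_chimera) if flag)
    return flagged_variants / total_variants, flagged_reads / total_reads


# Toy data (hypothetical): three abundant real variants, seven rare chimeras.
counts = [5000, 3000, 1500, 40, 30, 20, 10, 10, 5, 5]
flags = [False, False, False, True, True, True, True, True, True, True]

variant_frac, read_frac = chimera_fractions(counts, flags)
print(f"variants flagged: {variant_frac:.1%}, reads flagged: {read_frac:.1%}")
```

In this toy case most variants are flagged but only a small share of reads is, which is the pattern the tutorial describes. A high read-level fraction, as reported here, is the more informative signal.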

I have used a variety of filterAndTrim settings to test their effects on the number of merged reads passing the chimera check step (e.g., truncLen = c(150, 140) or no truncation, maxEE = c(2,2) or c(2,3) ... c(2,5), and matchIDs = TRUE). These have not changed the number of reads passing the chimera check much overall. I've also tried running the denoising step with no pooling, with pool = TRUE, and with pool = "pseudo"; again, not much difference. To see whether the large number of reads per sample (e.g., >200K) meant the error learning was being performed on too few samples, and so possibly not modeling error across the full run very well, I increased nbases to 1e+09 in learnErrors; no effect that I could discern. For the removeBimeraDenovo step, I tried all three methods (method = "consensus", "pooled", or "per-sample"), again to no real avail.

Is it common for adult stool samples, which potentially/presumably contain many sequences from highly related bacteria, to yield a large percentage of chimeric merged reads when processed through DADA2? In the extraction blanks I've included, >95% of merged reads pass the chimera check, which I presume is because there are few real sequences and the common contaminants observed in sequencing data are not highly related (relatively speaking). Should I try altering the OMEGA_A parameter, or is this also unlikely to have much effect, given what I've tried so far? I'd like to move forward with the rest of the datasets, so if this has been observed previously for adult stool samples, or if there is nothing else obvious that is likely to improve the results after chimera checking, I will proceed despite the tutorial's caution that "If most of your reads were removed as chimeric, upstream processing may need to be revisited."

Thank you for any thoughts/suggestions you may offer.

benjjneb commented 5 years ago

20-30% of reads as chimeric is high, but it isn't that uncommon either. Chimeric read formation depends quite a bit on the details of the PCR protocol, and so can vary a lot from experiment to experiment. It also depends on the presence of varying but similar sequences (so that part of one can "prime" the elongation of the other variant during PCR and produce a chimera). Hence the fact that you see far fewer in your mock, which probably has very few, well-separated strains, is again not that surprising.

You are asking the right questions though, and the key parameters to inspect are the removeBimeraDenovo parameters. Your observation that you are getting similar results with all three methods suggests this data just has a lot of chimeras, but the one other thing I would try is raising minFoldParentOverAbundance to larger values, e.g. 4 or 8. That can prevent false-positive chimera identifications that can arise between real sequences that differ by just 1 or 2 nucleotides. But if you still see a similar fraction of reads being flagged as chimeric with those higher values, I would interpret it as meaning you have a high-chimera PCR protocol and just move on.
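
To illustrate why a larger minFoldParentOverAbundance makes chimera calling more conservative, here is a toy Python sketch of the parent-eligibility idea only: a sequence can serve as a candidate "parent" of a query only if it is more than that many times as abundant as the query. This is not dada2's implementation; the function name `eligible_parents` and the abundances are hypothetical, and the real check also requires the query to be reconstructable from a left and a right parent segment. In dada2 itself the parameter would be passed through the R call, e.g. `removeBimeraDenovo(seqtab, method = "consensus", minFoldParentOverAbundance = 8)`, where `seqtab` stands for your sequence table.

```python
def eligible_parents(abundances, query, min_fold):
    """Names of sequences abundant enough to be candidate parents of `query`.

    Toy rule (an assumption for illustration): a candidate parent must be
    strictly more than `min_fold` times as abundant as the query.
    """
    threshold = min_fold * abundances[query]
    return [name for name, n in abundances.items()
            if name != query and n > threshold]


# Hypothetical abundances: C is a moderately abundant variant that differs
# from A and B by only a nucleotide or two.
abund = {"A": 1000, "B": 300, "C": 130}

for fold in (2, 4, 8):
    parents = eligible_parents(abund, "C", fold)
    # In this toy model a two-parent chimera needs at least two candidates.
    print(fold, parents, "can be tested as bimera:", len(parents) >= 2)
```

As the fold threshold rises, fewer sequences qualify as parents of C, so a real variant of moderate abundance is less likely to be mistaken for a chimera of its close relatives. If the flagged fraction barely changes at 4 or 8, that points to genuine chimeras rather than false positives.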

ympiceno commented 5 years ago

Great; thank you. I will work with the minFoldParentOverAbundance parameter to see what that yields. Thanks so much for the quick reply!

benjjneb / dada2

large percentage of merged reads flagged as chimeric, but not due to primers #602