DADA2 for PacBio, and MiSeq resulted in high counts of chimeric sequences

emankhalaf commented 2 years ago

Hi,

I am new to PacBio technology and I am currently working on 16S sequences generated from both Illumina MiSeq and PacBio Sequel II tech. I have 3 different projects including different plant tissues. I have a few questions regarding the initial analyses of 16S reads.

First, regarding PacBio sequences, I am following the DADA2 tutorials that accompany Callahan et al., 2019 (High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution). I noticed that in the "DADA2 + PacBio: Fecal Samples" script chimeras were detected (39 chimeric sequences), yet the downstream analyses were performed on st.rds, not st.nochim. Is that okay?

In the tutorials, taxonomy was assigned based on st, not st.nochim. Is that right? Should I proceed the same way in my analysis protocol? I think I should use st.nochim to assign taxonomy and continue the analyses, correct?
When I checked chimeras from the first project: 535 sequences identified where 182 were bimera (bim true (351), false (184)), and [sum(st[,bim])/sum(st)], I got this (0.06553758). Is it a good value for my sequences?
In the DADA2 documentation, the same paragraph is repeated for several functions: (qualityType: (Optional). character(1). The quality encoding of the fastq file(s). "Auto" (the default) means to attempt to auto-detect the encoding. This may fail for PacBio files with uniformly high quality scores, in which case use "FastqQuality". This parameter is passed on to readFastq; see information there for details.) I do not know how to use this parameter. How can I tell whether I need to set it for my data?

Second, regarding Illumina MiSeq sequences (with which I have experience from previous projects), I use QIIME 2 for the analyses. I noticed that the "percentage of input non-chimeric" value (dada2-stat.qzv) for the samples from the 3 projects ranged between 15% and 25%, which is too low, whereas in a previous project most samples were above 90%. I am wondering how I can track down the reason for this huge variability. I checked with the sequencing facility, and they responded that everything is okay on their side regarding the sequencing protocol and quality checks. Indeed, the quality of the sequences looked good when I checked demux_paired_end.qzv. Any explanation, please?

Thanks for any help you can provide! Eman

benjjneb commented 2 years ago

> In the tutorials, taxonomy was assigned based on st, not st.nochim. Is that right? Should I proceed the same way in my analysis protocol? I think I should use st.nochim to assign taxonomy and continue the analyses, correct?

You should probably use st.nochim.
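
A minimal sketch of that order of operations in R, assuming `st` is the sequence table saved as st.rds. The helper name and both file paths are hypothetical placeholders, and the `dada2::` calls only execute once the function is invoked with real inputs:

```r
# Hypothetical helper: remove chimeras first, then assign taxonomy on the result.
# st_path   : path to the saved sequence table (placeholder)
# ref_fasta : path to a taxonomy training fasta (placeholder)
assign_taxonomy_nochim <- function(st_path, ref_fasta, min_fold = 3.5) {
  st <- readRDS(st_path)

  # Drop ASVs flagged as bimeras; 3.5 is the value discussed in this thread.
  st.nochim <- dada2::removeBimeraDenovo(
    st,
    method = "consensus",
    minFoldParentOverAbundance = min_fold,
    multithread = TRUE
  )

  # Assign taxonomy on the chimera-free table rather than on st.
  tax <- dada2::assignTaxonomy(st.nochim, ref_fasta, multithread = TRUE)

  list(st.nochim = st.nochim, tax = tax)
}
```

It would be called as `res <- assign_taxonomy_nochim("path/to/st.rds", "path/to/reference.fa.gz")`, with downstream steps then using `res$st.nochim` and `res$tax`.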

> When I checked chimeras from the first project: 535 sequences identified where 182 were bimera (bim true (351), false (184)), and [sum(st[,bim])/sum(st)], I got this (0.06553758). Is it a good value for my sequences?

This is well within expectations. The PCR protocol used in the DADA2+PacBio paper had a low cycle number (12, if I remember correctly) and can be expected to produce fewer chimeras than higher-cycle-number protocols. As long as you are using minFoldParentOverAbundance=3.5 (or higher), I think you are fine to proceed. PCR produces many chimeras, but mostly at very low frequencies, which is consistent with what you are seeing.
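
To illustrate why many chimeric sequences can still account for few reads, here is a small base-R sketch with made-up counts. The toy table and the `bim` flags are assumptions for illustration only; in a real run `bim` would come from `isBimeraDenovo()`:

```r
# Toy sequence table: 2 samples x 4 ASVs (hypothetical counts).
st <- matrix(
  c(900,  50, 30, 20,
    800, 100, 60, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("S1", "S2"), c("ASV1", "ASV2", "ASV3", "ASV4"))
)

# Hypothetical bimera flags, one per ASV (column).
bim <- c(FALSE, FALSE, TRUE, TRUE)

# Fraction of ASVs flagged as chimeric.
frac_asvs <- mean(bim)

# Fraction of reads that belong to flagged ASVs, as in sum(st[,bim])/sum(st).
frac_reads <- sum(st[, bim]) / sum(st)

print(c(frac_asvs = frac_asvs, frac_reads = frac_reads))
```

In this toy table half of the ASVs are flagged, but they hold only 150 of 2000 reads, which is the same pattern as a large chimera count alongside a read fraction of about 0.066.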

> In the DADA2 documentation, the same paragraph is repeated for several functions: (qualityType: (Optional). character(1). The quality encoding of the fastq file(s). "Auto" (the default) means to attempt to auto-detect the encoding. This may fail for PacBio files with uniformly high quality scores, in which case use "FastqQuality". This parameter is passed on to readFastq; see information there for details.) I do not know how to use this parameter. How can I tell whether I need to set it for my data?

If things run without errors afterwards, you are OK.
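
If auto-detection does fail, the parameter is simply passed by name. A hedged sketch, in which the helper name and the input path are hypothetical; note that this version retries on any error, not only on an encoding problem, so the printed message should be read before trusting the retry:

```r
# Hypothetical helper: dereplicate filtered PacBio reads with the default
# qualityType = "Auto", and retry with an explicit encoding only on failure.
derep_pacbio <- function(filt_fastq) {
  tryCatch(
    dada2::derepFastq(filt_fastq, verbose = TRUE),
    error = function(e) {
      message("Default run failed: ", conditionMessage(e))
      message("Retrying with qualityType = \"FastqQuality\"")
      dada2::derepFastq(filt_fastq, qualityType = "FastqQuality", verbose = TRUE)
    }
  )
}
```

The same `qualityType` argument is accepted by `filterAndTrim()` and `learnErrors()`, so it can be set there in the same way if one of those steps is where the failure appears.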

> Second, regarding Illumina MiSeq sequences (with which I have experience from previous projects), I use QIIME 2 for the analyses. I noticed that the "percentage of input non-chimeric" value (dada2-stat.qzv) for the samples from the 3 projects ranged between 15% and 25%, which is too low, whereas in a previous project most samples were above 90%. I am wondering how I can track down the reason for this huge variability. I checked with the sequencing facility, and they responded that everything is okay on their side regarding the sequencing protocol and quality checks. Indeed, the quality of the sequences looked good when I checked demux_paired_end.qzv. Any explanation, please?

I would need a more complete description of what you are seeing here to try to diagnose, but the cause of the vast majority of issues with too many chimeras being detected is that primers were not removed. Is there a difference in how you are (or aren't) removing primers in these two workflows?
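
One quick way to check this is to count how many reads still begin with the primer. The base-R sketch below is illustrative only: the primer shown (515F) and the three reads are assumptions, since the primers used in these projects are not stated in this thread:

```r
# IUPAC nucleotide codes expressed as regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Build a regex that matches a (possibly degenerate) primer at the read start.
primer_regex <- function(primer) {
  paste0("^", paste(iupac[strsplit(primer, "")[[1]]], collapse = ""))
}

fwd_primer <- "GTGYCAGCMGCCGCGGTAA"  # assumed primer, for illustration only

# Hypothetical reads: the first two start with the primer, the third does not.
reads <- c("GTGCCAGCAGCCGCGGTAATACGGAGGGTGC",
           "GTGTCAGCCGCCGCGGTAATACGTAGG",
           "TACGGAGGGTGCAAGCGTTAATCGG")

has_primer <- grepl(primer_regex(fwd_primer), reads)
frac_with_primer <- mean(has_primer)
print(frac_with_primer)
```

If most reads in a sample still start with the primer, the primers have not been removed. They can then be trimmed before denoising, for example with `trimLeft` in `filterAndTrim()` or with `--p-trim-left-f` and `--p-trim-left-r` in `qiime dada2 denoise-paired`.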

emankhalaf commented 2 years ago

@benjjneb

Thanks so much for your helpful answers. Yes, the primers had not been removed, so I adjusted the denoise code, and the non-chimeric sequences now range between 73% and 84%, with the majority above 80%.

Thanks again!

emankhalaf commented 2 years ago

@benjjneb One more question: my current projects aim to explore the microbiome of overlooked plant tissues using 2 different sequencing technologies (MiSeq vs PacBio). Is there any problem with using different minFoldParentOverAbundance values in each workflow? For example, for the Illumina sequences I tried minFoldParentOverAbundance 1 (default) and 3.5 (as for the PacBio sequences), and I got the same results. For the PacBio sequences, I tried 3.5 and 8, and again got the same results. So, is it okay to use minFoldParentOverAbundance 1 for Illumina and 3.5 for PacBio for the same sequenced samples?

benjjneb commented 2 years ago

> Is there any problem with using different minFoldParentOverAbundance values in each workflow? For example, for the Illumina sequences I tried minFoldParentOverAbundance 1 (default) and 3.5 (as for the PacBio sequences), and I got the same results. For the PacBio sequences, I tried 3.5 and 8, and again got the same results. So, is it okay to use minFoldParentOverAbundance 1 for Illumina and 3.5 for PacBio for the same sequenced samples?

Yes, it is OK. You used different sequencing technologies already. It is OK to use different processing parameters that are most appropriate to each technology.

Parameters shouldn't be arbitrarily different. Here, there is a reason for the difference. PacBio full-length 16S sequencing will expose the full allelic variation of the 16S rRNA gene in each bacterium, and too low a minFoldParentOverAbundance will sometimes lead to the minor alleles being classified as chimeras. This isn't an issue with short-read data.
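
The effect of the threshold can be sketched in base R. This is a simplified illustration of the abundance criterion only, not of the full bimera test: the helper name and the abundances are hypothetical, and the only rule taken from DADA2 is that a sequence can be a "parent" only if it is more than minFoldParentOverAbundance times as abundant as the candidate:

```r
# Hypothetical helper: which sequences are abundant enough to be parents of
# `child` under a given fold threshold?
eligible_parents <- function(abund, child, min_fold) {
  keep <- abund > min_fold * abund[[child]] & names(abund) != child
  names(abund)[keep]
}

# Made-up abundances: a minor 16S allele at one third of its major allele,
# plus an unrelated, more abundant taxon.
abund <- c(major_allele = 600, other_taxon = 1000, minor_allele = 200)

# Threshold 2: parents need > 400 reads, so both other sequences qualify.
parents_low <- eligible_parents(abund, "minor_allele", 2)

# Threshold 3.5: parents need > 700 reads, so only one sequence qualifies.
parents_high <- eligible_parents(abund, "minor_allele", 3.5)

print(parents_low)
print(parents_high)
```

A bimera needs two distinct parents, so with a single eligible parent at 3.5 the minor allele in this toy case would not be expected to be flagged, whereas at 2 it could be if it happens to look like a mix of the other two sequences.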

emankhalaf commented 2 years ago

@benjjneb Thanks so much for your clarification. Best, Eman

benjjneb / dada2

DADA2 for PacBio, and MiSeq resulted in high counts of chimeric sequences #1445