benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
464 stars 142 forks source link

Questions about HiFi filtering process #2013

Open mentorwan opened 2 weeks ago

mentorwan commented 2 weeks ago

Hi DADA2 help,

We used DADA2 to process our PacBio HiFi reads from two machines: SeqII and Revio. We completed the DADA2 pipeline for both, but encountered an issue with the Revio samples—only about 60% passed the filtering step. Here is an example:

Sample ID Input Filtered Percentage Passed Filter Denoised Non-chimeric Percentage Non-chimeric SC718268 324183 204214 62.99% 202203 192609 59.41%

For comparison, here are the results for the same sample from the SeqII machine:

Sample ID Input Filtered Percentage Passed Filter Denoised Non-chimeric Percentage Non-chimeric SC718268 45832 37415 81.64% 36725 36215 79.02%

We confirmed that adapter sequences were removed before running DADA2, using maxEE=2 as a default parameter. Can you explain why there is a discrepancy between the SeqII (80%) and Revio (60%) read percentages? Are there any parameters we should check or troubleshoot? Thanks a lot.

benjjneb commented 2 weeks ago

Could you clean up your tables to display better on Github? Right now I can't tell what numbers correspond to what column.

mentorwan commented 2 weeks ago

Here it is. First line is from Revio machine, Second line is from SeqII machine

image
benjjneb commented 2 weeks ago

The percentage of reads making it through the steps after filtering is very high in both cases, so I don't see any fundamental concerns with the data here. In general read loss at the filtering step is the least problematic kind of read loss.

My first guess as to why there are more reads being lost from the Revio data than the SeqII data is that the data is lower average quality, so more are being lost to the maxEE filter. You can inspect the plotQualityProfile output as a basic check on that. There may also be differences in the bioinformatics choices being made upstream. In particular, there are two important flags in the pacbio software that outputs the CCS (HiFi) reads from the raw Pacbio ddata, minPredictedAccuracy and minPasses (I think that's what it was called). If those parameters were different between the two runs, than that could explain the change in average read quality between the two.

mentorwan commented 2 weeks ago

Thanks. Will check those parameters.

proteinosome commented 4 days ago

@mentorwan I just saw this issue and wanted to mention that it's potentially because you're running the Kinnex 16S kit on the Revio vs the standard full-length protocol on Sequel II. With the Kinnex kit because the 16S sequences are concatenated into a long molecule, they go through lower number of passes and could have lower average accuracy (Usually >Q30 average) than a monomer 16S run (Can be > Q40 average).

However, I should note that the increase in total number of reads with the Kinnex even with very strict Q score filtering (E.g. you can do a hard Q30 read-quality cut-off for Kinnex) is still far more than what DADA2 throws out due to lower average quality.

Hope this helps.

mentorwan commented 4 days ago

@proteinosome Thanks for the insights. We found that Sequel II has a high QC range (>40) while Revio has a low QC range (>30). We use a QC value of 20 for filtering, but DADA2 seems to discard more reads. Changing MaxEE from 2 to 12 helped significantly (over 95% of reads were retained). We haven't tried using a QC value of 30 for filtering yet. You might be right about its effectiveness. Thanks again!

proteinosome commented 4 days ago

@mentorwan Yeah, that sounds to me like what you would expect from two different protocols. You can get Q40 reads on the Revio too if you go with the standard monomer full-length 16S protocol instead of using Kinnex, if you're willing to accept the hit in throughput, but that is typically not what I would recommend since even with the hard filtering you get a lot more reads from Kinnex.

Re: using maxEE of 12, I don't know if that's a good idea. The advantage of going with the ASV approach is that you get a set of ultra-accurate sequences representative of the species in your data. If you're so inclined to make use of all the reads you could just re-map the reads back onto the set of ASV sequences after the DADA2 pipeline, and count the abundance from the alignment. That way at least you don't introduce errors in the ASVs. I'll defer to @benjjneb to comment on whether that's a reasonable approach or not.

FWIW I develop PacBio's HiFi-16S-workflow and are very much interested in understanding problems like this from users.

benjjneb commented 3 days ago

Re: using maxEE of 12, I don't know if that's a good idea. The advantage of going with the ASV approach is that you get a set of ultra-accurate sequences representative of the species in your data. If you're so inclined to make use of all the reads you could just re-map the reads back onto the set of ASV sequences after the DADA2 pipeline, and count the abundance from the alignment. That way at least you don't introduce errors in the ASVs. I'll defer to @benjjneb to comment on whether that's a reasonable approach or not.

Completely agree with this. Another perspective...

What you are trying to get out of 16S sequencing is a measurement of the membership in the microbial community and their relative abundances. The total number of reads that make it through the pipeline is only relevant in the sense that it can affect the accuracy of the measured relative abundances. All else equal, more reads is better because you reduce count noise in low abundance taxa and increase sensitivity to rare (~1 read) taxa. But here all else is not equal. By increasing maxEE dramatically you are keeping less accurate data. In practice, we usually see this leading to less accurate measurements of relative abundances at the end, because the low quality data that is being pushed through the pipeline just adds noise.

Shorter: The total read counts at the end don't matter, because everything is a relative abundance anyway. So your decisions on questions like this should be based on improving the accuracy of the relative abundance measurements, not on the total number of reads in the final table.

mentorwan commented 3 days ago

Understand the logic for both authors. Thanks so much for insightful comments to our question. We do get additional unique species from less accurate results. We are trying to see if they are useful or not. Maybe it is only false positive hits, but we will check.