benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0
460 stars 142 forks source link

Overwriting the default error rates estimation for Dada2 + Pacbio pipeline #578

Closed pneumowidow closed 5 years ago

pneumowidow commented 5 years ago

Hello, I've been trying to use the DADA2 pipeline to resolve ASVs in my PACBIO reads. In brief, my region of interest is ~1.3kb and highly polymorphic among my strains ranging from 3 to 17 SNPs. Within the different mock communities of strain mixtures I prepared to assess the sensitivity of DADA2 in calling these strains, I realized that some of the ASVs were not always 100% accurate. For example, I would get the exact number of strains from the clustering output, however on inspecting the sequences, some loci would be incorrectly called (mostly within the least representative sequence variant). For example, in one strain mixture of 1:8, I get an output of 2 ASVs. The first cluster is 100% identical to the reference sequence, but the second is only 14% identical to it's reference. As you can see in the snapshot below the second ASV cluster only differs from the 1st ASV cluster by 2 (under birth_ham), which is absolutely wrong. The hamming distance should be 14.

dada2_cluster_result

So, I was wondering maybe it has to do with the error estimation. I looked at my error estimation plot and the one on the tutorial page and realized that not all my observed error rates (black points) fit well with the fitted error rates (black lines). You can see that for 3 out of 4 of the plots on each row, some or all of my observed error rates are way below the estimated line. So, I was wondering if this wrong estimation could be the reason I'm getting less accurate ASV results and if so, how could I change it?

pacbio_errors

Thank you!!!

benjjneb commented 5 years ago

Thanks for the report, it is very useful to hear this kind of feedback when we add support for a new technology.

Can you clarify a few details? What sequencing instrument (e.g. Sequel) and chemistry version are you using? What thresholds did you enforce at the ccs step (e.g. minPredictedAccuracy)? What exactly are these mixtures you are considering, i.e. it's an artificial mixture of two strains at a range of relative ratios?

My first reaction is that I would be a little concerned here about the error rate esimates because you are doing it on relatively little data. If you have multiple samples from the same run/cell, it's probably better to learn the error rates over them all together to up the statistical support for the learned error rates. You can do with with the learnErrors command.

I'd also be a little concerned that the high concentration of Q=93 values in CCS data is distorting the learned error rates at Q~90 to be too low. This is why we added a dedicates function for pacbio data, PacBioErrfun and recommend using that instead of the default for learning the error rates. Have you tried that? The PacBio specific workflow can be seen at the reproducible workflows for the preprint. See the "Analysis of..." links here: https://github.com/benjjneb/LRASManuscript

pneumowidow commented 5 years ago

Dear Dr, Callahan, Thank you so much for the quick and detailed response to my questions.

We used the sequel instrument for sequencing together with the P2-C2/5.0 chemistry. I literally followed the exact steps you used for your DADA2 + PacBio: ZymoBIOMICS tutorial from CCS of 99.9% accuracy generation up to the clustering steps. The only difference was that in my CCS generation step, I didn't include a minimum/maximum length threshold; this threshold was later set in the filtering steps where I set minimum length to 1200, the maximum length to 1300 and the maximum errors to 1 (as seen below). track <- fastqFilter(nop, filt, minQ=3, minLen=1200, maxLen=1300, maxN=0, rm.phix=FALSE, maxEE=1, verbose=TRUE) I set the length filter as such because my region of interest (without primers) is only 1268 bp long. I reduced the maxEE to 1 (basically making it more conservative than your default settings of 2) because the SNPs between strains in my genomic region of interest could also be as low as 2. Moreover, when I first tried your default settings with maxEE=2, I got double the number of ASVs I expected in certain communities, but once I changed it to maxEE=1, the ASV output numbers were more accurate.

The mixtures I'm considering are artificial mixtures of (yes) relative ratios. I basically took in-house strains of DNA concentration 0.5 ng/ul to 0.7 ng/ul and mixed them in different ratios (1:8, 1:9, 1:49, 1:3:5, 1:99) to determine the sensitivity of dada2 in calling each variant within a mixed community. Some results have been clear, accurate and precise. Others, like the sample with the 1:8 mixture (detailed the last post) haven't been so easy to interpret.

You're right about my data being relatively small in each sample and your suggestion about using multiple samples to learn the errors is definitely sound. I'll try that now. Also, your comment about the distortion on learned error rates based on higher concentration of CCS values (in my case with Q=99.9) also makes sense. Do you think I should also generate another set of CCS reads with predicted accuracy threshold of 90% and compare the error estimation result of all my samples in this dataset to that of the CCS reads with >99.9% accuracy? I'm going to try both and let you know my results.

Many thanks again!!!

pneumowidow commented 5 years ago

Hello again,

So I followed your suggestions above of including multiple samples (basically all the mixed samples in my run, which was 11) to estimate the errors. I used two different datasets: (i) one dataset generated with CCS accuracy threshold of 90% and; (ii) generated with CCS accuracy threshold of 99%. I followed again the same steps you did in the Dada2 + Pacbio Zymobiomics tutorial except in the code below:

In removing primers, I wanted an exact match, so I changed the max.mismatch argument to 0

for(i in seq_along(fn)) {
  dada2:::removePrimers(fn[[i]], nop[[i]], primer.fwd=F_ply, primer.rev=dada2:::rc(R_ply), max.mismatch=0, orient=TRUE, verbose=TRUE)
}

Results With CCS dataset of 99.9% accuracy for 11 samples, I got the plot below, which is a huge improvement from the plot I generated for a single sample above in my previous post last week. However, there still doesn't seem to be a good fit between my observed and model error rates for the transitions, G2C, C2G and A2T. The fit of error rates for T2A is not great but acceptable.

image

With CCS dataset of 90% accuracy for 11 samples, I got the plot below, which is only slightly better than the plot above for CCS dataset with 99.9% accuracy. I get maybe 3 or 4 more black points to fit the model error rates for transitions, G2C, C2G and A2T, but it may as well be negligible since it doesn't significantly improve my fit. The fit of error rates for T2A is however, much better here.

image

Any other ideas for what I can do to improve the error rates for these specific transitions? Or do you think it has to do with my samples in general?

Thank you.

benjjneb commented 5 years ago

Those look good to me. What's the concern?

The points at the bottom of G2C, C2G and A2T are Q scores for which that transition was never observed, and should just be ignored when considering the fit.

pneumowidow commented 5 years ago

Dear Ben,

The points at the bottom of G2C, C2G and A2T are Q scores for which that transition was never observed, and should just be ignored when considering the fit.

Understood! I assumed that the few number of observed error rates for these transitions (fitting the model) compared to other transitions meant that maybe the model was overestimating the error frequency for these transitions (mainly G2C & C2G) in my sample and probably distorting my ASV output later downstream. I guess I may not fully understand the error plot output. Can you just explain why even with less black dots (observed error rates) for G2C/C2G intersecting with the model compared to others transitions (say, G2A), you still consider the model for G2C & C2G as good? I mean, it just seems weird to me that all other transitions have so many points fitting with the model and then G2C/C2G transition doesn't. And if you compare my plots above to yours (from the Dada2+Pacbio Zymobiomics tutorial) below, you'll see that your subplots for the G2C &C2G are more acceptable. I mean the fit is generally better.

image

So, I'm sorry if it seems like a dumb question, but I may probably have to explain this pipeline to medical and less scientific people and the subplots for the aforementioned transitions just really stick out from the rest and that just doesn't seem normal to me for a good model fit.

Thank you again for your time. I really appreciate it.

benjjneb commented 5 years ago

The error-estimation procedure uses pseudo-counts to avoid under-estimating the error-rates when there were no errors of a particular type and quality score observed. It also is sharing observations across neighboring quality scores. So what you are seeing in the e.g. first G2C plot is the effect of those pseudo-counts. But that is intentional and good, and the very slight overestimate is much prefereable to the underestimate from taking zeros as is -- there will always be an error eventually in each category but sometimes there just isn't enough sequencing depth to find it in a particular dataset.

The Zymo dataset you are looking at has less of those zeros, perhaps because it is a bit deeper, but the fitted error models still look very similar to the ones you are getting.

pneumowidow commented 5 years ago

Dear Ben,

Many thanks for explaining that! I was beginning to think that my sequencing depth may have had an influence and I guess you may be right. I think I will repeat the sequencing with higher depth and see from there.

Thanks again