Closed jdmagasin closed 4 years ago
Let's start here:
Can we use dada2 with regular PacBio CCS samples with minPredictedAccuracy = 0.7, in contrast to the HiFi samples in your 2018 paper with minPredictedAccuracy = 0.999? (In our case CCS were generated from raw reads by SMRT Link v 7.0.1.66975 using a minimum number of 2 passes.)
DADA2 makes a strong assumption that will determine whether your usage is valid, DADA2 assumes that a significant fraction of the reads have no errors (and thus the true seuquences will be seen multiple times). If that is the case in your data, you can use DADA2. If it isn't, you will need to use a lower resolution approach.
This is easily tested. What is the text output from a few typical samples of:
foo <- derepFastq(c(SAM1_FILE, SAM2_FILE, SAM3_FILE, ...), verbose=TRUE)
If the number of unique sequences found in these samples is less than the number of reads, you may be in business. If it isn’t, the DADA2 approach isn't going to work on this data.
Thank you Ben. For all of our samples the number of unique sequences is 56-74% less than the number of quality-filtered reads. These are the dereplication stats from our log file:
Sample | Total.filtered | Unique | % less |
---|---|---|---|
AMTN | 116119 | 44232 | 62 |
AMTS | 124023 | 43018 | 65 |
ARCTIC09 | 96974 | 26521 | 73 |
ARCTIC16 | 119601 | 39862 | 67 |
BAJA | 66123 | 29046 | 56 |
CORAL | 86229 | 22061 | 74 |
HOT311 | 135196 | 36000 | 73 |
JAPAN | 107629 | 46078 | 57 |
SCW | 56133 | 16642 | 70 |
SIO | 118840 | 44210 | 63 |
Those stats look good for running DADA2.
As for special filtering parameters, the ones you have picked out seem reasonable. With filtering, the goal is typically not finding an ultimate set of parameters, but just avoiding wrong filtering parameters.
As for the error learning, I again think what you are doing all sounds reasonable to me, and the large amount of agreement between the different error models is promising as well. I also think you may get marginally better results with just the 18S region -- DADA2 does better in terms of fitting its error models and approaching self-consistency in the absence of large amounts of indel and length variation. It works with ITS, but not as neatly, so the 18S only approach for the error modeling seems like a decent idea to me.
Dear Ben,
I wrote you few months ago for guidance on using PacBio for my research. I’m studying the intra-specific diversity of an uncultivated prymnesiophyte alga. I designed specific primers to amplify partial 18S+ ITS-1 + 5.8S + ITS-2+ partial 28S, which results in an amplicon size of about 1100 bps. We already got the sequencing data which looks very promising. I want to use DADA2 to process the sequencing data, so I and my colleague Jonathan are writing a dada2 script for our PacBio data. Our script is based on the R markdown used in your 2018 paper for the fecal data, and the big data example workflow. We have some questions in order to adapt our script to our sequencing data and would appreciate if you could give us your opinion:
Can we use dada2 with regular PacBio CCS samples with minPredictedAccuracy = 0.7, in contrast to the HiFi samples in your 2018 paper with minPredictedAccuracy = 0.999? (In our case CCS were generated from raw reads by SMRT Link v 7.0.1.66975 using a minimum number of 2 passes.)
Given the quality of our PacBio reads, do you have suggestions for special handling or parameters? Currently we are using the following parameters with filterAndTrim(): minQ=3, minLen=900, maxLen=1300, maxN=0, rm.phix=FALSE, maxEE=2.
Our reads have more biological variability (due to the ITS) than the 16S reads in the 2018 paper. Do you have suggestions for handling reads like ours? In particular, we found that learnErrors() did not complete when given all our reads (with nbases = 1E8; we pulled the plug after 137 CPU hours). To speed up error model creation we tried:
Thanks very much for your help!
Ana and Jonathan
Reads retained at each processing step
Error models based on sampling reads vs. partial 18S
ASV's found using error models in previous plot