benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

DADA2 on non-HiFi PacBio data with ITS regions #897

Closed jdmagasin closed 4 years ago

jdmagasin commented 4 years ago

Dear Ben,

I wrote to you a few months ago for guidance on using PacBio for my research. I’m studying the intra-specific diversity of an uncultivated prymnesiophyte alga. I designed specific primers to amplify partial 18S + ITS-1 + 5.8S + ITS-2 + partial 28S, which results in an amplicon size of about 1,100 bp. We have already received the sequencing data, which look very promising. I want to use DADA2 to process the data, so my colleague Jonathan and I are writing a dada2 script for our PacBio data. Our script is based on the R Markdown used in your 2018 paper for the fecal data, and on the big data example workflow. We have some questions about adapting our script to our sequencing data and would appreciate it if you could give us your opinion:

  1. Can we use dada2 with regular PacBio CCS reads generated with minPredictedAccuracy = 0.7, in contrast to the HiFi reads in your 2018 paper, which used minPredictedAccuracy = 0.999? (In our case the CCS reads were generated from the raw reads by SMRT Link v7.0.1.66975 using a minimum of 2 passes.)

  2. Given the quality of our PacBio reads, do you have suggestions for special handling or parameters? Currently we are using the following parameters with filterAndTrim(): minQ=3, minLen=900, maxLen=1300, maxN=0, rm.phix=FALSE, maxEE=2.

  3. Our reads have more biological variability (due to the ITS) than the 16S reads in the 2018 paper. Do you have suggestions for handling reads like ours? In particular, we found that learnErrors() did not complete when given all our reads (with nbases = 1E8; we pulled the plug after 137 CPU hours). To speed up error model creation we tried:

    1. Using random subsets of the reads from each sample: N = {1K, 2K, 3K, 6K, 10K, 15K}. This solved the performance issue, and it looks like the error models improve only slightly (lower curves) beyond 6K reads. Our curves are higher than those from the 2018 fecal data, perhaps because our data are not HiFi?
    2. Then we realized that the ITS regions might be causing the performance problems, and that it makes more sense for learnErrors() not to see the ITS regions at all, so that it cannot mistake biological variation for sequencing errors. So we tried feeding learnErrors() only the partial 18S (the first 292 nt of our reads, after primer removal and quality filtering), and in this case we halved BAND_SIZE to 16; our filtering and 18S-prefix error-learning commands are sketched after this list. Error model creation is now fast (78 s). The models improved compared to the random-subset approach, at least for bases with quality > ~85, though there is a curious hump around quality 70 (please see the plot below). Do you think this approach is correct? We get roughly the same ASVs across error models (Venn diagram below).
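
For reference, here is a rough sketch of the filtering and 18S-prefix error-learning steps described in items 2 and 3.2 above. The file-path variables (fastqPrimerTrimmed, filtFull, filt18S) are placeholders, and we assume the PacBioErrfun error-estimation function from the workflow accompanying your paper; the parameter values are the ones listed above.

library(dada2)

# Quality filtering of the full ~1,100 bp amplicons (item 2 above).
out <- filterAndTrim(fastqPrimerTrimmed, filtFull,
                     minQ=3, minLen=900, maxLen=1300,
                     maxN=0, rm.phix=FALSE, maxEE=2,
                     multithread=TRUE)

# Keep only the first 292 nt (partial 18S) of the filtered reads for error learning (item 3.2).
filterAndTrim(filtFull, filt18S, truncLen=292, multithread=TRUE)

# Learn the error model on the 18S prefix only, with BAND_SIZE halved to 16.
err18S <- learnErrors(filt18S, errorEstimationFunction=PacBioErrfun,
                      BAND_SIZE=16, multithread=TRUE)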

Thanks very much for your help!

Ana and Jonathan

[Figure: Reads retained at each processing step]

[Figure: Error models based on sampled reads vs. partial 18S only]

[Figure: ASVs found using the error models in the previous plot]
benjjneb commented 4 years ago

Let's start here:

Can we use dada2 with regular PacBio CCS reads generated with minPredictedAccuracy = 0.7, in contrast to the HiFi reads in your 2018 paper, which used minPredictedAccuracy = 0.999? (In our case the CCS reads were generated from the raw reads by SMRT Link v7.0.1.66975 using a minimum of 2 passes.)

DADA2 makes a strong assumption that will determine whether your usage is valid: it assumes that a significant fraction of the reads have no errors (and thus that the true sequences will be seen multiple times). If that is the case in your data, you can use DADA2. If it isn't, you will need to use a lower-resolution approach.

This is easily tested. What is the text output from a few typical samples of:

foo <- derepFastq(c(SAM1_FILE, SAM2_FILE, SAM3_FILE, ...), verbose=TRUE)

If the number of unique sequences found in these samples is substantially less than the number of reads, you may be in business. If it isn't, the DADA2 approach isn't going to work on this data.
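
If it helps, a quick per-sample summary can be pulled out of the derep objects with something like this sketch, where foo is the list returned by the derepFastq call above:

# Fraction of filtered reads that are unique sequences in each sample;
# each derep-class object stores the abundances of its unique sequences in $uniques.
sapply(foo, function(drp) length(drp$uniques) / sum(drp$uniques))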

jdmagasin commented 4 years ago

Thank you Ben. For all of our samples the number of unique sequences is 56-74% less than the number of quality-filtered reads. These are the dereplication stats from our log file:

Sample    Total.filtered  Unique  % less
AMTN              116119   44232      62
AMTS              124023   43018      65
ARCTIC09           96974   26521      73
ARCTIC16          119601   39862      67
BAJA               66123   29046      56
CORAL              86229   22061      74
HOT311            135196   36000      73
JAPAN             107629   46078      57
SCW                56133   16642      70
SIO               118840   44210      63
benjjneb commented 4 years ago

Those stats look good for running DADA2.

As for special filtering parameters, the ones you have picked out seem reasonable. With filtering, the goal is typically not to find an optimal set of parameters, but simply to avoid clearly wrong ones.

As for the error learning, what you are doing sounds reasonable to me, and the large degree of agreement between the different error models is promising as well. I also think you may get marginally better results with just the 18S region: DADA2 does better at fitting its error models and reaching self-consistency in the absence of large amounts of indel and length variation. It works with ITS, just not as neatly, so the 18S-only approach for the error modeling seems like a decent idea to me.
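
Concretely, a minimal sketch of that pattern might look like the following. Object names are placeholders: err18S would be the error model learned from your 18S-truncated reads (your item 3.2) and filtsFull the full-length filtered reads; BAND_SIZE=32 just mirrors the full-length PacBio workflow and is a suggestion, not a requirement.

library(dada2)

# Denoise the full-length (18S + ITS + 28S) reads using the error model
# that was learned on the 18S prefix only.
drpFull <- derepFastq(filtsFull)
ddFull <- dada(drpFull, err=err18S, BAND_SIZE=32, multithread=TRUE)
seqtab <- makeSequenceTable(ddFull)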