benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Losing >30% of reads from the filtering to dada step, and from dada to merging #837

Closed: Jatbee32 closed this issue 5 years ago

Jatbee32 commented 5 years ago

Hi Ben,

I have a question about why I might be losing so many reads between the filtering step and the dada_f and dada_r steps. The samples are 2x250bp Illumina sequencing of the V4 region. In my pipeline, truncLen is set to (200,200), maxEE is set to (5,5) because previous pipelines lost many reads in the filtering step, and the merge length is set to 135. I also run derepFastq() prior to dada(), but I don't think that should make much of a difference to the overall read counts.

Looking at the summary file of the reads (which I am attaching here), in a couple of samples I lose over 60% of the reads between the filtering step and the dada() step. The rest of the samples look fine at this step. However, I am also losing anywhere from 32-84% of my reads during merging. I have triple-checked that the primers are not in the fastq files, so I'm not sure what is causing such a drop.

Any ideas as to why there are such drastic drops in reads?

Attachment: Summary_tab_filtration_scores.c.pdf
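To pin down where reads are lost, it can help to express each step as a fraction of the step before it. The sketch below uses base R only. The counts and sample names are made-up placeholders (not values from the attached summary), and the column names are assumptions modelled on the steps described above.

```r
# Hypothetical read counts per pipeline step (placeholder numbers, not real data).
track <- data.frame(
  filtered  = c(10000, 8000),
  denoisedF = c(9500, 3000),
  denoisedR = c(9400, 2800),
  merged    = c(6000, 1200),
  row.names = c("sampleA", "sampleB")
)

# Percentage of reads lost between two consecutive steps.
pct_lost <- function(before, after) round(100 * (before - after) / before, 1)

# A read pair can only merge if both directions survive denoising,
# so the smaller of the two denoised counts is used as the reference.
denoised <- pmin(track$denoisedF, track$denoisedR)

loss <- data.frame(
  filter_to_dada = pct_lost(track$filtered, denoised),
  dada_to_merge  = pct_lost(denoised, track$merged),
  row.names = rownames(track)
)
print(loss)
```

Samples with a high filter_to_dada value and samples with a high dada_to_merge value may point to different causes, so it is worth looking at the two columns separately.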

benjjneb commented 5 years ago

Could you post example denoised forward and reverse sequences? E.g. the output of:

dada2:::pfasta(getSequences(dadaFs[[1]])[1:5])
dada2:::pfasta(getSequences(dadaRs[[1]])[1:5])

Jatbee32 commented 5 years ago

Hi Ben,

Sorry for the delay. Here is the data you requested. The dataset is a coral microbiome dataset.

dada2::getSequences(dada_forward[[1]])[1:5]
[1] "TACGTAGGTGGCGAGCGTTGTCCGGAATTACTGGGTGTAAAGGGTGCGTAGGCGGGGATGCAAGTCAGATGTGAAAGACCGGGGCTCAACTCCGGGGCTGCATTTGAAACTGCAACTCTTGAGTGCAGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGAGATTAGGAGGAACACCAGTGGCGAAGGCG"
[2] "TACGGAGGATGCAAGCGTTATTCGGAATTATTGGGCGTAAAGGGTCTGTAGGTGGTTTTTTAAGTCTACTGTTAAATCTTAAGGCTTAACCTTAAAAAAGCGGTATGAAACTAAAAAGCTTGAGTTTAGTAGGGGTAGAGGGAATTCTCGGTGTAGTGGTGAAATGCGTAGAGATCGAGAAGAACACCGGTAGCGAAGGC"
[3] "CACGTAGAGGGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGGGTACGCAGGTGGGAGGAAAAGTCAAGTGTGAAAGGTATCGGCTTAACCGATACAGAGCAATTGAAACTATCCTTCTTGAGGGCAGGAGAGGAGAGCGGAATTCCTGGTGTAGCGGTGGAATGCGTAGATATCAGGAAGAACACCGGTGGCGAAGGCG"
[4] "ACTTAAGGGGTTGTTTTTTAGAAAAAGCAGAAAGCGTGTAAAGGTAAAACATTTAAAATAAATAGAATTTTTTTAGTAATGGTGCAATATTAAAAGAAAAAAAAGAATTTTTTTATGTGAAGACAATTTATTTTTTTTCTTTAACACGAAGGTTCTGGGAGCGAACAGGATTAGAAACCCTTGTAGTCCGGCTGACTGAC"
[5] "ACTTAAGGGTTTGTTTTTTAGAAAAAGCAGAAAGCGTGTAAAGGTAAAACATTTAAAATAAATAGAATTTTTTTAGTAATGGTGCAATATTAAAAGAAAAAAAAGAATTTTTTTATGTGAAGACAATTTATTTTTTTTCTTTAACACGAAGGTTCTGGGAGCGAACAGGATTAGAAACCCCAGTAGTCCGGCTGACTGAC"

dada2::getSequences(dada_reverse[[1]])[1:5]
[1] "CCTGTTCGCTCCCAGAACCTTCGTGTTAAAGAAAAAAAATAAATTGTCTTCACATAAAAAAATTCTTTTTTTTCTTTTAATATTGCACCATTACTAAAAAAATTCTATTTATTTTAAATGTTTTACCTTTACACGCTTTCTGCTTTTTCTAAAAAACAAACCCTTAAGTTTCACCGCGGCTGCTGGCACA"
[2] "CCTGTTCGCTCCCAGAACCTTCGTGTTAAAGAAAAAAAATAAATTGTCTTCACATAAAAAAATTCTTTTTTTTCTTTTAATATTGCACCATTACTAAAAAAATTCTATTTATTTTAAATGTTTTACCTTTACACGCTTTCTGCTTTTTCTAAAAAACAAACCCTTAAGTTTCACCGCGGCGGCTGGCACA"
[3] "CCTGTTCGCTCCCCACGCTTTCGTGCCTCAGCGTCAGTTACAGTCCAGAAAGCCGCCTTCGCCACTGGTGTTCCTCCTAATCTCTACGCATTTCACCGCTACACTAGGAATTCCGCTTTCCTCTCCTGCACTCAAGAGTTGCAGTTTCAAATGCAGCCCCGGAGTTGAGCCCCGGTCTTTCACATCTGAC"
[4] "CCCGTTTGCTCCCCCAGCTTTCGTACCTCAGCGTCAGTTGCAGGCCAGAGAGCCGCCTTCGCCACCGGTGTTCTTCCTGATATCTACGCATTCCACCGCTACACCAGGAATTCCGCTCTCCTCTCCTGCCCTCAAGAAGGATAGTTTCAATTGCTCTGTATCGGTTAAGCCGATACCTTTCACACTTGAC"
[5] "CCTATTTGCTCCCCTAGCTTTCGTCTCTCAGTGTCAGTTTTAGCCCAGTAGAGCGCCTTCGCTACCGGTGTTCTTCTCGATCTCTACGCATTTCACCACTACACCGAGAATTCCCTCTACCCCTACTAAACTCAAGCTTTTTAGTTTCATACCGCTTTTTTAAGGTTAAGCCTTAAGATTTAACAGTAGA"

benjjneb commented 5 years ago

My first concern is that a good chunk of these sequences are not bacterial according to BLAST against nt.

For example, the 4th and 5th forward sequences and the 1st and 2nd reverse sequences appear to be mitochondrial DNA, perhaps from a coral. The 2nd forward and 5th reverse sequences appear to be chloroplast DNA from green algae.

I can't see all your sequences, but this may indicate substantial amounts of off-target amplification in this data, which can cause problems and result in a lot of data loss at various steps. The amount of off-target amplification may also vary from sample to sample, which could explain that pattern as well.

Jatbee32 commented 5 years ago

Hi Ben,

I may not have been clear before, so I apologize for that. I also BLASTed those sequences, and they are very common sequences in our dataset. The 16S samples are from Orbicella franksi (a coral species), and although bacterial primers are used, they are known to pick up chloroplast and mitochondrial sequences, along with sequences from the coral itself. This lab previously used QIIME for 16S analysis but wants to start using dada2 for future projects, which is why we are having some trouble troubleshooting. Could the fact that these samples are known to have this type of off-target amplification cause the loss of data at various steps? If so, how might I adjust the parameters to counteract the loss?

I am also attaching the quality profiles and the error reports. I did not attach them previously, but they might also be linked to some of the issues here. If there are issues with the quality, could that cause a loss of reads?

Attachments: Quality_Profile_Reverse_Filtered_Reads.pdf, Quality_Profile_Forward_Filtered_Reads.pdf, err_forward_reads.pdf, err_reverse_reads.pdf

benjjneb commented 5 years ago

Two issues seem to be going on. The first is the subset of samples where you are losing half or more of the reads at the denoising step. The second is that you are losing roughly half of the reads at merging in most samples.

The merging loss in particular is surprising if this really is V4 data, as you should have plenty of overlap. It could be explained by a lot of non-target amplification, though, as non-target amplicons could be longer or shorter than expected.
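The overlap argument can be made concrete with a little arithmetic. This sketch assumes the truncLen of (200,200) from the original post, assumes that the "merge length" of 135 corresponds to the minOverlap argument of mergePairs(), and uses 253 nt as a typical V4 amplicon length after primer removal. The 300 nt off-target length is purely illustrative and not measured from this dataset.

```r
# Overlap available for merging = forward length + reverse length - amplicon length.
overlap <- function(amplicon_len, trunc_f = 200, trunc_r = 200) {
  trunc_f + trunc_r - amplicon_len
}

# Assumption: the "merge length" of 135 means mergePairs(minOverlap = 135).
min_overlap <- 135

# Amplicon lengths in nt; the off-target value is hypothetical.
amplicons <- c(V4_typical = 253, off_target_example = 300)

ov <- overlap(amplicons)
can_merge <- ov >= min_overlap

print(data.frame(amplicon_len = amplicons, overlap = ov, can_merge = can_merge))

# Longest amplicon that still satisfies the overlap requirement.
max_len <- 200 + 200 - min_overlap
```

Under these assumptions, the overlap requirement leaves little headroom above the expected V4 length, so amplicons only slightly longer than the target would fail to merge. Whether that accounts for the losses seen here depends on the actual length distribution of the off-target amplicons.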

Can you try to run your workflow again with dada(..., pool="pseudo") and see if there is any impact on the fraction of reads making it through the pipeline?
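A minimal sketch of the suggested rerun, written as a wrapper so that it can be defined without the data at hand. The names denoise_pseudo, derep_list, and err_model are hypothetical; only dada() and its pool argument come from the suggestion above, and the inputs are assumed to come from derepFastq() and a previously learned error model.

```r
# Wrapper around dada2::dada() with pseudo-pooling enabled.
# derep_list: list of dereplicated samples (e.g. from derepFastq()).
# err_model:  error model learned beforehand (e.g. from learnErrors()).
denoise_pseudo <- function(derep_list, err_model) {
  dada2::dada(derep_list, err = err_model, pool = "pseudo", multithread = TRUE)
}

# Usage (assumes dada2 is installed and these input objects already exist):
# dada_forward <- denoise_pseudo(derep_forward, err_forward)
# dada_reverse <- denoise_pseudo(derep_reverse, err_reverse)
```

Comparing the per-sample read counts from this run against the original, independently processed run should show whether pseudo-pooling changes the fraction of reads retained at the denoising step.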

Also, if you could share a subset of your samples with me, that would also be helpful so I can take a closer look at what is going on.

benjjneb commented 5 years ago

Discussion moved offline.