benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Failing to merge (16S v5-v7) #1108

Closed Claudioatb closed 3 years ago

Claudioatb commented 4 years ago

Dear Benjamin,

I have been struggling with the analysis of my data (more specifically, the forward and reverse sequences are failing to merge), and since I can't find a solution to my problem, I have to ask for some help! Sorry for the bother, and thank you very much in advance for your time!

Relevant information:

- Gene sequenced: 16S (v5-v7 region)
- Platform: MiSeq (2x300)
- Primers: 799F2 + 1193R
- Primer removal: Cutadapt
- Software: R 3.6.1

Overlapping

With this primer pair, we should get an amplicon of ~394 bp (1193 - 799). Since we have 2x300 sequencing, we should have an overlapping region of 206 bp (600 - 394). With these values, and looking at the quality plots, I decided to truncate the sequences at position 250 for the forward reads and at position 200 for the reverse reads, roughly keeping the quality scores above 30. With these trimming values, I should end up with a total of 56 nucleotides overlapping (250 + 200 - 394), which should be plenty for dada2 to merge.
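For reference, the arithmetic can be written out in a couple of lines of R (a minimal sketch of the calculation above; the ~394 bp sequenced amplicon length and mergePairs' default 12 nt minOverlap are the assumptions):

amplicon_len <- 394                               # expected sequenced amplicon length (799F2-1193R), per the estimate above
truncLen_f <- 250                                 # forward truncation position
truncLen_r <- 200                                 # reverse truncation position
overlap <- truncLen_f + truncLen_r - amplicon_len
overlap                                           # 56 nt, well above mergePairs' default minOverlap of 12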

Summary table

Anyway, I am failing to merge most of the sequences, as you can see in the following summary table:

         dada2_input | filtered | dada_f | dada_r | merged | nonchim | final_perc_reads_retained
a112-1 | 90065       | 63944    | 50133  | 59608  | 23194  | 19402   | 21.5
a112-2 | 98481       | 69092    | 56539  | 65055  | 29811  | 22493   | 22.8

So far I have tried:

None of the steps seemed to solve the issue, so I have no clue where to proceed from here.

Do you have any idea on how to proceed from here?

benjjneb commented 4 years ago

Are the primers sequenced? That is, are the primers present on the forward and reverse reads? If so, they need to be removed, and it may also affect the calculation of the amplicon length (the relevant length here is the "sequenced" amplicon length, which will be longer by the total FWD and REV primer lengths if they were sequenced).

An easy way to check is to simply inspect the start of the forward reads in one sample, and compare them to the forward primer.
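(For anyone following along, a minimal sketch of such a check, assuming the ShortRead package is available and fnFs[1] points to one forward fastq file; this is illustrative, not a command from the thread.)

library(ShortRead)
fq <- readFastq(fnFs[1])                          # read one forward fastq file
head(sread(fq), 10)                               # look at the first 10 forward reads
# Compare the starts of these reads to the forward primer, e.g. 799F2: AACMGGATTAGATACCCGG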

Claudioatb commented 4 years ago

Yes, the primers were sequenced and they are present at the start of the reads. I used cutadapt to remove them, but I ran into the following issue:

#1060

Also, I have tried using trimLeft instead of cutadapt, and the result is about the same:

        input   filtered  denoisedF  denoisedR  merged  nonchim
a112-1   90356     60945      48933      56855   18424    15861
a112-2  102029     66085      54972      62198   24843    19453

benjjneb commented 4 years ago

What is the exact filterAndTrim command you are using in the no-cutadapt scenario?

Claudioatb commented 4 years ago

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(270,220), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, trimLeft=20, compress=TRUE, multithread=FALSE)

benjjneb commented 4 years ago

Another possibility is that there is a significant amount of non-target amplification here. A quick way to test that is to look at taxonomic assignments for the top 10 ASVs from the forward and reverse reads (independently) in a sample. Basically:

taxF <- assignTaxonomy(getSequences(ddF[[1]])[1:10], "path/to/reference.fasta")
taxR <- assignTaxonomy(getSequences(ddR[[1]])[1:10], "path/to/reference.fasta")
unname(taxF)
unname(taxR)

Does that look as expected?

Claudioatb commented 4 years ago

For the forward reads, it looks fine, but for the reverse, there's no taxonomic attribution!!

Here's the taxF output

      [,1]       [,2]               [,3]                  [,4]
 [1,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Steroidobacterales"
 [2,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Steroidobacterales"
 [3,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Steroidobacterales"
 [4,] "Bacteria" "Desulfobacterota" "Desulfobacteria"     "Desulfobacterales"
 [5,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "SZB50"
 [6,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" NA
 [7,] "Bacteria" "Proteobacteria"   "Gammaproteobacteria" "Burkholderiales"
 [8,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Rhizobiales"
 [9,] "Bacteria" "Actinobacteriota" "Acidimicrobiia"      "Microtrichales"
[10,] "Bacteria" "Proteobacteria"   "Alphaproteobacteria" "Kiloniellales"
      [,5]                 [,6]
 [1,] "Woeseiaceae"        "Woeseia"
 [2,] "Woeseiaceae"        "Woeseia"
 [3,] "Woeseiaceae"        "Woeseia"
 [4,] "Desulfosarcinaceae" "Desulfatibacillum"
 [5,] NA                   NA
 [6,] NA                   NA
 [7,] "Burkholderiaceae"   "Ralstonia"
 [8,] "Rhizobiaceae"       "Cohaesibacter"
 [9,] "Microtrichaceae"    "Sva0996_marine_group"
[10,] "Fodinicurvataceae"  "Fodinicurvata"

Whereas the taxR output returns basically only NA's:

      [,1]        [,2] [,3] [,4] [,5] [,6]
 [1,] "Eukaryota" NA   NA   NA   NA   NA
 [2,] "Eukaryota" NA   NA   NA   NA   NA
 [3,] NA          NA   NA   NA   NA   NA
 [4,] "Eukaryota" NA   NA   NA   NA   NA
 [5,] "Eukaryota" NA   NA   NA   NA   NA
 [6,] "Eukaryota" NA   NA   NA   NA   NA
 [7,] NA          NA   NA   NA   NA   NA
 [8,] NA          NA   NA   NA   NA   NA
 [9,] NA          NA   NA   NA   NA   NA
[10,] NA          NA   NA   NA   NA   NA

Anyway, so far I hadn't suspected there could be something wrong with the reverse sequences: I did manually check them, looked for the primers, found them, removed them from where they should have been removed, and the quality plots looked fine to me (even though I am new to bioinformatics). However, when I BLAST random sequences from the reverse files (maybe 50 in total), they all return "uncultured bacterium".

I am not sure what this means!

benjjneb commented 4 years ago

Hm... the Blast results in particular seem troubling. Can you post the output of:

dada2:::pfasta(getSequences(ddR[[1]])[1:10])

Claudioatb commented 4 years ago

Just to make sure, the object "ddR" is the denoised sequences (from the dada object), right?

If so, here's the output of the line above:

1  TGTCACCGGCAGTCTCCCTAGAGTTCCCACCCGAAGTGCTGGCAACTAGGGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCACTCAATTCCCGAAGGCACCAAGTTATCTCTAACAAGTTCTGAGGATGTCAAGGGCAGGTAAGGTTCTTCGCGTTGCATCGAA
2  GAGTTGACCCCGGCAGTCTCCTACGAGTCCCCACCATTACGTGCTGGCAACATAGGACGAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCACCACCTGTGTACCGCCCCGAAGGACACCATATCTCTATGGCTTTTCGGTACATGTCAAACCCAGGTAAGGTTCTTCGCGTTGCCTCG
3  TGTCACCGGCAGTCTCCCTAGAGTTCCCACCCGAAGTGCTGGCAACTAAGGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTTTTCCAGGCCGAAGCACCCCCGTATCTCTACAGGGTTCTGTCAATGTCAAGGGCTGGTAAGGTTCTTCGCGTTGCATCGAATTA
4  CCGTTGACCGGGGCAGTATCTTTAGAGTCCCCACCCGAAGTGCTGGTAACTAAAGATAGGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCATCGGGCTCCCCGAAGGGCACTTTCTACTTTCATAGAAATTCCCGAGATGTCAAACCCAGGTAAGGTTCTGCGCGTTGC
5  TGTCACCGGCAGTCTCCCTAGAGTTCCCACCATTACGTGCTGGCAACTAAGGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTTTTCCAGGCCGAAGCACTCCCGTATCTCTACAGGATTCTGTCAATGTCAAGGGCAGGTAAGGTTCTTCGCGTTGCATCGAATT
6  TGTCACCGGCAGTCTCTCTAGAGTGCCCTTTCGTAGCAACTAGAGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTGTCCACTTTCTCTTTCGAGCACCTAATGCATCTCTGCTTCGTTAGTGGCATGTCAAGGGTAGGTAAGGTTTTTCGCGTTGCATCGAATTA
7  TGTCACCGGCAGTCTCCCTAGAGTTCCCACCCGAAGTGCTGGCAACTAGGGACAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTTTTCCAGGCCGAAGCACCCCCGTATCTCTACAGGGTTCTGTCAATGTCAAGGGCAGGTAAGGTTCTTCGCGTTGCATCGAATTA
8  TGTCACCGGCAGTCTCCTTAGAGTTCCCACCATAATGTGCTGGCAAATAAGGACAAGGGTTGCGCTCGTTACGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCACTGCGTTCCCGAAGGCACCAATCTATCTCTAGAAAGTTCGCAGGATGTCAAGGGGTGGTAAGGTTCTTCGCGTTGCATCGA
9  GAGTTGACCCCGGCAGTCTCCTACGAGTCCCCAACATAACTTGCTGGCAACATAGGATAAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCACCACCTGTGTAAGGCCCCGAAGGACACCCTATCTCTAGAGCTTTTCCCTACATGTCAAACCCAGGTAAGGTTCTTCGCGTTGCATCG
10 TGTCACCGGCAGTCTCCCTAGAGTTCCCGGCCGAACCGCTGGCAACTAGGGACAAGGGTTGCGCTCGTTACGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAGCACCTGTCACCGCGTTCCCGAAGGCACCAATCCATCTCTGGAAAGTTCGCGGGATGTCAAGGGCTGGTAAGGTTCTTCGCGTTGCATCGAA

jgrzadziel commented 4 years ago

Hey @Claudioatb and @benjjneb, I have a similar issue (not always, but sometimes) using the V3/V4 region.

The code for trimming:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen = , trimLeft = c(20,20), trimRight = c(30,60), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=T, verbose = TRUE)

After that step, about 75-80% of reads pass the filter, but after forward-reverse merging, only 20-30% are successfully merged. Taking the fastq files from the filtered folder and merging them with, for example, BBMerge or another tool, 99% of reads are successfully merged. I don't know where these differences come from...

I'm using the same settings in DADA2 as in the abovementioned tools (the same minimum overlap and maximum differences).
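(For reference, dada2's merge criteria can be made explicit when calling mergePairs; the values below are simply its defaults, and the object names ddF/ddR/filtFs/filtRs are assumed from the usual workflow rather than taken from this thread.)

mm <- mergePairs(ddF, filtFs, ddR, filtRs, minOverlap=12, maxMismatch=0, verbose=TRUE)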

dada2 version - 1.14

update: I have uploaded two files (fwd and rev) of one sample that caused the merging problem:

https://app.box.com/s/5hdjg0fvliqkjxrcivvo9vqfebkc5n2y

benjjneb commented 4 years ago

update: I have uploaded two files (fwd and rev) of one sample that caused the merging problem

Thanks! Can you clarify, are these pre-filtering or post-filtering?

benjjneb commented 4 years ago

@jgrzadziel One more question: in your filterAndTrim command above, there appears to be an unfinished truncLen= flag. Can you double-check and perhaps repost the exact command you are executing?

jgrzadziel commented 4 years ago

@benjjneb You are right, the exact command is as follows:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft = c(20,20), trimRight = c(30,60), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=T, verbose = TRUE)

And these files are pre-filtering, straight after demultiplexing by the sequencing provider.

benjjneb commented 4 years ago

@jgrzadziel I ran the following commands on the provided example files, and got a much higher fraction of merged reads:

library(dada2); packageVersion("dada2")

[1] ‘1.17.3’

setwd("~/Desktop/merge")
fnF <- "1K3-16S_S27_L001_R1_001.fastq.gz"
fnR <- "1K3-16S_S27_L001_R2_001.fastq.gz"

filtF <- file.path("filtered", fnF)
filtR <- file.path("filtered", fnR)

out <- filterAndTrim(fnF, filtF, fnR, filtR, 
                     trimLeft = c(20,20), trimRight = c(30,60), 
                     maxN=0, maxEE=2, truncQ=2, rm.phix=TRUE, 
                     compress=TRUE, verbose = 0)

errF <- learnErrors(filtF, multi=TRUE)
errR <- learnErrors(filtR, multi=TRUE)
ddF <- dada(filtF, errF, multi=TRUE, verbose=0)
ddR <- dada(filtR, errR, multi=TRUE, verbose=0)

mm <- mergePairs(ddF, filtF, ddR, filtR, verbose=TRUE)

92767 paired-reads (in 2666 unique pairings) successfully merged out of 109525 (in 8771 pairings) input.

So, not an amazing merge percentage (which may suggest a complex community where some rare variants are failing to be detected in the lower-quality reverse reads), but it's about 85% being merged successfully.

jgrzadziel commented 4 years ago

@benjjneb thanks, you are right, when I run the same commands only on these two files (one sample) I get similar results. The problem arises when I run exactly the same commands on all the files I have in one project (27 samples); then, instead of 85%, I get 30-50% successfully merged reads (edit: I have looked a bit closer, and 85% is the percentage of merged reads out of the post-filtered data, but relative to the pre-filtered data it is also about 55-56%).

--edit:

So after a couple of tests, the results are always the same. No matter whether I use version 1.14 or 1.16, with or without the derep step, a single sample or a bunch of samples, I always finish with ~85% merged reads (merged vs post-filtered) or ~55% (merged vs pre-filtered). So it looks like the quality of that sequencing run is simply not particularly high. However, as I mentioned before, using other tools I get 98-99% merged reads (merged vs post-filtered) with almost the same settings.

Claudioatb commented 4 years ago

When I run the same commands, I continue to get a lot of reads that fail to merge:

         input  filtered  denoisedF  denoisedR  merged
d312-1  102620     69709      56789      63699   22622

@benjjneb if you don't mind, could you please take a look at the two files I uploaded? They're fwd and rev of the same sample:

GC-PK-8834-a112-1_S139_L001_R1_001.fastq.gz GC-PK-8834-a112-1_S139_L001_R2_001.fastq.gz

Thanks a lot in advance.

Cláudio

benjjneb commented 4 years ago

@Claudioatb I'm getting a bit confused with the multiple people in this thread, so apologies if you already did this...

Thank you for providing these files, but can you inspect the dependence on truncLen first? That is, can you post the read stats for your current filterAndTrim parameters, and then post the same stats after increasing truncLen by a net of, say, 40 bases? That would help narrow down whether this is just a too-short-to-merge issue, or something else.
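(A minimal sketch of how such read stats can be assembled, following the standard dada2 tutorial pattern; it assumes multi-sample out/dadaFs/dadaRs/mergers objects exist for each truncLen setting being compared.)

getN <- function(x) sum(getUniques(x))            # total reads in a dada or merger object
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged")
track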

Claudioatb commented 4 years ago

@benjjneb no worries, I did try increasing the filterAndTrim parameters by up to 30 bases, and still no luck. Here it goes:

truncLen=c(240,190), maxEE=c(2,2)

  dada2_input filtered dada_f dada_r merged nonchim % retained
a112-1 90065 62110 51665 58290 21860 18291 20.3
a112-2 98481 67318 57683 63851 28313 21459 21.8

truncLen=c(260,200), maxEE=c(2,2)

  dada2_input filtered dada_f dada_r merged nonchim % retained
a112-1 90065 59840 45979 55785 17542 15791 17.5
a112-2 98481 64660 51821 60912 24345 18940 19.2

Anyway, I will now increase truncLen by the remaining 10 bases and update you as soon as I have the read stats.

benjjneb commented 4 years ago

First impression: there is a low number of repeated observations of any given sequence in this data. What environment are you sampling here? And do you or your lab have previous experience with this PCR/primer design, especially with respect to the associated error rates?

Claudioatb commented 4 years ago

Hi @benjjneb ,

Sorry for the delay in getting back to you. Our samples originate from coral reef sediments and previous results from another site on the Great Barrier Reef, using the same primers and PCR design but an OTU-based analysis in Qiime, suggested the presence of >3000 OTUs in a set of three samples (each triplicated). We chose this primer set in order to minimize chloroplast amplification, and they are also commonly used by other labs. Also, we did not seem to have problems in merging forward and reverse sequences then.

I am not sure I understand what error rates you are referring to. Could you please explain?

Cláudio

benjjneb commented 4 years ago

@Claudioatb

I don't have a definitive answer, but I think a couple things are going on. In short, I think a mix of artificial technical variability (perhaps introduced by cutadapt) and the very high diversity of these samples is making it relatively difficult for DADA2 to identify the rare members of these samples.

First, as background, DADA2 relies on repeated observations of the same sequence in order to infer its presence. Thus, at least under default settings, there is a sensitivity limit of at least 2 error-free reads from an ASV in order to identify it. If the data is too hyper-diverse this approach will start to fail, in the sense that only the most abundant community members will be identified. In your data, there is a high level of read-level diversity, as indicated by a number of unique sequences nearly on the order of the number of reads in your filtered data:

> drpF
derep-class: R object describing dereplicated sequencing reads
$uniques: 62727 reads in 47118 unique sequences
  Sequence lengths: min=201, median=201, max=201
> drpR
derep-class: R object describing dereplicated sequencing reads
$uniques: 62727 reads in 50187 unique sequences
  Sequence lengths: min=202, median=202, max=202
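(For context, a sketch of how this kind of summary can be generated, assuming filtF/filtR are the filtered fastq files from the commands earlier in this thread.)

drpF <- derepFastq(filtF, verbose=TRUE)           # dereplicate the filtered forward reads
drpR <- derepFastq(filtR, verbose=TRUE)
drpF; drpR                                        # printing shows total reads vs. unique sequences
length(drpF$uniques) / sum(drpF$uniques)          # ratio of unique sequences to reads; near 1 means mostly singletons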

This is probably the proximate cause of denoising that is less successful than desired, which is then causing a relatively low merge percentage (essentially, many rare variants aren't being merged because they don't exactly match after denoising).

But, what is the ultimate cause of this hyper-diversity of sequencing reads?

One contributor may be technical variability. For example, when inspecting the most abundant reads in the data, I also find length-shifted versions of those reads present, i.e. the exact same sequence but with the start position shifted one nucleotide forwards or backwards (code to see this):

sqF <- getSequences(drpF)                  # all unique forward sequences, most abundant first
ham <- nwhamming(sqF[[2]], sqF, vec=TRUE)  # alignment-based hamming distance to the 2nd most abundant unique
ii <- which(ham==0)                        # same sequence up to shifts
unname(drpF$uniques[ii])                   # abundances of those shift-identical variants
sapply(ii, function(i) nwalign(sqF[ii[1]], sqF[i], vec=TRUE))  # show the pairwise alignments

That certainly isn't helping things, and it's possible that this is being caused partly by cutadapt. One thing to try would be to remove the cutadapt step and just trim off a fixed number of bases at the start of the reads, corresponding to the primer lengths (but this only works if you have fixed-length primers and no heterogeneity spacers).
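(To illustrate that suggestion, a hedged sketch of trimming fixed primer lengths instead of running cutadapt; it assumes untrimmed fnFs/fnRs and the 19 nt / 18 nt 799F2/1193R primers, and the truncLen values are just placeholders carried over from earlier in the thread. Note that when both truncLen and trimLeft are set, the filtered reads end up truncLen minus trimLeft bases long.)

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft=c(19,18),           # 799F2 is 19 nt, 1193R is 18 nt
                     truncLen=c(270,220),         # placeholder truncation values from earlier in the thread
                     maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE)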

The other possible contributor, which I think is also likely, is that the "community" being sampled here has such hyper-diverse sequence content that much of the data is made up of sequence variants with just one read sampled from them. One way to try to deal with this is to run DADA2 in detect-singletons mode:

ddF.s <- dada(filtF, errF, DETECT_SINGLETONS=TRUE, multi=TRUE)
ddR.s <- dada(filtR, errR, DETECT_SINGLETONS=TRUE, multi=TRUE)

This did help, but only marginally -- I got about a 10% increase in merged sequences that way, but it was still <50% overall.

Claudioatb commented 4 years ago

Dear @benjjneb ,

Thank you for your answer. I will look carefully at the suggestions you made and get back to you in a few days (as I am currently analyzing other ITS datasets).

I am not aware of heterogeneity spacers being used with my data, but I will double-check this again anyway. The primers we used were 799F2 (5′-AACMGGATTAGATACCCGG-3′) and 1193R (5′-ACGTCATCCCCACCTTCC-3′). When I tried to trim the primers instead of using cutadapt (without much improvement in the % of merged reads), I used the following code:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(270,220), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, trimLeft=20, compress=TRUE, multithread=FALSE)

I wonder if I should use trimLeft=c(19,18) instead, as the numbers of nucleotides differ between the FWD and REV primers...

Anyway, out of curiosity, I did try to run the filtering step on my 16S dataset with the ITS-approach (not defining a fixed length with the truncLen argument and letting it truncate based on the quality scores). That way, I ended up with >50% of merged reads (I did not record these values, as this was just out of curiosity, so I might be overestimating the merging percentage). Anyway, do you think that this way dada2 was able to deal with the length-shifted versions of those reads you found?
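(For completeness, a sketch of the "ITS-approach" filtering described above, with no fixed truncLen; the parameter values are illustrative assumptions rather than the exact command used.)

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN=0, maxEE=c(2,2), truncQ=2,  # quality-based truncation only, no fixed truncLen
                     minLen=50,                       # drop reads that become too short (illustrative threshold)
                     rm.phix=TRUE, compress=TRUE, multithread=TRUE)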

Will get back to you after I do the tests you suggest. Thanks a lot for your help!

Cláudio