benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Getting low merged reads #764

[Closed] rahulnccs closed this issue 5 years ago

rahulnccs commented 5 years ago

Thanks for such a wonderful package. We have been using DADA2 in our lab since its release and have completed 5 projects to date. Lately, I have been dealing with one microbiome dataset and getting a very low number of merged reads. The data were generated with Illumina 2x300 chemistry and are of good quality. However, as the table below shows, despite having a good number of filtered reads, we get almost no merged sequences, and I am unable to figure out why. We tried an increased truncLen, still with no improvement in the output.

| sample | input | filtered | denoisedF | denoisedR | merged | nonchim |
| --- | --- | --- | --- | --- | --- | --- |
| outputAU001 | 99374 | 51075 | 50539 | 50556 | 1 | 1 |
| outputAU002 | 81643 | 43170 | 42750 | 42845 | 0 | 0 |
| outputAU003 | 103017 | 63250 | 62518 | 62886 | 55 | 54 |
| outputAU004 | 125042 | 57167 | 56967 | 56974 | 5 | 5 |
| outputAU005 | 84017 | 41303 | 41063 | 40934 | 3 | 3 |
| outputAU006 | 89357 | 910 | 696 | 896 | 0 | 0 |
| outputAU007 | 75070 | 34757 | 34357 | 34524 | 2 | 2 |
| outputAU008 | 106684 | 28745 | 28509 | 28476 | 0 | 0 |
| outputAU009 | 105474 | 53395 | 52988 | 53201 | 2 | 2 |
| outputAU010 | 94400 | 36068 | 35449 | 35825 | 14 | 14 |
| outputAU011 | 101182 | 54081 | 53891 | 53743 | 0 | 0 |
| outputAU012 | 36438 | 11385 | 11140 | 11083 | 0 | 0 |
| outputAU013 | 90418 | 44372 | 43910 | 43998 | 0 | 0 |
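(Aside: a tracking table like this follows the read-tracking pattern from the DADA2 tutorial; the sketch below assumes the standard workflow objects out, dadaFs, dadaRs, mergers, and seqtab.nochim.)

library(dada2)
# Count the reads surviving each step of the workflow
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")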

For your insight I am providing the quality profile plots and fastq files of some samples. Any help would be appreciated. outputAU001_S1_R1_001.fastq.gz outputAU001_S1_R2_001.fastq.gz outputAU006_S6_R1_001.fastq.gz outputAU006_S6_R2_001.fastq.gz

[Attached quality profile plots: forward_read_quality (after cutadapt), reverse_read_quality (after cutadapt)]

benjjneb commented 5 years ago

Could you also post the primer set used? And were the primers sequenced at the start of the reads?

rahulnccs commented 5 years ago

Thanks, @benjjneb, for the quick response. We targeted the V3-V5 region; the primer set is F357-R926. Yes, FastQC showed that the data had Nextera transposase sequence as adapter content. However, I trimmed the barcodes using cutadapt, and I presume that would have trimmed the adapter sequence as well.

benjjneb commented 5 years ago

I have not used this particular primer set, but I suspect the issue is that it is so long that your reads barely overlap. Going by the E. coli coordinates, 926 - 357 = 569 nts of length for this amplicon. The primers are sequenced, and potentially some adapter as well. Indeed, when I inspect alignments of the most abundant raw sequences I get this:

library(dada2)
fnF <- "~/Desktop/outputAU001_S1_R1_001.fastq.gz"
fnR <- "~/Desktop/outputAU001_S1_R2_001.fastq.gz"
# Dereplicate and extract the unique sequences, most abundant first
sqF <- getSequences(derepFastq(fnF))
sqR <- getSequences(derepFastq(fnR))
# Alignment-based hamming distances between the top 5 forward sequences
# and the reverse complements of the top 5 reverse sequences
unname(outer(sqF[1:5], rc(sqR[1:5]), nwhamming))
     [,1] [,2] [,3] [,4] [,5]
[1,]   17   16   16   18   16
[2,]   76   77   77    0   77
[3,]   79   80   80    0   80
[4,]   79   80   80    0   80
[5,]   77   78   78    0   78

# Align forward sequence 2 against the reverse complement of reverse
# sequence 4 (distance 0 in the matrix above)
nwalign(sqF[2], rc(sqR[4]))

[1] "CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTAGCGTGAAGGATGAAGGTCCTACGGATTGTAAACTTCTTTTATACGGGAATAAAGTATCCTACGTGTAGGATTTTGTATGTACCGTATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGACGGGAGATTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGG-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------" [2] "-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------CCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACCCTGGTAGTCCACACGGTAAACGATGAATACTCGCTGTTTGCGATATACGGCAAGCGGCCAAGCGAAAGCGTTAAGTATTCCACCTGGGGAGTACGCCGGCAACGGTGAAACTCAAATGAATTGACGG"

With apologies for the crappy formatting, what this shows is that these reads do overlap, but only by 24 nts. Thus, in order to process this data and expect merging later on, you basically cannot truncate any reads off the end.
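To quantify that overlap directly, one can count the alignment columns in which both strands have a (non-gap) base; a small sketch reusing the objects above:

aln <- nwalign(sqF[2], rc(sqR[4]))
# Positions covered by both the forward and the reverse read
sum(strsplit(aln[1], "")[[1]] != "-" & strsplit(aln[2], "")[[1]] != "-")
# gives 24 for the alignment shown above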

IMO, while this is pretty good quality 2x300 data, the qualities still get pretty low at the end of the reverse reads, so having to keep all of that will hurt your sensitivity to low-frequency variants. Thus, I would consider using the forward reads alone for this data.
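A forward-read-only run is just the standard workflow without the reverse reads and the merging step; a minimal sketch (paths and parameters are illustrative, not a recommendation for this exact dataset):

filtF <- file.path("filtered", basename(fnF))
filterAndTrim(fnF, filtF, truncLen = 280, maxEE = 2, multithread = TRUE)
errF <- learnErrors(filtF, multithread = TRUE)
ddF <- dada(filtF, err = errF, multithread = TRUE)
seqtab <- makeSequenceTable(ddF)  # no mergePairs step with forward reads only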

dhrpa commented 4 years ago

@benjjneb Hello, I am facing a similar issue. My read count drops drastically after the merging step.

The data are V3-V4, sequenced to create a 460 bp amplicon. Assuming the data were already demultiplexed, I did not use cutadapt or Trimmomatic and started directly with DADA2 (version 1.12). I am attaching my quality plots for your reference. I truncated at 220 (forward) and 180 (reverse).

qualityF_2.pdf qualityR_2.pdf

I can email you some fastq files if that will be helpful.

Any help would be appreciated.

Thank you.

benjjneb commented 4 years ago

@dhrpa

You truncated down to 220+180 = 400 total bps. You have a 460 bp amplicon. They don't overlap after that truncation.

dhrpa commented 4 years ago

@benjjneb Okay, but given how the quality declines toward the read ends, what would you suggest I do? Should I truncate less than that? How should I go about it?

benjjneb commented 4 years ago

From the tutorial:

Considerations for your own data: Your reads must still overlap after truncation in order to merge them later! The tutorial is using 2x250 V4 sequence data, so the forward and reverse reads almost completely overlap and our trimming can be completely guided by the quality scores. If you are using a less-overlapping primer set, like V1-V2 or V3-V4, your truncLen must be large enough to maintain 20 + biological.length.variation nucleotides of overlap between them.

So, you need a total truncLen of ~475 or higher, which means you can't truncate as much as you are doing. Ideally you'll pick forward and reverse truncation lengths that add up to, let's say, 480 to be safe, while removing as much of the low-quality read tails as possible. Typically the reverse reads are worse quality, so probably a higher forward truncation length than reverse.
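For a ~460 bp V3-V4 amplicon that might look like the following sketch (file variables are hypothetical; the exact lengths should be tuned to your own quality profiles):

# 270 + 210 = 480 nts total, leaving ~20 nts of overlap plus some slack
filterAndTrim(fnFs, filtFs, fnRs, filtRs,
              truncLen = c(270, 210), maxEE = c(2, 2), multithread = TRUE)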

dhrpa commented 4 years ago

@benjjneb Thank you very much, I will adjust the truncation lengths and try again.

natashaztarora commented 3 years ago

Dear Benjamin,

Thank you very much, we are also using DADA2 for our forensic microbiome pipeline.

We have done 2 x 300 Illumina sequencing for V1-V3 (F27/R534 primers) and V3-V5 (F357/R926). As in the first post by Rahul, the V3-V5 amplicon is so long that we have very few reads merging. You asked him whether primers were sequenced: in our case they are too. Is it possible NOT to sequence them? I am beginning to wonder why anyone (including ourselves) would want to do Illumina PE sequencing of V3-V5, unless one uses a different primer set or finds a way to get overlapping reads.

As you recommended in Rahul's case, I see that we will also have to use only the forward reads (which means we will probably only get V3 and partial V4). From the info in post #857 I understand that sequence length variability at V3-V4 is to be expected. In such a case, would it make sense to use minLen (e.g. below 100) instead of truncLen in filterAndTrim for the forward reads?

Thank you ever so much for all your help here, you are invaluable to the field!

Natasha

benjjneb commented 3 years ago

Is it possible NOT to sequence them?

It is; it just depends on how the initial library is set up for the sequencer. Other very common 16S protocols, e.g. the EMP V4 protocol, do not sequence the primers.

From the info in post #857 I understand that sequence length variability at V3-V4 is to be expected. In such a case, would it make sense to use minLen (eg below 100) instead of truncLen in the filterAndTrim for the forward reads?

You should still use truncLen here. There is no worry about the opposite primer showing up at the ends of the reads, as the V3-V5 region is always longer than 300 nts.
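As an illustration (hypothetical paths), a fixed truncLen on the forward reads trims the low-quality tails, and since filterAndTrim discards reads shorter than truncLen, it also removes anomalously short reads:

filterAndTrim(fnF, filtF, truncLen = 280, maxEE = 2)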

natashaztarora commented 3 years ago

Dear Ben,

I did not quite understand the issue with the opposite primer; I may have missed something here, since we want to use the forward reads only (for the V3-V5 region, given the merging issues). Most forward reads are ca. 284 bp, but some are shorter, e.g. 175 bp or 154 bp (below).

[Screenshot: forward read length distribution]

These forward reads probably cover V3 and also partly V4 regions.

From post #857 I gather that for V3-V4, biological length variation is to be expected (JoBrz had peaks at ca. 440 and 460 bp).

If we use truncLen won't we lose a lot of valuable sequence info? And where would we truncate anyway in such a case?

Thank you very much for your help!

benjjneb commented 3 years ago

@natashaztarora If the amplicon you are using is V3V5, then the amplicon should always be longer than 300 nts. Therefore, every forward read from that amplicon will contain only real DNA all the way out to the end of the read (discounting the forward primer at the start of the reads).

The question then is why you are seeing length variation in your reads at all.

What pre-processing was done on these reads between getting them off the sequencer (when presumably they were all 300 nts long) and when you are inspecting them now?

natashaztarora commented 1 year ago

Dear Ben, it took us a while to get back to you, but we have been doing quite a bit of forensics on this issue. To answer your question above: after obtaining the raw reads we ran cutadapt to remove the V3V5 primers (F357/R926) and any other sequences upstream of them. We ran blastn on some of the shorter sequences and found matches with human DNA. We then also used bwa to map the reads (for a few samples) to the human genome, again finding matches. A quick check of some of the matches turned up numerous short sequences, e.g. 157 bp, but I have not yet calculated how many are this short and how many are ca. 280 bp. I will do that and get back to you, in case there is a link.

Since I wrote to you we have been running all kinds of other tests, comparing the results of V1V3, V3V4, V4, and V4V5. We are in the process of establishing an SOP for microbiome forensics (different applications). To do this we have to assess which 16S rRNA gene region (and sequencing kit) is optimal (lower costs, more merged reads, etc.), but also what has been used so far in larger studies such as those of the HMP. Given the extensive data available for V1V3 and V3V5, it would be ideal if we could continue with one of these regions. However, for some reason, we get suboptimal results for V1V3 when doing MiSeq paired-end sequencing (2 x 300 bp) with the F27 and R534 primers: the qualities of the forward and reverse reads are such that we would want to truncate at 270 and 200 bp, but then we can no longer merge. So we would only be able to use the forward reads. I am unsure how others have managed to successfully sequence and merge V1V3 sequences.

Thank you very much for your help (and patience).

Natasha

benjjneb commented 1 year ago

Thanks for the update, Natasha.

Two quick comments: Off-target amplification can definitely produce length variation outside the expected range, so your report that you are seeing a lot of human DNA in the unexpected-length ASVs makes sense. Both direct depletion of known contaminating sources (like human) and restricting ASVs in the final table to a predefined length range are dry-lab ways to combat that.
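The length-range restriction can be applied directly to the sequence table; a sketch (the bounds here are illustrative and should be chosen to match your expected amplicon length):

# Keep only ASVs whose length falls in the expected range
seqtab.filt <- seqtab[, nchar(colnames(seqtab)) %in% seq(420, 520)]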

I am unsure how others have managed to successfully sequence and merge V1V3 sequences.

It is common for folks to find full V1V3 ASV-level analysis challenging. Those who succeed typically either have a very high quality sequencing provider, thus getting 2x300 sequences that maintain good quality beyond position 200 in the reverse reads, or they work with the forward reads alone.

natashaztarora commented 1 year ago

Thank you very much, it is helpful to know we are not the only ones finding full V1V3 ASV-level analysis difficult. For data generated in the past, we will use the forward reads (we are doing a comparison based on different regions of the 16S rRNA gene). We really appreciate having a forum here where we can get help. Best regards, Natasha