benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Most data failed to merge #648

Closed Wizardren closed 5 years ago

Wizardren commented 5 years ago

Hi Ben, I am working with DADA2 on Illumina HiSeq 2x250 amplicon data. There seems to be a problem in the read-merging step. As shown below, most reads are lost.

input   filtered  denoiF  denoiR  merged
40832   37202     29204   31495   10735
38017   34840     27307   29507   10346
40114   36584     28851   31344   11909
39223   35788     27830   30454   11204
31978   29273     23483   25713   8798
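(A read-tracking table like this is typically assembled as in the dada2 tutorial; a minimal sketch, assuming out came from filterAndTrim and dadaFs/dadaRs/mergers are the per-sample denoising and merging results:)

    library(dada2)
    getN <- function(x) sum(getUniques(x))   # total reads represented by a dada2 object
    track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN))
    colnames(track) <- c("input", "filtered", "denoiF", "denoiR", "merged")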

I checked the dadaFs and mergers objects; about 1/3 of the inferred forward and reverse ASVs failed to merge, which I think is not too bad. Could this be because mergePairs uses the original reads, which might result in a high mismatch rate? I tried both no truncation and moderate truncation, and the results were consistent.

Should I accept the result or would you have some suggestions? Thank you!

benjjneb commented 5 years ago

Usually low merging rates are because reads were truncated too short to overlap.

What truncation lengths (i.e. truncLen) did you use in the filter and trim step? What amplicon are you sequencing, and are the primers included on it? Do you know the expected range of length of the sequenced amplicon?

Wizardren commented 5 years ago

Thank you for the prompt reply! The target sequence is the 16S V3-V4 region (376 bp excluding primers). The primers (515F & 907R) were removed using trimLeft=c(16,32). I tried both no truncation and truncLen=c(230,240), which gave consistent results.
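(For reference, a minimal sketch of the filtering step being described, assuming fnFs/fnRs and filtFs/filtRs are the input and output file path vectors as in the standard dada2 tutorial; the maxEE/truncQ settings are hypothetical defaults, not values stated above:)

    out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                         trimLeft = c(16, 32),          # strip the 515F/907R primer bases
                         truncLen = c(230, 240),        # the moderate-truncation scenario
                         maxEE = c(2, 2), truncQ = 2,   # hypothetical quality filters
                         multithread = TRUE)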

benjjneb commented 5 years ago

Can you clarify: Is it V3V4? Or V4V5? The primers you gave are for V4V5.

If it is V4V5 using the 515F/907R primers then your truncation lengths should be long enough for merging, but if it's really V3V4 then they won't be.

When you say "consistent results" with no truncation, you mean that you also saw a very low merge rate in that scenario?

Wizardren commented 5 years ago

Sorry, it's the V4V5 region. Yes, I tried no truncation; the merged read counts were comparable to the truncLen=c(230,240) scenario.

Wizardren commented 5 years ago

Also, I noticed that the sequence error rate estimates from learnErrors were unusual.

Could that lead to the failed merging?

[attached image: plotErrors output]
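(For context, a minimal sketch of how those error-rate fits are produced and plotted, assuming filtFs/filtRs are the filtered fastq paths:)

    errF <- learnErrors(filtFs, multithread = TRUE)   # learn forward-read error rates
    errR <- learnErrors(filtRs, multithread = TRUE)   # learn reverse-read error rates
    plotErrors(errF, nominalQ = TRUE)                 # observed vs. fitted error rates by quality score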

benjjneb commented 5 years ago

That does look unusual at least.

For a typical sample in this dataset, what is printed out when you dereplicate? I.e. what is printed to screen from derepFastq(fnF, verbose=TRUE) and derepFastq(fnR, verbose=TRUE)?

Could you also post the results of plotQualityProfile on one of the samples as well (F and R)?
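(A minimal sketch of those diagnostic calls, assuming fnF/fnR are the fastq paths for one sample:)

    derepF <- derepFastq(fnF, verbose = TRUE)   # prints total reads and number of unique sequences
    derepR <- derepFastq(fnR, verbose = TRUE)
    plotQualityProfile(c(fnF, fnR))             # per-cycle quality profiles, forward and reverse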

Wizardren commented 5 years ago

Identical reads seem to be uncommon. [attached image: derepFastq output] Below are the quality profiles of three samples before trimming.

[attached image: quality profiles before trimming]

Here are the profiles after filterAndTrim. [attached image: quality profiles after filterAndTrim]

Wizardren commented 5 years ago

Hi Ben, I have attached data for 10 samples. Would it be convenient for you to help check the problem? Thank you very much! Primer removal was done with trimLeft=c(32,16). One thing to note is that the forward reads (in split 1) are actually the reverse sequences (amplified with 907R), and vice versa. I'm not sure whether this can affect the subsequent analysis.

Thank you very much for your help!

Best, Wizard

--

Ren Weizheng, Postdoc, College of Life Sciences, Zhejiang University, Hangzhou, 310058, China


Cloud attachment sent via NetEase 126 Mail: Sample_Wizardren.rar (74.98 MB, expires 2019-01-24 22:04)

benjjneb commented 5 years ago

Can you try one thing: What happens if you run through the pipeline on a subset of your data (perhaps the 10 samples), but with the pool=TRUE or pool="pseudo" option in the denoising step? Does that result in increased numbers of merged reads at the end?

Those modes will take longer to run, but should still be pretty easily tractable on 10 samples.
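(A minimal sketch of that pooled run, assuming error models errF/errR have been learned and derepFs/derepRs hold the dereplicated filtered reads:)

    dadaFs <- dada(derepFs, err = errF, pool = TRUE, multithread = TRUE)   # pool all samples before inference
    dadaRs <- dada(derepRs, err = errR, pool = TRUE, multithread = TRUE)
    # or the lighter-weight alternative: pool = "pseudo"
    mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE)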

Wizardren commented 5 years ago

I tried a 10-sample subset analysis with pool=TRUE; the result was still frustrating. Below are the track table and the plotErrors output.

Sample   input   filter  denoiF  denoR   merge  nochim
BK.RF1   40832   37202   32776   34213   13225  11568
BK.RF2   38017   34840   31110   32331   13140  11540
BK.RM1   40114   36584   32864   34023   14472  12100
BK.RM2   39223   35788   31711   33216   13745  11504
DP.RF1   31978   29273   26557   27636   11269  9701
DP.RF2   42387   38739   35108   36214   15480  12982
DP.RF3   31752   28833   26146   27181   11125  9677
DP.RF4   35063   31929   28658   29686   11247  9795
DP.RM1   29487   26781   24206   25215   9591   8285
DP.RM2   27396   25095   22740   23781   9127   7724

[attached image: plotErrors output from the pooled run]

benjjneb commented 5 years ago

Hm....

Can you share a single sample? Your previous message said you attached 10 samples, but I don't think it worked (large files can't be attached to GH threads). You could send it to me via email as well: benjamin DOT j DOT callahan AT gmail DOT com

benjjneb commented 5 years ago

Thanks for sending the example samples. I am seeing basically the same thing you are. Using pool=TRUE I can up the merge rate from ~30% to ~38%, but that is still quite low, and I don't see an obvious reason why (e.g. no low complexity sequences, no other obvious junk DNA).

That is unusual in my experience for V4V5 data, which usually processes quite cleanly. That said, there are lots of idiosyncrasies that can crop up in sequencing data, and I'm not seeing any other major red flag wrt the algorithm working as it should besides the low merge rate.

Using just the forward reads is also an option. My suggestion would be to do a basic comparison of forward-only and merged results on a few samples. If they are similar wrt the metrics important to your study, you can use either, otherwise use the one that is more appropriate.
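(A minimal sketch of the forward-only route, assuming dadaFs from the denoising step above:)

    seqtabF <- makeSequenceTable(dadaFs)    # skip merging; build the table from denoised forward reads only
    seqtabF <- removeBimeraDenovo(seqtabF, method = "consensus", multithread = TRUE)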

ps: Actually, one other option: you can allow mismatches in merging (the default is to discard every pair that doesn't overlap perfectly) with, e.g., the mergePairs(..., maxMismatch=1) parameter. I don't suggest raising this parameter very high, but upping it to 1 or 2 may make sense here.
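(A sketch of that relaxed merging call, assuming the usual dadaFs/derepFs/dadaRs/derepRs objects:)

    mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
                          maxMismatch = 1, verbose = TRUE)   # tolerate one mismatch in the overlap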

Wizardren commented 5 years ago

Thank you for your suggestions! I think that will help me work it out.

microDM commented 5 years ago

I am facing the same issue: I have paired-end sequence data for the V3-V4 region, with primers 19 and 22 nt long. I trimmed the sequences using the following command:

filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(260,200), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, verbose=TRUE, matchIDs=TRUE, trimLeft=c(19,22))

The read tracking looked like this:

Sample    input   filtered  denoisedF  denoisedR  merged  nonchim
047B01H   76642   68449     68449      68449      349     326
047B07H   87026   66492     66492      66492      1452    1403
401B01H   75366   66988     66988      66988      265     259
401B07H   135293  106850    106850     106850     1380    1303
425B01H   71868   47981     47981      47981      31334   31296
425B07H   84550   66479     66479      66479      15313   14505
427B01H   74728   56817     56817      56817      18262   18220
427B07H   119190  90431     90431      90431      6897    6784

I tried merging the reads with PEAR, which merges a significant fraction of them, but the sequence quality in the middle of the reads is poor. I have 2 questions:

  1. How can I tackle this problem? I have tried the minOverlap and maxMismatch parameters.
  2. Can I merge the sequences with PEAR and then continue with the single-end pipeline of dada2?
benjjneb commented 5 years ago

@drdee255 V3V4 is about 460 nts long when the primers are sequenced, so your truncation lengths are too short. Too-short truncation is almost always the cause of low merge success. Try increasing the truncation lengths by 40 nts total between the forward and reverse reads.
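(For example, a hedged sketch with hypothetical longer truncation lengths, assuming the reads are long enough, e.g. 2x300, to support them; 260+200 = 460 leaves essentially no overlap on a ~460 nt amplicon, while 280+220 = 500 leaves roughly 40 nt:)

    filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                  truncLen = c(280, 220),   # hypothetical: +20 nt on each read
                  trimLeft = c(19, 22), maxN = 0, maxEE = c(2, 2), truncQ = 2,
                  rm.phix = TRUE, compress = TRUE, multithread = TRUE,
                  verbose = TRUE, matchIDs = TRUE)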