benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Variable loss of large number of reads at mergePairs step #1230

Closed MatS792 closed 3 years ago

MatS792 commented 3 years ago

Hi Benj, I am wondering why I am losing such a large number of reads at the merging step.

For the analysis I used a workstation with a 9th-generation i7 processor, 32 GB of RAM, Ubuntu 20.04, and dada2 1.14.1. I am trying to analyze 16S rRNA sequences from about 100-150 samples. The V4-V5 regions were amplified with the universal primers 515FB = GTGYCAGCMGCCGCGGTAA and 926R = CCGYCAATTYMTTTRAGTTT. The first time I ran the dada2 pipeline (1.14), a lot of reads didn't pass the merging step, giving me 0 merged pairs. To solve this I tried changing the filtering and trimming settings (truncLen and maxEE), but I always ran into the same problem. It appears that, depending on the parameters, only a subset of the sequences gave good results.

Therefore I divided the fastq files into subgroups, with the aim of analyzing them separately and combining the results once I got a satisfying outcome.

I started with the first group, the largest in terms of file size. The quality plots are attached (forward: QplotF, reverse: QplotR).

I got the best result with these filter and trim settings:

```r
out1 <- filterAndTrim(fnFs1, filtFs1, fnRs1, filtRs1, truncLen=c(260,200),
                      maxN=0, maxEE=c(4,8), truncQ=2, rm.phix=TRUE,
                      trimLeft=c(20,21), compress=TRUE, multithread=TRUE)
```

Output (out1):

```
                              reads.in reads.out
G34_S74_L001_R1_001.fastq.gz     31589     29854
G34_S74_L001_R1_001.fastq.gz     49844     47163
G36_S86_L001_R1_001.fastq.gz     21408     20341
G41_S15_L001_R1_001.fastq.gz      2860      2713
G45_S169_L001_R1_001.fastq.gz     3125      2562
G47_S39_L001_R1_001.fastq.gz      3497      3284
```

On average, 93% of reads passed the filter & trim step (computed as mean(out1[,2]/out1[,1])).

For the denoising steps I used:

```r
dadaFs1 <- dada(derepFs1, err=errF1, pool="pseudo", multithread=TRUE)
dadaRs1 <- dada(derepRs1, err=errR1, pool="pseudo", multithread=TRUE)
```

For the merging step I got the best result using minOverlap = 10 and maxMismatch = 5:

```r
mergers1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                       minOverlap = 10, maxMismatch = 5, verbose=TRUE)
```
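(A diagnostic sketch that may help anyone hitting similar losses: mergePairs accepts a returnRejects argument, which keeps the failed pairings so their overlap and mismatch counts can be inspected. Object names follow the thread; the column selection is just for illustration.)

```r
# Keep rejected pairings to see why merges fail; the 'accept' column
# marks the pairings that passed the minOverlap/maxMismatch criteria.
rej1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                   minOverlap = 10, maxMismatch = 5,
                   returnRejects = TRUE, verbose = TRUE)
# Inspect match/mismatch counts of the rejected pairings in sample 1
head(rej1[[1]][!rej1[[1]]$accept, c("nmatch", "nmismatch")])
```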

```
    input filtered denoisedF denoisedR merged
G32 31589    29854     29783     29741  29680
G34 49844    47163     47059     47015  46766
G36 21408    20341     20301     20207  17834
G41  2860     2713      2695      2683   2472
G45  3125     2562      2479      2493   2406
G47  3497     3284      3246      3233   3225
```

On average, 92% of reads were merged (computed as mean(track1[,5]/track1[,2])).

The problems began with the second subgroup, of about 20 samples. The quality plots are attached (forward: QplotF1, reverse: QplotR1).

Because of their low quality, samples G35, G53, L9, L52, L56, and S55A were excluded from the analysis.

For this subgroup I tried several settings for the filter & trim and merging steps. An example track object:

```
      input filtered denoisedF denoisedR merged
G25    3020     2985      2931      2894   1113
G49    1582     1544      1470      1432   1413
G51    1662     1637      1536      1521   1475
G58    1524     1444      1329      1327   1275
L1     7639     7528      7293      7209   3474
L21    1832     1782      1712      1646    596
L35    6002     5906      5768      5688   3030
L36   31320    30589     30184     29775  21161
L4     2294     2242      1988      1940   1390
L41    5366     5270      5244      5165   3009
L47    2232     2133      2029      2008   1411
L49    1880     1825      1752      1728    810
S34AB 12332    12142     12098     11949   4100
S35AB 12453    12300     12284     12105   7268
S39A  17971    17626     17374     17153   4687
S47B   3159     3116      3074      3046   1699
S4AB   7815     7654      7533      7449   1526
S54AB  1533     1508      1458      1450    511
WD1    2063     2032      1920      1872   1821
```

I'm really concerned about samples, seen in other subgroups too, that pass the filtering with 20,000-30,000 reads and then end up with only a few hundred merged.

A summary table of the obtained results:

| truncLen (Fw,Rw) | maxEE | % pass trim | minOverlap | maxMismatch | % merged | ASVs |
|---|---|---|---|---|---|---|
| 240, 230 | 5,8 | 93% | 20 | 0 | 56% | 269 |
| 240, 230 | 5,8 | 93% | 10 | 5 | 56% | 323 |
| 240, 230 | 7,9 | 96% | 20 | 0 | 55% | 268 |
| 240, 230 | 7,9 | 96% | 10 | 5 | 56% | 323 |
| 220, 220 | 7,9 | 98% | 20 | 0 | 54% | 291 |
| 220, 220 | 7,9 | 98% | 10 | 5 | 55% | 332 |
| 220, 220 | 7,9 | 98% | 5 | 5 | 55% | 332 |

% merged is calculated as indicated above (mean(track1[,5]/track1[,2])) for each attempt.

I feel lost! What can I do to increase the percentage of merged reads? Thank you!

benjjneb commented 3 years ago

In the vast majority of cases, large loss percentages in the read-merging step are due to truncation that stops (at least some) biological amplicons from being able to merge.

I am not myself familiar with the expected length distribution for 515F/926R sequencing, but based on the E. coli primer coordinates, the truncation lengths you are using are in the ballpark where merging could start to fail.

What happens when you bump up the truncLen substantially, say to 270/240, in that second group of samples? I'd expect you to lose more reads in filtering, but what percentage gets merged?
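As a back-of-the-envelope check (a sketch under stated assumptions: a ~412 bp 515F/926R amplicon from the E. coli coordinates, primer lengths of 19 and 20 nt, and the truncLen/trimLeft values used for the first subgroup; real amplicon lengths vary by taxon):

```r
# Rough expected-overlap arithmetic for paired-end merging.
amplicon <- 412 - 19 - 20   # ~373 nt left after primer removal (assumption)
fwd.kept <- 260 - 20        # truncLen[1] - trimLeft[1]
rev.kept <- 200 - 21        # truncLen[2] - trimLeft[2]
fwd.kept + rev.kept - amplicon   # expected overlap, ~46 nt; must be >= minOverlap
```

Longer-than-expected biological amplicons shrink this number quickly, which is why bumping up truncLen is the first thing to try.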

MatS792 commented 3 years ago

With truncLen=c(270,240) and maxEE=c(7,9), 93% of the reads pass the filter. The percentage of merged reads is again 56%.

benjjneb commented 3 years ago

I don't know based on this information what is going on.

Would you be willing to share with me one sample that merges well, and one sample that doesn't, along with your current processing script? My email is benjamin DOT j DOT callahan AT gmail DOT com

(response times will be slower over the holiday season)

MatS792 commented 3 years ago

Email sent. Thank you.

benjjneb commented 3 years ago

@MatS792 Email received! Thank you. Sorry for the delay, I am just getting back to this today after the holidays. Hope to have more in the next few days.

MatS792 commented 3 years ago

Dear Benji, I hope everything is ok. Is there any news about the sequences I sent you? Best, Matteo


benjjneb commented 3 years ago

Thanks for the ping.

I think I've probably identified the problem: these data appear to be a mix of amplified 16S sequences and amplified ITS sequences. For example, take a look at:

```r
library(ShortRead)
dna <- sread(readFastq("G49_S63_L001_R1_001.fastq.gz"))
head(dna)
```

(screenshot of the output attached: Screen Shot 2021-02-04 at 11 25 08 AM)

The primer sequence is at the start of the first 3 sequences, as expected, but not of the next 3. There is also far more variation between the first 3 and the next 3 sequences than would be expected in a conserved priming region. So, BLAST them against nt:

```r
dada2:::pfasta(head(dna))
```

The results are that the first 3 sequences hit bacterial 16S sequences, while the next 3 hit ITS genes from Rhodosporidiobolus colostri, Leucosporidium sp., and Alternaria tenuissima (i.e., various fungi).

This likely explains the variable merging percentages, as the ITS amplicons are of unknown length distribution, and probably often fail to merge.

This could be off-target amplification, but given how different the start of these ITS sequences is from the primer sequences, is it possible these data were generated with a mix of primers that included fungal ITS primers?
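One rough way to triage such a mixed library is to split each fastq by whether the 515FB primer (with its degenerate bases) appears at the read start. This is a sketch, not a dada2 feature: it uses Biostrings degenerate-base matching; the input filename is from this thread, the output filenames and the mismatch allowance are illustrative.

```r
library(ShortRead)
library(Biostrings)

fq <- readFastq("G49_S63_L001_R1_001.fastq.gz")
primer <- DNAString("GTGYCAGCMGCCGCGGTAA")  # 515FB, IUPAC degenerate bases

# Look for the primer in the first primer-length window of each read;
# fixed = FALSE makes Y/M match their IUPAC expansions.
starts <- subseq(sread(fq), start = 1, end = length(primer))
has.primer <- vcountPattern(primer, starts,
                            max.mismatch = 2, fixed = FALSE) > 0

writeFastq(fq[has.primer],  "G49_16S_R1.fastq.gz")   # putative 16S reads
writeFastq(fq[!has.primer], "G49_other_R1.fastq.gz") # putative ITS/other
```

A dedicated primer-trimming tool (e.g. cutadapt, as used in the dada2 ITS workflow) can do the same demultiplexing-by-primer more robustly, including on the reverse reads.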

MatS792 commented 3 years ago

Hi Benji, first of all thank you very much for your help; we are learning a lot from this mess. I will contact the person who managed the sequencing to find out whether they know anything about ITS primers in our samples. However, I have some questions. At this point we want to separate the 16S and ITS reads. Is there a way in DADA2 to split the "16S" fastq files from the "ITS" fastq files? Otherwise, if I use justConcatenate=TRUE during the merging step, what happens to my 16S samples? In that case, should I assign the taxonomy using SILVA files for 16S and ITS, respectively? What kind of strategy do you suggest? Best, M.S.
