benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Variable loss of large number of reads at mergePairs step #1230

Closed MatS792 closed 3 years ago

MatS792 commented 3 years ago

Hi Benj, I am wondering why I am losing such a large number of reads at the merging step.

For the analysis I used a workstation with a 9th-generation i7 processor, 32 GB of RAM, Ubuntu 20.04, and dada2 1.14.1. I am trying to analyze 16S rRNA sequences from about 100-150 samples. The V4-V5 regions were amplified with the universal primers 515FB = GTGYCAGCMGCCGCGGTAA and 926R = CCGYCAATTYMTTTRAGTTT. The first time I ran the dada2 pipeline (1.14), a lot of reads didn't pass the merging step, giving me 0 merged pairs. To solve this I tried changing the filtering and trimming settings (truncLen and maxEE), but I always ran into the same problem. It appears that, depending on the parameters, only a subset of the sequences gave good results.

Therefore I divided the fastq files into subgroups, with the aim of analyzing them separately and combining the results once I got a satisfying outcome.

I started with the first group, the largest in terms of file size. The quality plots are attached (forward: QplotF, reverse: QplotR).

I got the best result with these filter and trim settings:

```r
out1 <- filterAndTrim(fnFs1, filtFs1, fnRs1, filtRs1, truncLen=c(260,200),
                      maxN=0, maxEE=c(4,8), truncQ=2, rm.phix=TRUE,
                      trimLeft=c(20,21), compress=TRUE, multithread=TRUE)
```

Output (out1):

```
                              reads.in reads.out
G34_S74_L001_R1_001.fastq.gz     31589     29854
G34_S74_L001_R1_001.fastq.gz     49844     47163
G36_S86_L001_R1_001.fastq.gz     21408     20341
G41_S15_L001_R1_001.fastq.gz      2860      2713
G45_S169_L001_R1_001.fastq.gz     3125      2562
G47_S39_L001_R1_001.fastq.gz      3497      3284
```

On average, 93% of reads passed the filter & trim step (computed as mean(out1[,2]/out1[,1])).

For the denoising steps I used:

```r
dadaFs1 <- dada(derepFs1, err=errF1, pool="pseudo", multithread=TRUE)
dadaRs1 <- dada(derepRs1, err=errR1, pool="pseudo", multithread=TRUE)
```

For the merging step I got the best result using minOverlap = 10 and maxMismatch = 5:

```r
mergers1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                       minOverlap = 10, maxMismatch = 5, verbose=TRUE)
```
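(A diagnostic sketch that may help anyone hitting similar losses: mergePairs accepts a returnRejects argument, which keeps the failed pairings so their overlap and mismatch counts can be inspected. Object names follow the thread; the column selection is just for illustration.)

```r
# Keep rejected pairings to see why merges fail; the 'accept' column
# marks the pairings that passed the minOverlap/maxMismatch criteria.
rej1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                   minOverlap = 10, maxMismatch = 5,
                   returnRejects = TRUE, verbose = TRUE)
# Inspect match/mismatch counts of the rejected pairings in sample 1
head(rej1[[1]][!rej1[[1]]$accept, c("nmatch", "nmismatch")])
```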

```
    input filtered denoisedF denoisedR merged
G32 31589    29854     29783     29741  29680
G34 49844    47163     47059     47015  46766
G36 21408    20341     20301     20207  17834
G41  2860     2713      2695      2683   2472
G45  3125     2562      2479      2493   2406
G47  3497     3284      3246      3233   3225
```

On average, 92% of reads were merged (computed as mean(track1[,5]/track1[,2])).

The problems began with the second subgroup, of about 20 samples. The quality plots are attached (forward: QplotF1, reverse: QplotR1).

Because of their low quality, samples G35, G53, L9, L52, L56, and S55A were excluded from the analysis.

For this subgroup I tried several settings for the filter & trim and merging steps. An example track object:

```
      input filtered denoisedF denoisedR merged
G25    3020     2985      2931      2894   1113
G49    1582     1544      1470      1432   1413
G51    1662     1637      1536      1521   1475
G58    1524     1444      1329      1327   1275
L1     7639     7528      7293      7209   3474
L21    1832     1782      1712      1646    596
L35    6002     5906      5768      5688   3030
L36   31320    30589     30184     29775  21161
L4     2294     2242      1988      1940   1390
L41    5366     5270      5244      5165   3009
L47    2232     2133      2029      2008   1411
L49    1880     1825      1752      1728    810
S34AB 12332    12142     12098     11949   4100
S35AB 12453    12300     12284     12105   7268
S39A  17971    17626     17374     17153   4687
S47B   3159     3116      3074      3046   1699
S4AB   7815     7654      7533      7449   1526
S54AB  1533     1508      1458      1450    511
WD1    2063     2032      1920      1872   1821
```

I'm really concerned about samples, seen in other subgroups too, that pass the filtering with 20,000-30,000 reads and then end up with only a few hundred merged.

A summary table of the obtained results:

| truncLen (Fw,Rw) | maxEE | % pass trim | minOverlap | maxMismatch | % merged | ASVs |
|---|---|---|---|---|---|---|
| 240, 230 | 5,8 | 93% | 20 | 0 | 56% | 269 |
| 240, 230 | 5,8 | 93% | 10 | 5 | 56% | 323 |
| 240, 230 | 7,9 | 96% | 20 | 0 | 55% | 268 |
| 240, 230 | 7,9 | 96% | 10 | 5 | 56% | 323 |
| 220, 220 | 7,9 | 98% | 20 | 0 | 54% | 291 |
| 220, 220 | 7,9 | 98% | 10 | 5 | 55% | 332 |
| 220, 220 | 7,9 | 98% | 5 | 5 | 55% | 332 |

% merged is calculated as indicated above (mean(track1[,5]/track1[,2])) for each attempt.

I feel lost! What can I do to increase the percentage of merged reads? Thank you!

benjjneb commented 3 years ago

In the vast majority of cases, large loss percentages in the read-merging step are due to truncation that stops (at least some) biological amplicons from being able to merge.

I am not myself familiar with the expected length distribution for 515F/926R sequencing, but based on the E. coli primer coordinates, the truncation lengths you are using are in the ballpark where merging could start to fail.

What happens when you bump up the truncLen substantially, say to 270/240, in that second group of samples? I'd expect you to lose more reads in filtering, but what percentage gets merged?
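As a back-of-the-envelope check (a sketch under stated assumptions: a ~412 bp 515F/926R amplicon from the E. coli coordinates, primer lengths of 19 and 20 nt, and the truncLen/trimLeft values used for the first subgroup; real amplicon lengths vary by taxon):

```r
# Rough expected-overlap arithmetic for paired-end merging.
amplicon <- 412 - 19 - 20   # ~373 nt left after primer removal (assumption)
fwd.kept <- 260 - 20        # truncLen[1] - trimLeft[1]
rev.kept <- 200 - 21        # truncLen[2] - trimLeft[2]
fwd.kept + rev.kept - amplicon   # expected overlap, ~46 nt; must be >= minOverlap
```

Longer-than-expected biological amplicons shrink this number quickly, which is why bumping up truncLen is the first thing to try.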

MatS792 commented 3 years ago

With truncLen=c(270,240) and maxEE=c(7,9), 93% of the reads pass the filter. The percentage of merged reads is again 56%.

benjjneb commented 3 years ago

I don't know based on this information what is going on.

Would you be willing to share with me one sample that merges well, and one sample that doesn't, along with your current processing script? My email is benjamin DOT j DOT callahan AT gmail DOT com

(response times will be slower over the holiday season)

MatS792 commented 3 years ago

Email sent. Thank you.

benjjneb commented 3 years ago

@MatS792 Email received! Thank you. Sorry for the delay, I am just getting back to this today after the holidays. Hope to have more in the next few days.

MatS792 commented 3 years ago

Dear Benji, I hope everything is ok. Is there any news about the sequences I sent you? Best, Matteo


benjjneb commented 3 years ago

Thanks for the ping.

I think I've probably identified the problem: these data appear to be a mix of amplified 16S sequences and amplified ITS sequences. For example, take a look at:

```r
library(ShortRead)
dna <- sread(readFastq("G49_S63_L001_R1_001.fastq.gz"))
head(dna)
```

(screenshot of the output attached: Screen Shot 2021-02-04 at 11 25 08 AM)

The primer sequence is at the start of the first 3 sequences, as expected, but not of the next 3. There is also far more variation between the first 3 and the next 3 sequences than would be expected in a conserved priming region. So, BLAST them against nt:

```r
dada2:::pfasta(head(dna))
```

The results are that the first 3 sequences hit bacterial 16S sequences, while the next 3 hit ITS genes from Rhodosporidiobolus colostri, Leucosporidium sp., and Alternaria tenuissima (i.e., various fungi).

This likely explains the variable merging percentages, as the ITS amplicons are of unknown length distribution, and probably often fail to merge.

This could be off-target amplification, but given how different the start of these ITS sequences is from the primer sequences, is it possible these data were generated with a mix of primers that included fungal ITS primers?
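One rough way to triage such a mixed library is to split each fastq by whether the 515FB primer (with its degenerate bases) appears at the read start. This is a sketch, not a dada2 feature: it uses Biostrings degenerate-base matching; the input filename is from this thread, the output filenames and the mismatch allowance are illustrative.

```r
library(ShortRead)
library(Biostrings)

fq <- readFastq("G49_S63_L001_R1_001.fastq.gz")
primer <- DNAString("GTGYCAGCMGCCGCGGTAA")  # 515FB, IUPAC degenerate bases

# Look for the primer in the first primer-length window of each read;
# fixed = FALSE makes Y/M match their IUPAC expansions.
starts <- subseq(sread(fq), start = 1, end = length(primer))
has.primer <- vcountPattern(primer, starts,
                            max.mismatch = 2, fixed = FALSE) > 0

writeFastq(fq[has.primer],  "G49_16S_R1.fastq.gz")   # putative 16S reads
writeFastq(fq[!has.primer], "G49_other_R1.fastq.gz") # putative ITS/other
```

A dedicated primer-trimming tool (e.g. cutadapt, as used in the dada2 ITS workflow) can do the same demultiplexing-by-primer more robustly, including on the reverse reads.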

MatS792 commented 3 years ago

Hi Benji, first of all thank you very much for your help; we are learning a lot from this mess. I will contact the person who managed the sequencing to find out whether they know anything about ITS primers in our samples. However, I have some questions. At this point we want to separate the 16S and ITS reads. Is there a way in DADA2 to split the "16S" fastq files from the "ITS" fastq files? Otherwise, if I use justConcatenate=TRUE during the merging step, what happens to my 16S samples? In that case, should I assign the taxonomy using SILVA files for 16S and ITS, respectively? What kind of strategy do you suggest? Best, M.S.
