Closed MatS792 closed 3 years ago
In the vast majority of cases, large loss percentages in the read-merging step are due to truncation that stops (at least some) biological amplicons from being able to merge.
I am not myself familiar with the expected length distribution for 515F/926R amplicons, but based on the truncation lengths you are evaluating and the E. coli primer coordinates, it is in the ballpark where amplicons could fail to merge.
What happens when you bump up the truncLen substantially, say to 270/240, in that second group of samples? I'd assume you lose more reads in filtering, but what percentage gets merged?
With truncLen=c(270, 240) and maxEE=c(7, 9), 93% of the reads pass the filter, but the percentage of merged reads is again 56%.
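For reference, the filter call with those settings looks roughly like this (the file-path objects are placeholders, and truncQ/rm.phix are shown at their dada2 defaults, not settings confirmed in this thread):

# filter and trim with the truncation lengths suggested above (sketch)
out2 <- filterAndTrim(fnFs2, filtFs2, fnRs2, filtRs2,
                      truncLen = c(270, 240), maxEE = c(7, 9),
                      truncQ = 2, rm.phix = TRUE,
                      compress = TRUE, multithread = TRUE)
mean(out2[, 2] / out2[, 1])   # fraction of reads passing filtering (~0.93 here)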
Based on this information, I don't know what is going on.
Would you be willing to share with me one sample that merges well, and one sample that doesn't, along with your current processing script? My email is benjamin DOT j DOT callahan AT gmail DOT com
(response times will be slower over the holiday season)
Email sent. Thank you.
@MatS792 Email received! Thank you. Sorry for the delay, I am just getting back to this today after the holidays. Hope to have more in the next few days.
Dear Benji, I hope everything is OK. Is there any news about the sequences I sent you? Best, Matteo
Thanks for the ping.
I think I've probably identified the problem: these data appear to be a mix of amplified 16S sequences and amplified ITS sequences. For example, take a look at:
library(ShortRead)
dna <- sread(readFastq("G49_S63_L001_R1_001.fastq.gz"))
head(dna)
The primer sequence is at the start of the first 3 sequences, as expected, but not at the start of the next 3. There is also far more variation between the first 3 and the next 3 sequences than would be expected in a conserved priming region. So, BLAST them against nt:
dada2:::pfasta(head(dna))
And the results are that the first 3 sequences hit bacterial 16S sequences, while the next 3 sequences hit ITS sequences from Rhodosporidiobolus colostri, Leucosporidium sp., and Alternaria tenuissima (i.e. various fungi).
This likely explains the variable merging percentages, as the ITS amplicons are of unknown length distribution, and probably often fail to merge.
This could be off-target amplification, but given how different the start of these ITS sequences is from the primer sequence, is it possible these data were generated with a mix of primers that included fungal ITS primers?
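A minimal sketch of how one could count which reads begin with the 515F primer and split a file on that basis (the 25-base search window, max.mismatch = 2, and the output file names are assumptions):

library(ShortRead)   # also attaches Biostrings

primer515F <- "GTGYCAGCMGCCGCGGTAA"
fq  <- readFastq("G49_S63_L001_R1_001.fastq.gz")
dna <- sread(fq)

# look for the primer near the start of each forward read,
# allowing IUPAC ambiguity codes (fixed = FALSE) and up to 2 mismatches
starts <- narrow(dna, start = 1, end = pmin(width(dna), 25L))
has_primer <- vcountPattern(primer515F, starts, max.mismatch = 2, fixed = FALSE) > 0
table(has_primer)    # TRUE ~ putative 16S reads, FALSE ~ putative off-target (e.g. ITS)

# write the two subsets to separate files; the same index should be applied
# to the matching R2 file so that read pairs stay in sync
writeFastq(fq[has_primer],  "G49_16S_R1.fastq.gz")
writeFastq(fq[!has_primer], "G49_other_R1.fastq.gz")

A dedicated primer-trimming tool such as cutadapt would do this splitting more carefully, but the counts alone are enough to confirm the mix.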
Hi Benji, first of all, thank you very much for your help; we are learning a lot about this mess. I will contact the person who managed the sequencing to find out whether they know anything about ITS primers in our samples. However, I have some questions. At this point, we want to separate 16S and ITS. Is there a way in DADA2 to divide the "16S" fastq files from the "ITS" fastq files? Otherwise, if I use justConcatenate=TRUE during the merging step, what happens to my 16S samples? In that case, should I assign the taxonomy using SILVA files for 16S and ITS, respectively? What kind of strategy do you suggest? Best, M.S.
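For reference, the option I mean is the mergePairs argument below (a sketch reusing the object names from my script; per the dada2 documentation, the forward read and the reverse-complemented reverse read are joined with a spacer of 10 Ns instead of being aligned over their overlap):

# concatenate read pairs instead of merging them over the overlap region
concat1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                      justConcatenate = TRUE, verbose = TRUE)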
Hi Benji, I am wondering why I am losing such a large fraction of reads at the merging step.
For the analysis, I used a workstation with a 9th-generation Intel i7 processor, 32 GB of RAM, Ubuntu 20.04, and dada2 1.14.1. I am trying to analyze 16S rRNA sequences from about 100-150 samples. The V4-V5 regions were amplified with the universal primers 515FB = GTGYCAGCMGCCGCGGTAA and 926R = CCGYCAATTYMTTTRAGTTT. The first time I ran the dada2 pipeline (1.14), a lot of reads did not pass the merging step, with some samples giving 0 merged reads. To solve this, I tried changing the filtering and trimming settings (truncLen and maxEE), but I always ran into the same problem. It appears that, depending on the parameters, only a subset of the sequences gave good results.
Therefore, I divided the fastq files into subgroups, with the aim of analyzing them at different times and merging the results once I got a satisfying outcome.
I analyzed the first group, the largest in terms of file size (quality profile plots for the forward and reverse reads were attached).
On average, 93% of reads passed the filtering and trimming step (computed as mean(out1[,2]/out1[,1])).
For the dada steps I used:
dadaFs1 <- dada(derepFs1, err=errF1, pool="pseudo", multithread=TRUE)
dadaRs1 <- dada(derepRs1, err=errR1, pool="pseudo", multithread=TRUE)
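The error models errF1 and errR1 above were presumably learned beforehand with learnErrors; a minimal sketch of that step, assuming filtFs1/filtRs1 are the filtered file paths:

# learn the forward and reverse error rates from the filtered reads (sketch)
errF1 <- learnErrors(filtFs1, multithread = TRUE)
errR1 <- learnErrors(filtRs1, multithread = TRUE)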
For the merging step, I got the best result using minOverlap = 10 and maxMismatch = 5:
mergers1 <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1, minOverlap = 10, maxMismatch = 5, verbose=TRUE)
On average, 92% of reads were merged (computed as mean(track1[,5]/track1[,2])).
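For clarity, track1 follows the standard dada2 tutorial layout, so column 2 is the filtered read count and column 5 is the merged read count; a sketch of how such a table is typically built (object names are assumed):

# track reads through the pipeline, one row per sample (sketch)
getN <- function(x) sum(getUniques(x))
track1 <- cbind(out1,
                sapply(dadaFs1, getN), sapply(dadaRs1, getN),
                sapply(mergers1, getN))
colnames(track1) <- c("input", "filtered", "denoisedF", "denoisedR", "merged")
mean(track1[, 5] / track1[, 2])   # fraction of filtered reads that merged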
The problems began with the second subgroup, about 20 samples (quality profile plots for the forward and reverse reads were attached).
Because of their low quality, samples G35, G53, L9, L52, L56, and S55A were excluded from the analysis.
For this subgroup, I tried several settings for the filtering & trimming and merging steps. An example of the resulting track object was attached.
I am really concerned about samples in other subgroups that pass filtering with 20,000-30,000 reads and then have only a few hundred reads merged.
The %merged is calculated as indicated above, mean(track1[,5]/track1[,2]), for every attempt I made.
I feel lost!! What can I do to increase the % of merged reads? Thank you!!
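A minimal diagnostic sketch, assuming the object names from the first subgroup: mergePairs can keep the rejected pairings (returnRejects = TRUE), so the overlap lengths of the pairs that failed to merge can be inspected directly:

# keep rejected pairings so the failed merges can be examined (sketch)
mergers_dbg <- mergePairs(dadaFs1, derepFs1, dadaRs1, derepRs1,
                          minOverlap = 10, maxMismatch = 5,
                          returnRejects = TRUE, verbose = TRUE)
m1 <- mergers_dbg[[1]]              # first sample
table(m1$accept)                    # pairings that merged vs. were rejected
summary(m1$nmatch[!m1$accept])      # overlap length among the rejected pairings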