Closed: QGaelle closed this issue 5 months ago
We recommend that the truncLen of your forward and reverse reads add up to the length of the sequenced amplicon, plus 20 bp. Your parameters put you basically at that point: a total truncLen of 430 and an amplicon length of 411.
Hi @benjjneb, thank you for your time. After reading your answer, I realised that I got confused during my multiple tests. I thought I had good merging with truncLen=c(240,150), not with truncLen=c(240,190) as posted. Hence, I could not understand why merging was OK when, obviously, if I cut at 240 and 150 there isn't any overlap anymore, am I correct? I ran again with truncLen=c(240,150) and merging is indeed very bad. Sorry about that, and thank you for always replying so quickly.
if I cut at 240 and 150 there isn't any overlap anymore, am I correct?
Yep that's right, so your results make sense as far as I can tell.
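For reference, the expected overlap can be sanity-checked with one line of R: it is roughly the sum of the two truncLen values minus the amplicon length (any trimLeft cancels out of this arithmetic, since it shortens the reads and the effective amplicon equally), and mergePairs requires at least 12 overlapping bases by default. A rough sketch, assuming the 411 bp amplicon from this thread:

amplicon <- 411                       # expected amplicon length, primers included
overlap <- function(truncLen) sum(truncLen) - amplicon
overlap(c(240, 190))  #  19 nt -> merging possible (>= 12 nt default minOverlap)
overlap(c(240, 150))  # -21 nt -> reads cannot overlap, so merging fails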
Perfect, thanks a lot. May I ask one last question? I guess there is always a trade-off between truncLen and merging. What percentage of loss at the merging step is acceptable?
If I have the following quality plots:
Considering these plots, I tried this set of parameters.
out_1 <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(225,205),
                       maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
                       compress=TRUE, multithread=TRUE)
> head(track)
input filtered denoisedF denoisedR merged nonchim
ETS-1220-02 70176 67179 66962 66658 46807 46560
PIE1-3u 54641 47207 45771 45386 38140 36641
RUN21-GRB-1 41661 38953 38589 38473 30845 30678
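(For context, a tracking table like this one is typically assembled following the standard DADA2 tutorial pattern; the dadaFs, dadaRs, mergers and seqtab.nochim objects are assumed from the usual pipeline steps.)

getN <- function(x) sum(getUniques(x))
track <- cbind(out_1, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names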
Is the percentage of loss at the merging step acceptable? Would it be better to cut more, or to increase the merging percentage?
Thanks Gaëlle
In general, you should be looking for 90%, or at least 80%, of reads to pass merging. You aren't seeing anything close to that, so it is a concern.
Is there length variation in this locus, some of which is being lost by trying to "optimize" truncLen
to be at a minimum compared to your expected lengths? That kind of optimization is almost never the right choice. Keep biological length variation, and accept more losses at filtering.
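As a quick check against those thresholds, the merge survival can be computed straight from a tracking table like the one above (a sketch; track as built in the standard tutorial):

merge_rate <- track[, "merged"] / track[, "denoisedF"]  # fraction surviving merging
round(100 * merge_rate, 1)
# for the table above this gives roughly 70%, 83% and 80%, short of the ~90% target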
Hi @benjjneb,
Thank you for your last comment. I have been reading a lot these last couple of days and the numerous threads you have contributed to have been very useful. Some doubts remain on a few aspects, and I would like your opinion on them.
After your last comment, I tried different parameters to see how much they changed the loss at the filtering and merging steps and, to be honest, I am a bit confused. I used the same subset of the data as in my question just above, where I show the quality plots.
out_2 <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(225,225),
                       maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
                       compress=TRUE, multithread=TRUE)
> head(track)
input filtered denoisedF denoisedR merged nonchim
ETS-1220-02 25609 24979 24746 24726 11097 11090
PIE1-3u 27157 24327 24295 23849 10086 10083
RUN21-GRB-1 22087 21156 20794 20542 18885 18631
In comparison to when I used truncLen=c(225,205) (see above), the percentage loss at the merging step is even greater, while the loss at the filtering step hasn't really changed. I was expecting better merging. I don't think I can cut more than the previous truncLen=c(225,205) or I won't get any overlap anymore? Sorry if that is a trivial question, but could you maybe explain why the merging is worse?
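(A side note on the overlap arithmetic here, using the same sum-minus-amplicon sketch as above: c(225,225) actually leaves more overlap than c(225,205), so a lack of overlap cannot explain the worse merging; the heterogeneity spacers identified further down the thread turn out to be the culprit.)

(225 + 225) - 411  # 39 nt of expected overlap
(225 + 205) - 411  # 19 nt, i.e. the earlier setting had less overlap, yet merged better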
In your comment, you were also asking whether there was length variation in this locus. If I take a look at the fastq files of one of my samples, for example (see photo below), and check read lengths at random, it appears that some reads are 250 nucleotides long, others 251 or 252. Is that what you meant?
The highlighted part of the sequence on the photo below shows the presence of the primer: 16Sbac-515F GTGYCAGCMGCCGCGGTAA
In my parameters, I used trimLeft=c(19,20) because my primers appear to be 19 and 20 nucleotides, but according to the fastq files, to remove the forward primer entirely I should remove 20, 21 or sometimes 22 nucleotides. Is that a problem? Does this mean that the reads won't have the same starting point after trimming left? How many should I trim then?
Sorry, that is a lot of questions. I am trying to be very meticulous about these first steps. I have 4 different runs that I will merge later and I want to do the first steps right for each run.
Thanks a lot in advance for your help, Gaëlle
In my parameters, I used trimLeft=c(19,20) because my primers appear to be 19 and 20 nucleotides, but according to the fastq files, to remove the forward primer entirely I should remove 20, 21 or sometimes 22 nucleotides. Is that a problem?
Yes. This is a major problem. DADA2 assumes that your reads all start at a consistent position. What may be going on here is that your sequencing provider used "heterogeneity spacers", i.e. variable-length pads at the start of the reads. This is sometimes done to address the low base complexity of amplicon sequencing, which interferes with Illumina's base-calling calibration algorithm. (However, just adding 5-10% phiX works just as well, and is recommended because the data is simpler to analyze afterwards.)
What I would recommend is inspecting a set of about 50 forward reads: identify the 515F primer and record the length of the "pad" before the primer starts. If that pad is variable in length, you will need to find an external solution to remove the primers, so that when you enter DADA2 the reads (both forward and reverse) all begin at the same position.
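One illustrative way to do such pad removal in R (a rough sketch only; cutadapt is a more common external tool, and the Trimming_HS_Metagenomics script used below is another option). It assumes ShortRead/Biostrings and trims every read so it begins exactly at the 515F primer, discarding reads where the primer is not found near the start; trimLeft in filterAndTrim can then remove the primer itself:

library(ShortRead)  # also loads Biostrings

primer <- DNAString("GTGYCAGCMGCCGCGGTAA")  # 16Sbac-515F; Y and M are IUPAC codes

trim_pads <- function(fq_in, fq_out, primer, max_pad = 8) {
  reads <- readFastq(fq_in)  # whole file in memory; FastqStreamer would scale better
  # fixed = FALSE lets the ambiguity codes in the primer match A/C/G/T
  hits <- vmatchPattern(primer, sread(reads), fixed = FALSE)
  starts <- vapply(startIndex(hits),
                   function(s) if (length(s)) s[[1]] else NA_integer_,
                   integer(1))
  keep <- !is.na(starts) & starts <= max_pad + 1
  # drop the pad so every kept read begins at the primer's first base
  writeFastq(narrow(reads[keep], start = starts[keep]), fq_out, compress = TRUE)
}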
Hi @benjjneb,
Getting back to this thread after the holiday break. Thank you for your last comment. Following your suggestion, I inspected about 50 forward reads and found that the length of the pad before the primer starts varied between 1 and 3 nucleotides. To remove the HS (heterogeneity spacers), I ran the Trimming_HS_Metagenomics script, which worked great. I used the non-variable part of my primer 16Sbac-515F GTGYCAGCMGCCGCGGTAA and, as a result, all the bases before the indicated part of the primer were removed. I then just have to use trimLeft in DADA2 to remove the remaining bases of the primer. So this part is OK, I guess?
However, looking at the output file after using the Trimming_HS script, I noticed that some reads were removed completely because the primer was not found. The highlighted part of the sequence on the photo below shows the presence of the primer 16Sbac-515F GTGYCAGCMGCCGCGGTAA, and you can see that the fourth read is not highlighted. It looks like a nucleotide is missing: GTGYCAGCMGCGCGGTAA instead of GTGYCAGCMGCCGCGGTAA.
Original fastq file
Output Trimming_HS indicating the number of nucleotides removed before the highlighted part
I blasted the sequence and it appears to be a valid sequence, relevant to my dataset. In this case there is a missing nucleotide, but there are probably many other reasons why the primer would not be found. Is it worrisome? Should I just keep going even if several reads will be removed, as this isn't an isolated situation and there are several similar cases among my fastq files?
Thank you so much in advance, Gaëlle
I blasted the sequence and it appears to be a valid sequence, relevant to my dataset. In this case there is a missing nucleotide, but there are probably many other reasons why the primer would not be found. Is it worrisome? Should I just keep going even if several reads will be removed, as this isn't an isolated situation and there are several similar cases among my fastq files?
Losing some reads because the primer is mis-sequenced is not a problem, assuming that the mis-sequencing is random across your reads.
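If one wanted to estimate how widespread such primer errors are before deciding, Biostrings can count primer hits with a mismatch allowance (a hypothetical sketch; the file name is made up, and note that an indel error like the deleted C above would still not be matched this way):

library(ShortRead)

primer <- DNAString("GTGYCAGCMGCCGCGGTAA")        # 16Sbac-515F
reads <- sread(readFastq("sample_R1.fastq.gz"))   # hypothetical file name
exact <- sum(vcountPattern(primer, reads, fixed = FALSE) > 0)
one_sub <- sum(vcountPattern(primer, reads, max.mismatch = 1, fixed = FALSE) > 0)
c(exact = exact, within_one_substitution = one_sub)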
Hi @benjjneb,
Thank you so much for the incredible help you've been providing to everyone. I am currently looking for the best filtering parameters for my data and have been running tests on a few samples. I was a bit surprised to get good merging even though I realised afterwards that my parameters should not have allowed any merging. Or am I calculating the overlap length wrong? Could you please check, and maybe explain why the merging isn't dropping as it should?
Primers used: 16S bacteria (Parada et al. 2016), universal V4-V5, 411 bp amplicon. 16Sbac-515F GTGYCAGCMGCCGCGGTAA; 16Sbac-926R CCGYCAATTYMTTTRAGTTT
Example of quality profiles (reverse reads):
Parameters I chose:
Thanks a lot for your help. Gaëlle