benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

No reads overlap and yet good merging in Dada2 #1651

Closed QGaelle closed 5 months ago

QGaelle commented 1 year ago

Hi @benjjneb,

Thank you so much for the incredible help you've been providing to everyone. I am currently looking for the best filtering parameters for my data and have been running some tests on a few samples. I was a bit surprised to get good merging even though I realised afterwards that my parameters should not allow any merging. Or am I calculating the overlap length wrong? Could you please check, and maybe explain why the merging rate isn't dropping as it should?

Primers used: 16S Bacteria (Parada et al. 2016), 411 bp amplicon (universal V4-V5): 16Sbac-515F GTGYCAGCMGCCGCGGTAA, 16Sbac-926R CCGYCAATTYMTTTRAGTTT

Example of quality profiles (reverse reads):

[Screenshot: reverse-read quality profile, 2022-12-02]

Parameters I chose:

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(240,190),
                     maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, matchIDs = TRUE) 

> head(track)
             input filtered denoisedF denoisedR merged nonchim
ERM-1        32797    31037     30287     30385  26353   26287
ERM-2        35275    33200     32301     32451  27209   26020
ETS-0221-375 20754    20058     19900     19821  14590   14575
ETS-0221-376 32633    31359     31089     30982  21531   21490
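
For reference, a tracking table like this is usually assembled with the standard DADA2 tutorial pattern; the objects dadaFs, dadaRs, mergers and seqtab.nochim are assumed from that workflow and are not shown in this thread:

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names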

Thanks a lot for your help. Gaëlle

benjjneb commented 1 year ago

We recommend that the truncLen values of your forward and reverse reads add up to the length of the sequenced amplicon plus 20 bp. Your parameters put you basically at that point: a total truncLen of 430 against an amplicon length of 411.
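
As a rough sketch of that arithmetic (the 411 bp amplicon length, primers included, is taken from the first post; mergePairs() needs at least 12 bp of overlap by default, minOverlap = 12):

# Expected overlap = truncLen[1] + truncLen[2] - amplicon length.
# trimLeft cancels out of this calculation: it shortens the reads and the
# primer-free amplicon by the same amount.
amplicon_len <- 411   # 515F-926R amplicon, primers included
overlap <- function(truncF, truncR) truncF + truncR - amplicon_len
overlap(240, 190)   # 19 bp of expected overlap -> marginal, but merging can work
overlap(240, 150)   # -21 bp -> no overlap, merging should fail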

QGaelle commented 1 year ago

Hi @benjjneb, thank you for your time. After reading your answer, I realised that I got confused during my multiple tests. I thought I had good merging with truncLen=c(240,150), not with truncLen=c(240,190) as posted. Hence, I could not understand why merging was fine when, obviously, if I cut at 240 and 150 there isn't any overlap left; am I correct? I ran again with truncLen=c(240,150) and merging is indeed very bad. Sorry about that, and thank you for always replying so quickly.

benjjneb commented 1 year ago

if I cut at 240 and 150 then there isn't any overlapping anymore, am I correct?

Yep that's right, so your results make sense as far as I can tell.

QGaelle commented 1 year ago

Perfect, thanks a lot. May I ask one last question? I guess there is always the trade-off between truncLen and merging. What percentage of loss associated with the merging step is acceptable?

If I have the following quality plots: [forward-read quality plots] [reverse-read quality plots]

Considering these plots, I tried this set of parameters.

out_1 <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(225,205),
+                         maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
+                         compress=TRUE, multithread=TRUE)

> head(track)
            input filtered denoisedF denoisedR merged nonchim
ETS-1220-02 70176    67179     66962     66658  46807   46560
PIE1-3u     54641    47207     45771     45386  38140   36641
RUN21-GRB-1 41661    38953     38589     38473  30845   30678

Is the percentage of loss associated with the merging step acceptable? Would it be better to cut more, or to try to increase the percentage of reads merging?

Thanks Gaëlle

benjjneb commented 1 year ago

In general, you should be looking for 90%, or at least 80%, of reads to pass merging. You aren't seeing anything close to that, so it is a concern.

Is there length variation in this locus, some of which is being lost by trying to "optimize" truncLen to be at a minimum compared to your expected lengths? That kind of optimization is almost never the right choice. Keep biological length variation, and accept more losses at filtering.
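
One way to check a run against that 80-90% guideline is to compute the per-sample merge rate directly from the tracking table, e.g. for the track matrix shown above:

merge_rate <- round(100 * track[, "merged"] / track[, "denoisedF"], 1)
merge_rate   # roughly 70-83% for the truncLen=c(225,205) run above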

QGaelle commented 1 year ago

Hi @benjjneb,

Thank you for your last comment. I have been reading a lot these last couple of days and the numerous threads you contributed to have been very useful. Some doubts on a few aspects remain and I would like your opinion on them.

After your last comment, I tried different parameters to see how much they changed the loss at the filtering and merging steps and, to be honest, I am a bit confused. I used the same data subset as in my question just above, where I show the quality plots.

out_2 <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, trimLeft=c(19,20), truncLen=c(225,225),
+                        maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
+                        compress=TRUE, multithread=TRUE)

> head(track)
            input filtered denoisedF denoisedR merged nonchim
ETS-1220-02 25609    24979     24746     24726  11097   11090
PIE1-3u     27157    24327     24295     23849  10086   10083
RUN21-GRB-1 22087    21156     20794     20542  18885   18631

In comparison to when I used truncLen=c(225,205) (see above), the percentage loss at the merging step is even larger, while the loss at the filtering step hasn't really changed. I was expecting better merging. I don't think I can cut more than the previous truncLen=c(225,205) or I won't get any overlap anymore? Sorry if that is a trivial question, but could you maybe explain why the merging is worse?
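
One way to see why pairs are rejected (a sketch, assuming the usual dadaFs/derepFs objects from the tutorial workflow for one sample) is to keep the rejected pairs and inspect them:

# Rejected pairs are kept with accept=FALSE; small nmatch means too little
# overlap, large nmismatch means errors in the overlap region.
mm <- mergePairs(dadaFs[[1]], derepFs[[1]], dadaRs[[1]], derepRs[[1]],
                 returnRejects = TRUE, verbose = TRUE)
table(mm$accept)
summary(mm[!mm$accept, c("nmatch", "nmismatch")])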

In your comment, you were also asking whether there was length variation in this locus. If I take a look at the fastq files of one of my samples, for example (see photo below), and check read lengths at random, it appears that some reads are 250 nucleotides long, others 251 or 252. Is that what you meant?

The highlighted part of the sequence on the photo below shows the presence of the primer: 16Sbac-515F GTGYCAGCMGCCGCGGTAA

In my parameters, I used trimLeft=c(19,20) because my primer appears to be 19 nucleotides long, but according to the fastq files, to remove the primer entirely I should remove 20 or 21, sometimes 22, nucleotides. Is that a problem? Does this mean that the reads won't have the same starting point after trimming left? How many should I trim then?

[Screenshot, 2022-12-16: fastq reads with the 515F primer highlighted]

Sorry, that is a lot of questions. I am trying to be very meticulous about these first steps. I have 4 different runs that I will merge later and I want to do the first steps right for each run.

Thanks a lot in advance for your help, Gaëlle

benjjneb commented 1 year ago

In my parameters, I used trimLeft=c(19,20) because my primer appears to be 19 nucleotides long, but according to the fastq files, to remove the primer entirely I should remove 20 or 21, sometimes 22, nucleotides. Is that a problem?

Yes, this is a major problem. DADA2 assumes that your reads all start at a consistent position. What may be going on here is that your sequencing provider used "heterogeneity spacers", i.e. variable-length pads at the start of the reads. This is sometimes done to address the low-base-complexity issue of amplicon sequencing, which interferes with Illumina's base-calling calibration algorithm. (However, just adding 5-10% phiX works just as well, and is recommended because the data is simpler to analyze afterwards.)

What I would recommend is inspecting a set of about 50 forward reads: identify the 515F primer and record the length of the "pad" before the primer starts. If that pad is variable in length, you will need to find an external solution to remove the primers, so that when you enter DADA2 the reads (both forward and reverse) all begin at the same position.
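
A minimal sketch of that inspection in R, using ShortRead/Biostrings (the file name is a placeholder; fixed = FALSE lets the IUPAC codes Y and M in the primer match their corresponding bases):

library(ShortRead)    # readFastq(), sread()
library(Biostrings)   # vmatchPattern()

primer <- "GTGYCAGCMGCCGCGGTAA"                         # 515F
reads  <- sread(readFastq("sample_R1.fastq.gz"))[1:50]  # first 50 forward reads

# Start position of the primer in each read; the pad is whatever comes before it.
hits <- vmatchPattern(primer, reads, fixed = FALSE)
pad_length <- sapply(startIndex(hits), function(s) if (length(s)) s[1] - 1 else NA)
table(pad_length, useNA = "ifany")   # variable pad lengths => remove primers externally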

QGaelle commented 1 year ago

Hi @benjjneb,

Getting back to this thread after the holiday break. Thank you for your last comment. Following your suggestion, I inspected about 50 forward reads and found that the length of the pad before the primer starts varied between 1 and 3 nucleotides. To remove the heterogeneity spacers (HS), I ran the Trimming_HS_Metagenomics script, which worked great. I used the non-variable part of my primer 16Sbac-515F GTGYCAGCMGCCGCGGTAA, and as a result all the bases before the indicated part of the primer were removed. I then just have to use trimLeft in DADA2 to remove the remaining bases of the primer. So this part is OK, I guess?

However, looking at the output file after using the Trimming_HS script, I noticed that some reads were removed completely because the primer was not found. The highlighted part of the sequence on the photo below shows the presence of the primer 16Sbac-515F GTGYCAGCMGCCGCGGTAA, and you can see that the fourth read is not highlighted. It looks like a nucleotide is missing: GTGYCAGCMGCGCGGTAA instead of GTGYCAGCMGCCGCGGTAA.

[Screenshot: original fastq file (FastQ_original)]

[Screenshot: Trimming_HS output showing the number of nucleotides removed before the highlighted primer (Output_Trim)]

I blasted the sequence and it appears to be a valid sequence, relevant to my dataset. In this case there is a missing nucleotide, but there are probably many other reasons why the primer might not be found. Is it worrisome? Should I just keep going even if several reads will be removed? This isn't an isolated situation; there are several similar cases among my fastq files.

Thank you so much in advance, Gaëlle

benjjneb commented 1 year ago

I blasted the sequence and it appears to be a valid sequence, relevant to my dataset. In this case there is a missing nucleotide, but there are probably many other reasons why the primer might not be found. Is it worrisome? Should I just keep going even if several reads will be removed? This isn't an isolated situation; there are several similar cases among my fastq files.

Losing some reads because the primer is mis-sequenced is not a problem, assuming that the mis-sequencing is random across your reads.
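
To gauge how many reads are affected, one could count primer hits with and without allowing a substitution; a rough sketch with a placeholder file name (note that a deleted base, as in the example above, would still not match):

library(ShortRead)
library(Biostrings)

primer <- "GTGYCAGCMGCCGCGGTAA"
reads  <- sread(readFastq("sample_R1.fastq.gz"))

found   <- vcountPattern(primer, reads, fixed = FALSE) > 0
relaxed <- vcountPattern(primer, reads, max.mismatch = 1, fixed = FALSE) > 0
mean(found)     # fraction of reads with an exact primer hit
mean(relaxed)   # allowing one substitution
# If only a small, random fraction of reads lack the primer, dropping them
# should not bias the downstream results.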