benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Losing a high percentage of reads after merging AND after chimera removal. #1857

Closed NoGaud closed 4 months ago

NoGaud commented 10 months ago

Hello,

After several days of going through all the issues here that seemed pertinent, I think I need to ask for more guidance on my problem.

In my tracking table, I lose about 60% of my reads at the merging step, then around 40% of the merged reads at the chimera removal step.

The sequencing was Illumina 2x300 on the V3-V4 region of the 16S rRNA gene. I use the primers 341F and 805R (so an amplicon of around 464 bp is expected). There are usually no heterogeneity spacers with this company.

So my filterAndTrim command is:

out.bact1 <- filterAndTrim(fnFs.bact1, filtFs.bact1, fnRs.bact1, filtRs.bact1, truncLen=c(285,235), maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE, compress=TRUE, trimLeft=c(17,21), multithread=TRUE)

The quality of my files is pretty good too. Forward:

[Screenshot: forward read quality profile, 2023-11-28]

Reverse:

[Screenshot: reverse read quality profile, 2023-11-28]

The error plots don't show any red flags:

[Screenshot: error rate plots, 2023-11-30]

Here is my tracking table: track.bact.1.txt
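For context, the tracking table is built the usual tutorial way (a sketch; the object names below are the standard tutorial placeholders, not necessarily my exact ones):

# Sketch of the standard read-tracking table from the dada2 tutorial
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names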

So, first of all, I removed my primers with trimLeft, and I checked that they were really removed using the primer check from the ITS pipeline, to be sure:

[Screenshots: ITS-workflow primer hit counts after trimming, 2023-11-30]

(I saw somewhere that around 200 reads still containing primers was low enough not to interfere with the analyses, so I assume 33 and 66 are okay too.)
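The check itself follows the ITS workflow's primer-counting approach, roughly like this (a sketch, assuming the standard 341F/805R primer sequences):

# Rough sketch of the ITS-workflow primer check on the filtered files
library(ShortRead)
library(Biostrings)
primerHits <- function(primer, fn) {
    # count reads in which the primer matches (ambiguity codes allowed via fixed = FALSE)
    nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    sum(nhits > 0)
}
primerHits("CCTACGGGNGGCWGCAG", filtFs.bact1[[1]])        # 341F on filtered forward reads
primerHits("GACTACHVGGGTATCTAATCC", filtRs.bact1[[1]])    # 805R on filtered reverse reads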

My overlap should be around 13 to 18 bp (so >12) with the truncation settings that get the most reads through the pipeline (truncLen = c(285,235) or c(280,235)). I tried settings with a bigger overlap (truncLen = c(290,250)) to see whether it would magically let a lot more reads through, but it didn't; it only lost more reads. I read about the bimodal length distribution for V3-V4 reads, and I can see both of those lengths in my data with the getSequences function (around 403 and 427, right?). Raising maxEE isn't really worth it and doesn't change the number of reads much either.

I'm wondering if there's something I missed, because I actually have two sequencing runs to analyze and the tracking tables are very similar for both of them. I know a single run can just be weird, but for both of them to be similarly "bad" would be a rare coincidence? (One run is from June 2023, the other from November 2023.)

So: is there anything else I could have missed, or anything more I can look at to lose a smaller percentage of reads in the process?

Thanks a lot for your work

salix-d commented 10 months ago

Just by looking at the quality plots, I think I would've gone for truncLen=c(250,200). The way I was taught is to try to cut before the big drop in quality, unless that prevents the overlap entirely.

NoGaud commented 10 months ago

Hello, thanks for your input, but sadly that prevents the overlap in most samples (the best sample has 25 reads left after nochim). With truncLen at c(285,235), the overlap is already only around 18 if I calculated it right.

(If, for the 464 bp V3-V4 amplicon, 300 bp sequenced in both directions gives an overlap of 136 bp to start with, and I then cut the primers, that leaves 136 - 17 - 21 = 98 bp. So I only have around 86 bp at most that I can truncate if I want to keep an overlap over 12. truncLen = c(250,200) would cut 150 bp.) :(

benjjneb commented 10 months ago

The removal of the primers doesn't affect the overlap, as that is occurring at the ends of the amplicon, not in the middle region where the overlap happens. See how removing the primers in the crude schematic below doesn't impact the overlap.

R1: FWDPRIM--------------------------------|
R2:                         |---------------------------------REVPRIM
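Put as arithmetic (a minimal sketch, taking the ~464 bp figure above as the primer-inclusive amplicon length):

# Sketch: expected overlap after truncation; trimLeft cancels out and does not appear
amplicon_len <- 464                  # approximate V3-V4 amplicon length, primers included
truncLen     <- c(285, 235)
overlap      <- sum(truncLen) - amplicon_len
overlap                              # ~56 bp, comfortably above the ~12 bp minimum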
NoGaud commented 10 months ago

Oh, thanks for the info! I tried truncLen = c(270,205) to cut more (and so have a smaller overlap). I get more reads through, but I still have big drops at the merging and nochim steps.

track.bact.1.2.txt

NoGaud commented 9 months ago

Hello, it's still me. In the meantime I read that I needed to combine my two sequencing runs after merging, but before chimera removal. In the end I get a result like this:

Identified 44376 bimeras out of 72470 input sequences.
dim(seq.tab.nochim.bact2)
[1]   140 28094
sum(seqtab.nochim)/sum(seqtab)
[1] 0.6521404
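(For reference, the run-combining step follows the usual dada2 pattern, roughly like this, with placeholder object names for my two runs:)

# Sketch of combining two runs before chimera removal (object names are placeholders)
seqtab.run1 <- makeSequenceTable(mergers.run1)
seqtab.run2 <- makeSequenceTable(mergers.run2)
seqtab <- mergeSequenceTables(seqtab.run1, seqtab.run2)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)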

Since my primers are okay, what else could be the cause of that many chimeras?

If it can help, here's the result of the sequence length inspection:

# Inspect distribution of sequence lengths
table(nchar(getSequences(seq.tab.bact2)))

[Screenshot: sequence length distribution table, 2023-12-07]

benjjneb commented 9 months ago

That chimera read fraction is high enough to cause concern.

I'm not sure. How certain is "There are usually no heterogeneity spacers with this company"? Have you verified that on this dataset? Looking at a fastq file can help a lot. Literally opening it up and looking at the first dozen or so entries.
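For example, something along these lines works in R (just a sketch; the file path is a placeholder):

# Sketch: eyeball the first dozen records of a raw fastq file
library(ShortRead)
fq <- readFastq("run1_sample1_R1.fastq.gz")   # placeholder path to a raw (untrimmed) file
sread(fq)[1:12]                               # spacers/primers would show up at the start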

NoGaud commented 9 months ago

Hello,

I tried to contact them to find out for sure. For now, when I look at my fastq files, the sequence right after the primers is not always the same. (Would the same sequence at this position be a heterogeneity spacer? I'm not familiar with them or how they work.)

Here are some screenshots to be sure; the selected text begins right after the primers.

[Screenshots: fastq entries with the selected text highlighted, 2023-12-08]
benjjneb commented 8 months ago

Sorry for the late response, I lost this over the holidays.

I see these 7-mers highlighted in your fastq files, but what does that mean? Are those part of the primers? What part? What are the primers? What does it mean that some of the sequences don't have those 7-mers?