Losing reads after merge and chimera steps

benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution

http://benjjneb.github.io/dada2/

GNU Lesser General Public License v3.0


Losing reads after merge and chimera steps #830

Closed Nuisance33 closed 4 years ago

Nuisance33 commented 5 years ago

Hello, I am losing a large portion of my reads at the merge and chimera removal steps. The primers have been removed. My reads are from the V3-V4 region. My quality profiles look good so I might not need to use truncLen, but even so I still lose a large number of reads.

Does anyone know where I can find out more about these parameters and adjustments I can make when filtering so that I get better merging? Also, what is a reasonable amount of reads to lose during these steps? Many thanks!!

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,240),
maxN=0, maxEE=c(2,5), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=FALSE)

          input filtered denoisedF denoisedR merged nonchim
LT01-UTC1 33402    30885     25766     26718  14308    9825
LT02-UTC2 33644    30914     26203     27133  14140   10400
LT03-UTC3 31309    28928     24428     25508  13610    9459
LT04-UKE1 34064    31021     25477     26745  11771    8200
LT05-UKE2 33502    30723     25232     26370  13334    7384
LT06-UKE3 32618    29826     24384     25183  11556    7663
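To put the question about "a reasonable amount of reads to lose" in concrete terms, the counts in the table above can be converted to percentages. The following is a minimal base R sketch that uses only the numbers posted in this comment; the object names (`track`, `pct_of_input`, `step_retention`) are illustrative and not part of dada2.

```r
# Read-tracking counts copied from the table posted above.
track <- data.frame(
  row.names = c("LT01-UTC1", "LT02-UTC2", "LT03-UTC3",
                "LT04-UKE1", "LT05-UKE2", "LT06-UKE3"),
  input     = c(33402, 33644, 31309, 34064, 33502, 32618),
  filtered  = c(30885, 30914, 28928, 31021, 30723, 29826),
  denoisedF = c(25766, 26203, 24428, 25477, 25232, 24384),
  denoisedR = c(26718, 27133, 25508, 26745, 26370, 25183),
  merged    = c(14308, 14140, 13610, 11771, 13334, 11556),
  nonchim   = c(9825, 10400, 9459, 8200, 7384, 7663)
)

# Percentage of the input reads that survive each step (row-wise division).
pct_of_input <- round(100 * sweep(as.matrix(track), 1, track$input, "/"), 1)

# Step-to-step retention for the two steps the question is about.
step_retention <- data.frame(
  row.names           = rownames(track),
  merged_vs_denoisedF = round(100 * track$merged / track$denoisedF, 1),
  nonchim_vs_merged   = round(100 * track$nonchim / track$merged, 1)
)

print(pct_of_input)
print(step_retention)
```

For LT01-UTC1, for example, about 43% of the input reads survive merging and about 29% survive chimera removal, so the large drops happen at exactly the two steps named in the question.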

benjjneb commented 5 years ago

truncLen=c(240,240)

What is the exact primer set you are using? Were primers removed from the reads prior to filtering with dada2 (i.e. by an external program)?

Nuisance33 commented 5 years ago

The company that did the sequencing told me that the primers have already been trimmed prior to me receiving them. I can reach out to them again to try and get the exact sequences to make sure.

benjjneb commented 5 years ago

I can reach out to them again to try and get the exact sequences to make sure.

I'd recommend that. Merging issues are almost always caused by a mismatch between the filtering parameters and the actual length of the sequenced amplicon. Knowing exactly what the primers are, and double-checking whether they were removed (or not sequenced), is the best way to identify whether that is truly what's going on here.

You might also ask them what the expected length distribution of their sequenced amplicon is. They should know that (but no guarantees).

Nuisance33 commented 5 years ago

Ah they did have the length distribution. Most are around 450 bp. Do you have a recommendation of what settings I should use?

[Attached image: final_len_distribution]

benjjneb commented 5 years ago

This looks like Illumina V3V4, in which case your truncation lengths should be long enough. Can you try extending them a bit (say to truncLen=c(245,245)) and post the new read stats?
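The arithmetic behind "long enough" can be sketched as follows. This assumes the roughly 450 bp amplicon length reported earlier in the thread and the default `minOverlap = 12` of dada2's `mergePairs`; `expected_overlap` is a hypothetical helper written for illustration, not a dada2 function. Whether the 450 bp figure includes the primers is not stated in the thread, so the numbers are only indicative.

```r
# Hypothetical helper (not part of dada2): number of bases by which the
# truncated forward and reverse reads overlap for a given amplicon length.
expected_overlap <- function(trunc_f, trunc_r, amplicon_len) {
  trunc_f + trunc_r - amplicon_len
}

amplicon_len <- 450  # approximate length reported by the sequencing provider
min_overlap  <- 12   # default minOverlap of dada2::mergePairs

# Compare the two truncLen settings tried in this thread.
for (trunc in list(c(240, 240), c(245, 245))) {
  ov <- expected_overlap(trunc[1], trunc[2], amplicon_len)
  cat(sprintf("truncLen=c(%d,%d): overlap %d nt, enough to merge: %s\n",
              trunc[1], trunc[2], ov, ov >= min_overlap))
}
```

Under these assumptions both settings leave more than the minimum overlap, which is consistent with the remark that the truncation lengths alone should not explain the loss at the merge step.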

Is this Illumina 2x250? I would double-check whether the primers have been removed. I suspect they haven't, if the reads you are getting are 250nts long (or 300nts long).
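One way to double-check for primers is to count how many reads begin with the primer sequence. The sketch below uses base R only. The primer shown (341F, `CCTACGGGNGGCWGCAG`) is an assumption: it is a commonly used V3-V4 forward primer, but the thread never states which primers were actually used. All function names are hypothetical helpers, not dada2 functions.

```r
# IUPAC ambiguity codes mapped to regular-expression character classes.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           R = "[AG]", Y = "[CT]", S = "[CG]", W = "[AT]",
           K = "[GT]", M = "[AC]", B = "[CGT]", D = "[AGT]",
           H = "[ACT]", V = "[ACG]", N = "[ACGT]")

# Turn a primer containing ambiguity codes into a regular expression.
primer_to_regex <- function(primer) {
  paste(iupac[strsplit(primer, "")[[1]]], collapse = "")
}

# Fraction of sequences that begin with the primer.
frac_starting_with <- function(seqs, primer) {
  mean(grepl(paste0("^", primer_to_regex(primer)), seqs))
}

# Read the sequence lines of a (optionally gzipped) FASTQ file:
# every 4th line, starting at line 2, for the first n_reads records.
read_fastq_seqs <- function(path, n_reads = 10000) {
  con <- gzfile(path, "r")
  on.exit(close(con))
  lines <- readLines(con, n = 4 * n_reads)
  if (length(lines) < 2) return(character(0))
  lines[seq(2, length(lines), by = 4)]
}

# Assumed primer (not confirmed anywhere in this thread).
fwd_primer <- "CCTACGGGNGGCWGCAG"

# Toy input: the first sequence starts with the primer, the second does not.
toy <- c("CCTACGGGAGGCAGCAGTGGGGAAT", "TGGGGAATATTGCACAATGG")
print(frac_starting_with(toy, fwd_primer))

# On real data one might run, for example:
# seqs <- read_fastq_seqs("LT01-UTC1_R1_001.fastq.gz")
# frac_starting_with(seqs, fwd_primer)
```

A fraction near 1 would suggest the primers are still on the reads, and a fraction near 0 would suggest they were removed or never sequenced. An exact match at position 1 is a deliberately strict test, so reads with sequencing errors in the primer, or with extra bases in front of it, are not counted.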

Nuisance33 commented 5 years ago

Yes V3V4 region and Illumina 2x250 is correct. The reads I am getting are 250 bp.

I don't see the primers when I open the fastq files and I tried using Cutadapt to remove the primers, but it doesn't seem like they were there. Tried again extending the non-truncated length, but still losing most of my reads.

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(245,245),
maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=FALSE)

          input filtered denoisedF denoisedR merged nonchim
LT01-UTC1 33402    29438     23769     25421  11556    7695
LT02-UTC2 33644    29364     24446     25606  12087    9153
LT03-UTC3 31309    27617     22946     24193  11830    8432
LT04-UKE1 34064    29317     23437     25018  10563    7190
LT05-UKE2 33502    29137     23377     24980  11021    6249
LT06-UKE3 32618    28527     22676     23971  10184    7087

benjjneb commented 5 years ago

I don't know.

Would you be able to send me one sample (forward and reverse fastqs) to look at?

Nuisance33 commented 5 years ago

Absolutely! See attachment

Thank you so much :-)

Primer set: Read1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Read2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

LT01-UTC1_R2_001.fastq.gz LT01-UTC1_R1_001.fastq.gz

benjjneb commented 5 years ago

Is it possible that heterogeneity spacers were used to generate this data? What I am seeing is that there are variants of each real ASV that differ in the first few nts, and are shifted by 1-3 nts relative to one another. This pattern is associated with heterogeneity spacers, which are sometimes introduced to make amplicon libraries more diverse at each position.
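The shifted-variant pattern described above can be illustrated with a small base R sketch. `shift_between` is a hypothetical helper, not a dada2 function, and the sequences are made up for the example. It reports the smallest shift (0 to 3 nts) at which one sequence, with its first few bases dropped, matches the start of the other.

```r
# Hypothetical helper: smallest k in 0..max_shift such that dropping the first
# k bases of one sequence makes it agree with the start of the other over at
# least min_match bases. Returns NA if no such shift exists.
shift_between <- function(a, b, max_shift = 3, min_match = 20) {
  for (k in 0:max_shift) {
    # Try dropping bases from a, then from b.
    for (pair in list(c(a, b), c(b, a))) {
      shifted <- substring(pair[1], k + 1)
      n <- min(nchar(shifted), nchar(pair[2]))
      if (n >= min_match &&
          substr(shifted, 1, n) == substr(pair[2], 1, n)) {
        return(k)
      }
    }
  }
  NA_integer_
}

# Made-up example: the second sequence carries a 2-nt spacer at its start.
core        <- "ACGTTGCAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCC"
with_spacer <- paste0("TG", core)

print(shift_between(core, with_spacer))
```

If many pairs of inferred sequence variants in a sample are related by a small non-zero shift like this, that would be consistent with heterogeneity spacers. The check is an exact-match test, so it is only a rough screen and misses pairs that also differ by a substitution.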