benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Large amount of ASVs and large amount of read loss during chimera removal #1865

Closed: robbueck closed this issue 4 months ago

robbueck commented 10 months ago

Hi, I'm running dada2 on a dataset of 7,000 human gut samples for the V3-V4 region using primers 341F/785R (published here: https://www.thelancet.com/journals/ebiom/article/PIIS2352-3964(23)00260-8/fulltext). There was a substantial amount of adapter content in many samples, which I removed with trimmomatic. I'm observing low quality at the start of the reads, which I remove with trimLeft = 40. I trim everything to the same length with truncLen=245.

[figure: read quality profiles]

After merging, I obtain 11M ASVs, of which 90% have a read count of 10 or less. I remove those low-count ASVs before chimera removal. During chimera removal, I then lose around 50% of the reads. Here is what I'm losing at each step relative to the previous step (denoising includes merging):

[figure: per-step read loss]
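For reference, this is how the chimera-loss fraction is measured (a minimal sketch; `seqtab` stands for the merged sequence table from `makeSequenceTable`):

```r
library(dada2)
# seqtab: merged ASV table from makeSequenceTable(mergers) (assumed)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
# Fraction of total reads kept after chimera removal -- here only ~50%
sum(seqtab.nochim) / sum(seqtab)
```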

As I trim 40 bp at the beginning of the reads, I assume I should not have any primer content anymore, and adapters were also removed, so I'm not sure where these high amounts of chimeric reads could come from. Does anyone have any idea?

benjjneb commented 10 months ago

I'm observing a low quality at the start of the reads, which I remove with trimLeft = 40.

This is not recommended. trimLeft should be used to remove primers at the start of the sequences, i.e. trimLeft=c(FWD_PRIMER_LENGTH, REV_PRIMER_LENGTH).
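As a concrete sketch (the 17 and 21 nt primer lengths assume the standard 341F/785R sequences; file paths are placeholders):

```r
library(dada2)
# Placeholder paths to the raw, primer-containing fastq files
fnFs <- sort(list.files("raw", pattern = "_R1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files("raw", pattern = "_R2.fastq.gz", full.names = TRUE))
filtFs <- file.path("filtered", basename(fnFs))
filtRs <- file.path("filtered", basename(fnRs))

# trimLeft removes exactly the primers, not an arbitrary 40 bp
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     trimLeft = c(17, 21),   # 341F = 17 nt, 785R = 21 nt (assumed)
                     truncLen = c(245, 245), # revisit per direction, see below
                     maxEE = c(2, 2), truncQ = 2, maxN = 0,
                     rm.phix = TRUE, compress = TRUE, multithread = TRUE)
```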

I trim everything to the same length with truncLen=245.

Again, I would not recommend having the same truncLen for forward and reverse reads. Instead I would inspect quality profiles for forward and reverse reads separately, and choose appropriate truncLen for each. There will be limited ability to vary truncation lengths due to the need to maintain overlap with this primer set.
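A sketch of that inspection (same placeholder paths as above):

```r
# Inspect forward and reverse quality profiles separately
plotQualityProfile(fnFs[1:4])
plotQualityProfile(fnRs[1:4])
# Reverse reads usually degrade earlier, so truncLen is typically asymmetric,
# e.g. truncLen = c(245, 225) -- placeholder values only; the pair must still
# leave enough overlap for merging (mergePairs default minOverlap = 12)
```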

a substantial amount of adapter content in many samples, which I removed with trimmomatic

How are you using trimmomatic? Are you using it to remove reads with "adapter content"? Or are you using it to trim reads to remove adapters (and maybe primers too)? If you are using it in that second way, it can cause major problems if it introduces variation in length and/or starting position into the post-trimmomatic reads.

In many cases, trying to use trimmomatic causes more trouble for downstream processing in the dada2 pipeline than simply using the standard filterAndTrim approach (i.e. trimLeft + truncLen).

I'm not sure, where these high amounts of chimeric reads could come from. Does anyone have any idea here?

You are right that this is too high a loss at the chimera step; something else is going on here. I would recommend revisiting your processing workflow by removing trimmomatic, revisiting filterAndTrim as per above, and re-assessing read fates through the pipeline. You don't need to use all 7k samples for this, just a small number for testing.
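Read fates are usually re-assessed with the standard tracking table from the dada2 tutorial (a sketch; `out`, `dadaFs`, `dadaRs`, `mergers`, `seqtab.nochim`, and `sample.names` are the usual intermediate objects):

```r
getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN),
               rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR",
                     "merged", "nonchim")
rownames(track) <- sample.names
head(track)
```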

If that doesn't largely resolve the issue, then it is worth exploring other, more unusual possibilities (e.g. heterogeneity spacers, mixed-length primers, large-scale off-target amplification, etc.).

robbueck commented 9 months ago

Thanks a lot for your answer. Setting trimLeft to the primer lengths slightly increases the number of chimeras and the read loss during filtering, but the effect is not very strong.

There will be limited ability to vary truncation lengths due to the need to maintain overlap with this primer set.

I chose the same value for truncLen to trim all reads to 245, as this is the largest common read length across all samples. If I increase truncLen for either forward or reverse reads, I lose 2/3 of my samples, but the loss of reads during chimera filtering is also reduced.

[figure: read tracking under different truncLen settings]
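For reference, the overlap budget with this primer set is tight (a back-of-envelope sketch; the ~465 bp amplicon size and 17/21 nt primer lengths are assumptions, and dada2's filtered read length is truncLen - trimLeft):

```r
amplicon <- 465 - 17 - 21     # biological sequence after primer removal, ~427 nt (assumed)
fwd_len  <- 245 - 17          # filtered forward read length: truncLen - trimLeft
rev_len  <- 245 - 21          # filtered reverse read length
fwd_len + rev_len - amplicon  # expected overlap ~25 nt; mergePairs needs >= 12 by default
```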

How are you using trimmomatic? Are you using it to remove reads with "adapter content"? Or are you using it to trim reads to remove adapters (and maybe primers too)? If you are using it in that second way, it can cause major problems if it introduces variation in length and/or starting position into the post-trimmomatic reads.

I'm using trimmomatic to remove adapter sequences. As the adapter content is quite dramatic, it mostly removes whole reads. The majority of the trimmed reads are either 300 or 250 bp, the same as the original reads. I'm also wondering if those samples with an adapter content of >80% are usable at all, or if removing the majority of the reads would also disturb most biological signals.

[figure: FastQC adapter content plot]

benjjneb commented 9 months ago

I'm also wondering if those samples with an adapter content of > 80% are usable

Yes.

Overall, according to that FastQC plot, this data has a problematically high level of adapter contamination. Amplicon sequencing of V3-V4 should not be producing that many sequencing reads that include adapters, if things worked right.

I'm not sure, where these high amounts of chimeric reads could come from. Does anyone have any idea here?

I would reorient my concern towards the fundamental integrity of this data. The first step would be to simplify by dropping the reverse reads. The second step would be to remove trimmomatic from the workflow, and then manually inspect the R1 reads to see if they are as expected (FWD primer at the start, always). Then decisions can be made from there.
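A sketch of that manual R1 primer check (the primerHits pattern from the dada2 ITS workflow; the 341F sequence is an assumption, and fnFs is the placeholder file list from the earlier sketch):

```r
library(ShortRead)   # readFastq, sread
library(Biostrings)  # vcountPattern

FWD <- "CCTACGGGNGGCWGCAG"  # assumed standard 341F primer sequence

# Count reads in a fastq file containing the primer (fixed = FALSE honors IUPAC codes)
primerHits <- function(primer, fn) {
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  sum(nhits > 0)
}

# Every untrimmed R1 should contain FWD at its start; a low fraction is a red flag
primerHits(FWD, fnFs[1]) / length(readFastq(fnFs[1]))
```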

The purpose of microbiome sequencing is to obtain an informative description of the real microbial community. Usually this means estimating parameters such as which microbes are present and what proportion of the community they make up. Keeping bad reads and bad data in the sequence table is not a purpose, and is often at cross-purposes to the real goals.