FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
Incomplete trimming of Illumina adapters #119

Closed ag1805x closed 3 years ago

ag1805x commented 3 years ago

I am trying to trim WGBS data using TrimGalore, but with default parameters it does not trim all occurrences of the Illumina adapter (AGATCGGAAGAGC). FastQC after trimming still shows TruSeq adapter presence (mostly starting with GATCGGAAGAGC...).

I tried changing the -e parameter and here are my observations:

-e 0.1 (default): not all adapters removed
-e 0.05: adapter retention high (lower performance than 0.1)
-e 0.5: adapters removed, but total sequences halved and number of duplicated sequences increased

Should --stringency be set to a low value (e.g. 1) and -e to a high value (e.g. 1)? Is there any other parameter that could be adjusted to solve this issue?

Using -a GATCGGAAGAGC improved performance but in some cases adapters still remain.
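
The -e observations above are consistent with how the error budget scales with adapter length. Here is a minimal sketch, assuming Cutadapt's documented rule that the number of allowed errors is the matched length multiplied by the maximum error rate, rounded down. The function name `allowed_errors` is illustrative and not part of Cutadapt or Trim Galore.

```python
import math

# 13 bp Illumina adapter sequence quoted in this issue
ADAPTER = "AGATCGGAAGAGC"


def allowed_errors(match_length: int, error_rate: float) -> int:
    """Errors tolerated for a match of this length (assumed rule: round down)."""
    return math.floor(match_length * error_rate)


if __name__ == "__main__":
    # The three -e values tried in the issue
    for rate in (0.05, 0.1, 0.5):
        print(f"-e {rate}: up to {allowed_errors(len(ADAPTER), rate)} error(s) "
              f"over a full {len(ADAPTER)} bp match")
```

Under that assumption, a full-length 13 bp match tolerates 0, 1 and 6 errors for -e 0.05, 0.1 and 0.5 respectively. That would fit the report that 0.05 is stricter than the default, while 0.5 is loose enough to trim largely unrelated sequence.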

FelixKrueger commented 3 years ago

Hi @ag1805x

Trim Galore is intended to identify and remove read-through adapter contamination, which in your case is AGATCGGAAGAGC. TruSeq adapters or adapter dimers may be present as well, but this is a different issue from read-through contamination. As such, the behaviour of removing AGATCGGAAGAGC contamination, but not GATCGGAAGAGC, is both correct and expected.

In a bit more detail: If you see TruSeq adapters in the sample that start with GATCGGAAGAGC... this is really only the adapter, or a dimer of itself. This sequence will not align to any genome, so I simply would not bother about it; it will effectively be removed in the alignment step.

From the trimming point of view, the sequence AGATCGGAAGAGC, with the extra A from A-tailing, cannot produce a good match to the TruSeq adapter:

GATCGGAAGAGC...     "reference"
     | |
AGATCGGAAGAGC.      adapter

The only option would be to allow so many mismatches that - as you said - it trims more or less random stuff as well.
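
The position-by-position comparison in the diagram above can be reproduced with a short sketch. It performs an ungapped comparison only, which is a simplification and not how Cutadapt aligns adapters; the helper name `ungapped_matches` is hypothetical.

```python
from typing import List


def ungapped_matches(a: str, b: str) -> List[int]:
    """Return 0-based positions where a and b carry the same base, with no gaps."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


if __name__ == "__main__":
    read_start = "GATCGGAAGAGC"   # start of the leftover TruSeq reads ("reference")
    adapter = "AGATCGGAAGAGC"     # Trim Galore's Illumina adapter, with the extra A
    hits = ungapped_matches(read_start, adapter)
    print(hits)                   # [5, 7]
    print(f"{len(hits)} of {len(read_start)} positions match")
```

Only two of the twelve compared positions agree, which corresponds to the two `|` marks in the diagram.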

I personally would only use Trim Galore in the default mode, and simply forget about some TruSeq dimer sequences (or maybe change the sample prep somewhat so you don't get these) as they drop out in the mapping step anyway.

As you have

ag1805x commented 3 years ago

I wouldn't have bothered much if it was RNA-Seq data. But this is WGBS data. I think in one of the tutorials it was mentioned: "adapter contamination may in a Bisulfite-Seq setting lead to mis-alignments and hence incorrect methylation calls". Can I afford to retain the adapters?

Using Trimmomatic does solve the issue though:

trimmomatic SE -threads 40 data.fastq.gz data_clean.fastq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 TRAILING:20 LEADING:20 MINLEN:20

FelixKrueger commented 3 years ago

The [read-through] adapter contamination, i.e. reads that have just a few bases of adapter on their 3' ends, may under certain circumstances be aligned to incorrect places (depending on mapping parameters).

Full-length Illumina TruSeq adapters have no resemblance to the genome, and will thus not map (and also not result in incorrect methylation calls). I am pretty sure that removing or ignoring them will give the exact same results.
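
To get a feel for how the two cases discussed in this thread differ, here is a rough sketch that labels a read by where the adapter core first appears. It assumes exact string matching on the 12 bp core GATCGGAAGAGC, with no mismatch tolerance. The function name `classify_read`, the labels, and the example reads are invented for illustration and are not part of Trim Galore.

```python
# 12 bp core shared by both adapter forms mentioned in the issue
CORE = "GATCGGAAGAGC"


def classify_read(seq: str, core: str = CORE) -> str:
    """Label a read by the position of the first exact match to the adapter core."""
    pos = seq.find(core)
    if pos == -1:
        return "no_adapter"      # no exact core match anywhere in the read
    if pos == 0:
        return "adapter_only"    # read begins with the adapter (adapter or dimer)
    return "read_through"        # insert sequence first, adapter further along


if __name__ == "__main__":
    # Made-up example reads, not real data
    for read in ("GATCGGAAGAGCACACGTC",
                 "TTGGTTAGGTAGATCGGAAGAGC",
                 "TTGGTTAGGTTTAGG"):
        print(read, classify_read(read))
```

Reads labelled `adapter_only` are the ones expected to drop out at the mapping step, while `read_through` reads are the case that default trimming targets.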