FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
467 stars 151 forks source link

TrimGalore removes overrepresented sequences only from R2, R1 untouched. #108

Closed kvn95ss closed 4 years ago

kvn95ss commented 4 years ago

Hello,

I've used Trim Galore on paired WGS data which was sequenced on Illumina platform.

R1 had the following over represented sequence - ATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATG | 1189128 | 0.3527360036850033 | TruSeq Adapter, Index 20 (97% over 43bp)

and R2 had this - ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGT | 800455 | 0.23744230884284898 | Illumina Single End PCR Primer 1 (100% over 50bp)

I used the default parameters, and it picked Illumina universal adaptor sequence 'AGATCGGAAGAGC'. The exact command I ran was -

trim_galore --paired --cores 18 file_R1.fq file_R2.fq

Post trimming, both files failed for Per GC Content. R1 still has the over represented sequences left in it but R2 doesn't have any over represented sequences - ATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATGCCGTCTTCTGCTTGAAAAA | 395388 | 0.11793235964678503 | TruSeq Adapter, Index 20 (97% over 43bp)

My questions are -

  1. Since the initial sequences are same, I'm wondering why the trimming didn't work for the R1 file.

  2. From my reading, I came to know the default stringent setting has lead to failure of Per GC Content and the stringent criteria is mainly for BS-Seq. Can I change the setting to 4-5? Would be good enough for the data as it's WGS?

  3. Are there any additional settings which I can use to improve the trimming, since I ran the tool in default mode?

FelixKrueger commented 4 years ago

Hi @kvn95ss

To 1): The Illumina adapter sequence read-through contamination does indeed start with the sequence AGATCGGAAGAGC, you can see a cumulative plot of this type of contamination in the FastQC Adapter contamination plot. After trimming, the contamination plot in FastQC should be completely flat.

Over-represented sequences on the other hand (in FastQC terms) are different. Here, the first 50bp of each sequence are taken and tested for exact matches in your entire data file. As you can seen, the two sequences you linked do not contain the Illumina standard adapter sequence, and therefore not trimmed (and rightly so). Both sequences are minor contaminants (some 0.2-0.3% of all reads), but since they are artificial sequence they will simply not align to any genome later on. You could trim this specific sequence as a proof of principle by specifying -a ATCGGAAGAGC, but since the sequences are not going to align anyway I would personally just skip this step.

To 2): For WGS it is probably fine to go a bit easier on the stringency - I believe the default in Cutadapt is also 3bp minimum. Just set --stringency 3 to effect this change.

To 3): There are plenty of options to choose from (see --help for all options), but only because there are more options available it doesn't really mean you should be using them. If anything I would probably relax the stringency somewhat, but otherwise I would use the default options as they should work for most scenarios. As a first attempt I would probably use:

trim_galore --paired --gzip --stringency 3 --cores 4 file_R1.fq file_R2.fq

I hope this helps.

kvn95ss commented 4 years ago

Hay @FelixKrueger

Thanks for the reply.

In the over represented sequences, if you remove the first 2 nucleotides from the illumina universal adapter sequence it should ` ATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATG # R1 before trimming ATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGT # R2 before trimming ATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATGCCGTCTTCTGCTTGAAAAA # R1 after trimming

AGATCGGAAGAGC

ATCGGAAGAGC # removing the first 2 bases `

so I was wondering why it wasn't trimmed. Anyways, since they won't be aligning to the genome (plus them making up only 0.2%, and 0.1% after trimming) there shouldn't be an issue with variant calling.

I'm running trim_galore with less stringent parameters and might use that data for processing.

FelixKrueger commented 4 years ago

The way the trimming works is the following: if there is a partial match at the end, the partial match is trimmed.

If there is a full match to the adapter sequence, the sequence gets trimmed at the start of the contaminant, and from there to the end of the sequence. The read sequence needs to match the adapter sequence in full, however some error is tolerated (in default conditions, the tolerated error rate is 0.1, which means for a 13 bp sequence (AGATCGGAAGAGC) that 1.3, or rounded down, 1 error is acceptable. Since in your example the first 2 bases are not present, the sequence does not count as a legit contamination, and is therefore untouched.

Good luck!