FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
461 stars 150 forks source link

No reads end with "A" after trimming #129

Closed hkaspersen closed 2 years ago

hkaspersen commented 2 years ago

Hello, and thanks for creating a great tool!

We used TrimGalore to trim shotgun metagenomic datasets sequenced on NovaSeq with the ThruPLEX library prep. We noticed that after trimming with the auto-detect setting, none of the reads in all our samples ended with an "A". We were wondering if this was something to do with the index sequence not being removed, or if something happened in the trimming? I'm not sure if this is something to worry about, just thought it was peculiar and would like to know why!

This is the command we used:

trim_galore -o $output --paired --quality 15 $R1 $R2

These are the adapter sequences that the sequencing provider used (not used in the trimming command): --adapter1 AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --adapter2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

This is the fastqc report for per base sequence content before trimming: per_base_s_c_raw

And after: per_base_s_c_trimmed

FelixKrueger commented 2 years ago

Hi @hkaspersen

This behaviour is expected, and is a result of using a very strict stringency by default:


--stringency <INT>      Overlap with adapter sequence required to trim a sequence. Defaults to a
                        very stringent setting of 1, i.e. even a single bp of overlapping sequence
                        will be trimmed off from the 3' end of any read.

As your adapter sequences both start with an A (correctly auto-detected from AGATCGGAAGAGC), none of the trimmed sequences will end in an A. While this looks quite drastic in the FastQC report, it will certainly not have a noticable impact on mapping of 150bp long sequences.

Just for the background, Trim Galore was originally meant to assist with bisulfite sequencing, where sequences are not only aligned to a certain locus in the genome, but individual bases also take part in calling the methylation state of cytosines. For this purpose we set the --stringency to a very strict setting of 1, I believe the default for Cutadapt on its own for genomic sequences is 3. While this did matter back somewhat in 2010 when sequences were only 28 or 36bp long, these days I think one can comfortably deal with generously removing suspected adapter sequences. Does this answer your question?

hkaspersen commented 2 years ago

Thank you for a swift reply! Yes, this is very useful. I think we can safely ignore the issue then. Thanks!