FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

After trimming 10 bp at 5', the GC content at 3' changed oddly #136

Closed XiaoTW123 closed 2 years ago

XiaoTW123 commented 2 years ago

Hi Felix Krueger,

I ran FastQC before trimming and found that the 'per base sequence content' at the 5' end was not good, so I used the following command for quality control (removing adapters, low-quality reads, and 10 bp at the 5' end):

trim_galore --paired --quality 20 --fastqc --cores 8 --clip_R1 10 --clip_R2 10 --basename ngs_trimmed xx.R1.fastq.gz xx.R2.fastq.gz

After trimming, the 'per base sequence content' at the 5' end looks good, but the 3' end changed oddly. Is there anything wrong with my command? I hope to hear from you soon. Thanks.

FelixKrueger commented 2 years ago

The funky-looking 3' ends are simply a consequence of strict adapter trimming. As the standard Illumina adapter starts with AGATC..., reads may never end in A (as this could be the first base of the adapter, and is therefore trimmed). Equally, the endings AG, AGA, AGAT, AGATC ... are all trimmed off the ends of reads; as a result you get this drop of A for the very last base, with the other 3 bases taking over.
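To make the effect concrete, here is a minimal sketch in Python. It is not how Cutadapt or Trim Galore are implemented: it assumes exact matching only, a single trimming pass, a minimum overlap of 1 base, and uniformly random reads. The function names and the `min_overlap` parameter are made up for this illustration. It removes the longest 3' suffix of each read that matches the start of the adapter, then compares the composition of the last base before and after. In this simplified version a read ending in AA still ends in A after one pass, so A comes out strongly depleted rather than completely absent.

```python
import random
from collections import Counter

# Start of the standard Illumina adapter mentioned above.
ADAPTER = "AGATCGGAAGAGC"


def trim_adapter_overlap(read, adapter=ADAPTER, min_overlap=1):
    """Remove the longest 3' suffix of `read` that exactly matches a prefix
    of `adapter`. Simplification: no mismatches, one pass only."""
    max_k = min(len(read), len(adapter))
    # Try the longest possible overlap first, down to min_overlap (at least 1).
    for k in range(max_k, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def last_base_fractions(reads):
    """Fraction of A, C, G and T at the final position of the non-empty reads."""
    counts = Counter(r[-1] for r in reads if r)
    total = sum(counts.values())
    return {base: counts.get(base, 0) / total for base in "ACGT"}


def simulate(n_reads=20000, read_len=50, seed=0):
    """Generate random reads (an assumption; real reads are not uniform)
    and return last-base fractions before and after trimming."""
    rng = random.Random(seed)
    reads = ["".join(rng.choice("ACGT") for _ in range(read_len))
             for _ in range(n_reads)]
    trimmed = [trim_adapter_overlap(r) for r in reads]
    return last_base_fractions(reads), last_base_fractions(trimmed)


if __name__ == "__main__":
    before, after = simulate()
    print("last base before trimming:", before)
    print("last base after trimming: ", after)
```

Running it prints roughly equal fractions for all four bases before trimming and a clearly lower fraction of A afterwards, which is the same shape as the drop seen at the 3' end of the FastQC plot.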

Is your data RNA-seq by any chance? RNA-seq data typically shows biased positions at the start of reads, but these don't tend to interfere with anything dramatically (and are typically not removed). Removing them will probably only shift the start/end coordinates by 10 bp.

XiaoTW123 commented 2 years ago

Thanks for your reply. What I described above happens in both my Illumina data and my Hi-C reads.

FelixKrueger commented 2 years ago

If by Illumina data you mean Illumina RNA-seq data, then that's to be expected (and I would probably not trim off the 5' ends). For the Hi-C data, you might want to check the protocol to see whether restriction enzyme sites are expected at the 5' end that need to be kept in place.

In any case, I hope it is now clear that the phenomenon at the 3' end is expected and nothing to worry about.