FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

Does TrimGalore do soft clipping by default? #189

Closed tarunaaggarwal closed 4 months ago

tarunaaggarwal commented 4 months ago

Hello,

I think the answer is yes based on the options' descriptions but I wanted to confirm this anyway. Does TrimGalore do soft clipping of the reads by default based on the given quality threshold? I ran the following command and wanted to make sure my reads were soft clipped because I need to mark duplicates in the bam files and hard clipping can cause errors. Thank you so much!

~/pkgs/TrimGalore-0.6.10/trim_galore --paired \
Sample1_L005_R1_001.fastq.gz Sample1_L005_R2_001.fastq.gz \
--basename Sample1 \
-q 15 \
--length 50 \
-o outDir \
--retain_unpaired \
-j 8
done
FelixKrueger commented 4 months ago

Trim Galore either clips (= trims), or it doesn't. At this stage there is no further distinction.

The concept of hard-clipping vs. soft-clipping comes in only at the alignment stage, where the aligner in question either ignores some bases but keeps them in the read record (soft-clipping), or trims them off the record entirely (hard-clipping).
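To make the distinction concrete, here is a minimal sketch (not part of Trim Galore) that counts soft- and hard-clipped bases in a CIGAR string from a SAM/BAM record. It relies only on the SAM specification's definition of the `S` and `H` operations; the function names and the example CIGAR string are made up for illustration.

```python
import re

# CIGAR operations as defined in the SAM specification:
# S = soft clip (clipped bases remain in SEQ), H = hard clip (clipped bases are absent from SEQ).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_bases(cigar):
    """Return (soft_clipped, hard_clipped) base counts for one CIGAR string."""
    soft = hard = 0
    for length, op in CIGAR_OP.findall(cigar):
        if op == "S":
            soft += int(length)
        elif op == "H":
            hard += int(length)
    return soft, hard


def stored_sequence_length(cigar):
    """Length of SEQ implied by the CIGAR: M, I, S, = and X consume read bases."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


# Hypothetical alignment: 5 bases soft-clipped at the start, 5 hard-clipped at the end.
example = "5S90M5H"
print(clipped_bases(example))
print(stored_sequence_length(example))
```

A trimmed FASTQ read carries no such annotation at all: the removed bases are simply gone before the aligner ever sees the read.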

And yes, Trim Galore does by default trim off bases using a quality threshold of 20, or in your case Phred 15.

tarunaaggarwal commented 4 months ago

Dear @FelixKrueger - thank you so much for your quick response! Is there a way to not trim off any bases from a read AND only drop reads with an overall quality score below a threshold, please?

I'm watching a tutorial on what duplicates are and how to mark them after mapping. The presenters say they don't recommend trimming bases off reads. When bases are trimmed off from either end of a read, that's hard clipping, right? If so, is there a way to not do that with TrimGalore?

I really love your tool! I tried a few trimming tools before TrimGalore and it was the only one that got rid of all the adapter reads from my fastqs.

FelixKrueger commented 4 months ago

Generally, there are two schools of thought. Either you remove sources of error from reads before they even get fed into a processing tool (e.g. an aligner), using a read trimmer; or you let the aligner take care of these things itself (which is what is recommended in the GATK best practices which you linked in the tutorial). Several years ago I attended a GATK workshop, and I have to say I am still not entirely convinced I agree with that sentiment, at least not completely.

What I find relatively uncontentious is removing read-through adapter contamination, which happens whenever the read length is longer than the insert to be sequenced. This sequence can never help with variant calling further downstream in the GATK workflow - because it is adapter sequence...

If you are interested in variant calling, then they suggest also not trimming poor quality bases by simply using a threshold of e.g. Phred 20, because this may or may not have contained information for a SNV. I believe the argument is that even if a base has a poor basecall quality for a SNV position, the variant caller can factor in this poor quality and attribute a low confidence to it. Aggregated over all reads of an experiment there may potentially still be some information to be had from poor quality mismatches/SNVs, but frankly I don't think that the impact will be all that relevant. Depends on your question though.

In practical terms, many pre-processing workflows do remove adapter (and/or poor quality) sequences, even variant calling workflows implementing GATK best practices (e.g. https://nf-co.re/sarek/3.4.1). If you wanted to disable quality trimming altogether you could select -q 0, and only remove adapter contamination. I am afraid there is no functionality in Trim Galore (or Cutadapt?) to remove only reads with a poor quality overall; googling whether any tools exist to do this might yield an answer (you could of course write a script that would do this for you). Frankly, I have never come across anyone wanting to do this before though.
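A script of the kind mentioned above could look roughly like the following sketch. It is not part of Trim Galore or Cutadapt. It assumes plain four-line FASTQ records with Phred+33 quality encoding, and it uses the arithmetic mean of the Phred scores as the "overall quality", which is a simplification (averaging error probabilities would be more rigorous). The function names and the threshold of 20 are illustrative choices.

```python
import io


def mean_phred(quality_line, offset=33):
    """Arithmetic mean of the Phred scores encoded in a FASTQ quality string."""
    if not quality_line:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_line) / len(quality_line)


def filter_by_mean_quality(fastq_in, fastq_out, min_mean_q=20):
    """Copy four-line FASTQ records whose mean quality is >= min_mean_q.

    Reads are written unchanged (no bases are trimmed). Returns (kept, dropped).
    """
    kept = dropped = 0
    while True:
        header = fastq_in.readline()
        if not header:
            break  # end of input
        seq = fastq_in.readline()
        plus = fastq_in.readline()
        qual = fastq_in.readline()
        if mean_phred(qual.rstrip("\r\n")) >= min_mean_q:
            fastq_out.write(header + seq + plus + qual)
            kept += 1
        else:
            dropped += 1
    return kept, dropped


# Tiny in-memory demo: 'I' encodes Phred 40, '#' encodes Phred 2.
demo_in = io.StringIO("@good\nACGT\n+\nIIII\n@poor\nACGT\n+\n####\n")
demo_out = io.StringIO()
counts = filter_by_mean_quality(demo_in, demo_out, min_mean_q=20)
print(counts)
print(demo_out.getvalue(), end="")
```

For gzipped files, the same function can be fed handles opened with `gzip.open(path, "rt")` and `gzip.open(path, "wt")`. Note that this sketch treats each file independently, so for paired-end data both mates would need to be filtered together to keep R1 and R2 in sync.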

Does this help to make up your mind?

tarunaaggarwal commented 4 months ago

Hi @FelixKrueger - thank you for your thoughts! Perhaps I'm worried too much for no good reason. I am not calling variants but am rather interested in the relative abundance of genes in metagenomic assembled genomes. And I'd imagine based on what you wrote, the impact of these trimmed reads on my overall findings should be small.

I ran TrimGalore last night with a min len of 151 which is also my read length. I'm going to do a test on these reads and also on reads that I trim using a -q of 0. BUT I think I can proceed with my pipeline with the trimmed reads.

Thank you again very much for your advice!

Best, Taruna

FelixKrueger commented 4 months ago

Using a minimum length of 151bp will essentially discard everything that is either adapter or quality trimmed.
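As a toy illustration of why that is, the sketch below applies a minimum-length filter to a handful of made-up read lengths (the numbers are hypothetical, not taken from the data in this thread). With 151 bp reads, any read that lost even a single base to adapter or quality trimming falls below a cutoff of 151 and is dropped.

```python
def surviving_reads(trimmed_lengths, min_length):
    """Keep only reads at least min_length long, mimicking a minimum-length filter."""
    return [n for n in trimmed_lengths if n >= min_length]


read_length = 151
# Hypothetical lengths after trimming: 151 means the read was left untouched.
lengths_after_trimming = [151, 151, 150, 120, 151, 98]

kept = surviving_reads(lengths_after_trimming, min_length=read_length)
print(len(kept), "of", len(lengths_after_trimming), "reads kept")
```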

If I were you I would probably just run it in default mode, and not start altering every single option before even having looked at the results at all :)

tarunaaggarwal commented 4 months ago

Fair enough! I shall use the reads that were trimmed using the default parameters. I have those already, fortunately.