FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

adapter+quality trimming followed by hard-trimming #130

Closed pmenzel closed 2 years ago

pmenzel commented 2 years ago

Hi,

Would it be possible to run the default adapter/quality trimming, followed immediately by hard trimming to a maximum length (option `--hardtrim5`), in one go? The output files would then be the same `_val_*.fq(.gz)` files as for normal trimming.

best wishes!

FelixKrueger commented 2 years ago

Hi @pmenzel

I can see that this might be occasionally useful, however:

pmenzel commented 2 years ago

Thanks for looking into it!

> * `--hardtrim5` is supposed to generate a new file with a defined sequence length. If you combine hard-trimming with adapter/quality trimming, a defined sequence length can no longer be guaranteed

You mean guaranteed in the sense that all reads are exactly of length N? But that would also depend on the input file, which might have reads shorter than N in the first place. Still, I can understand that it is treated as a separate step in Trim Galore.

> * Trim Galore has the options to trim sequences from the 3' end (`--three_prime_clip_r[12]`), even though this can be difficult to 'get right', again because of the variability of the adapter and quality trimming process

> * Trim Galore also has the options to select sequences/sequence pairs based on a minimum (`--length`) and maximum (`--max_length`) read length. Don't you think you could select a combination of these options to suit your needs, or at least get you very close to what you had in mind?

Unfortunately, these options do not do what `--hardtrim5` does.

What I have specifically in mind is removing that last extra cycle in Illumina FASTQ files (e.g. cycle 151, 251, etc.), which often has bad quality or miscalled Gs (NextSeq), and should not be included in downstream analysis.

FelixKrueger commented 2 years ago

I can see your point too; I'm just trying to get away with the options that are already there... :)

One more try: if you selected `--three_prime_clip_r1 1 --three_prime_clip_r2 1`, this would be guaranteed to take off 1 bp from the 3' end, wouldn't it? The downside would of course be that if a sequence had already been trimmed by adapter or quality trimming, you would lose one additional bp for these reads. Not ideal, but maybe tolerable for reads that long?
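To make the trade-off concrete, here is a minimal Python sketch of the two behaviours discussed above. It is not Trim Galore code: `hard_trim5` and `three_prime_clip` are hypothetical helpers that only mimic, on a single read sequence, what hard-trimming to a maximum length and clipping a fixed number of bases from the 3' end would do. The read lengths (151 and 120) are made-up examples.

```python
def hard_trim5(seq: str, max_len: int) -> str:
    """Keep at most the first max_len bases (mimics trimming to a maximum length)."""
    return seq[:max_len]


def three_prime_clip(seq: str, n: int) -> str:
    """Always remove n bases from the 3' end (mimics a fixed 3' clip)."""
    return seq[:-n] if n > 0 else seq


# Hypothetical reads: one untouched 151-cycle read, one already shortened
# to 120 bp by adapter/quality trimming.
full_read = "A" * 150 + "G"
trimmed_read = "A" * 120

for name, read in [("full", full_read), ("already trimmed", trimmed_read)]:
    print(name,
          "hard-trim to 150:", len(hard_trim5(read, 150)),
          "| clip 1 bp:", len(three_prime_clip(read, 1)))
```

Under these assumptions both approaches remove the extra last cycle from the full-length read, but only the fixed clip also shortens the read that was already trimmed, which is the extra base loss mentioned above.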

pmenzel commented 2 years ago

Yeah, I saw that option too, and as you said, one would always lose one base.

My specific dataset is an amplicon panel, in which the R1 and R2 reads of some amplicons barely overlap, so that is a case where every base counts. :)

Anyway, I can just run `trim_galore` twice and get the desired result.
I was just wondering if there would be a way to run it only once.
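A rough sketch of what such a two-pass run could look like, written as a small Python helper that only builds the command lines and does not execute anything. The function name `two_pass_commands` and all file names are hypothetical. The names of the first pass's output files are passed in explicitly rather than guessed, and the hard-trimming pass uses one call per file as a conservative assumption.

```python
import shlex


def two_pass_commands(raw_r1, raw_r2, trimmed_r1, trimmed_r2, max_len):
    """Build command lines for adapter/quality trimming followed by hard-trimming.

    trimmed_r1/trimmed_r2 are the output files of the first pass; they are
    supplied by the caller instead of being derived here.
    """
    # Pass 1: default adapter/quality trimming of the read pair.
    pass1 = ["trim_galore", "--paired", raw_r1, raw_r2]
    # Pass 2: hard-trim each trimmed file to at most max_len bases.
    pass2 = [["trim_galore", "--hardtrim5", str(max_len), f]
             for f in (trimmed_r1, trimmed_r2)]
    return [pass1, *pass2]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmds = two_pass_commands("sample_R1.fastq.gz", "sample_R2.fastq.gz",
                             "sample_R1_val_1.fq.gz", "sample_R2_val_2.fq.gz",
                             150)
    for cmd in cmds:
        print(shlex.join(cmd))
```

The printed lines can be checked by eye and then run in a shell. The final output files of this sequence would carry the hard-trimming suffix rather than the plain `_val_*` names asked for in the original question.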

FelixKrueger commented 2 years ago

OK, if that works for you, that would save me from implementing another option. If it were your only chance to get what you need, I might be persuaded to take another look, but it seems that you are happy enough to go with what we have right now. Cheers, Felix