FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

Trimming for xGen™ Methyl-Seq DNA Library Prep Kit #174

Closed abearab closed 1 year ago

abearab commented 1 year ago

Hi There,

I need some technical assistance. I'm trying to set up a pipeline for WGBS data prepped with the xGen™ Methyl-Seq DNA Library Prep Kit. I get ~60% mapping efficiency using a TrimGalore + Bismark pipeline with TrimGalore's auto-detection of adapter sequences.

Now I am aiming to go for a higher mapping efficiency as promised here – i.e. 70%. To my understanding this is their suggestion:

trim 10 bases from the end of R1 (3’ end) and 10 bases from the beginning of R2 (5’ end) to remove tail sequences

Here is my script, which made the mapping efficiency even worse, ~40%:

trim_galore --core 10 --three_prime_clip_R1 10 --clip_R2 10 --paired -o 5b_R1.fastq.gz 5b_R2.fastq.gz

@FelixKrueger, do you have any idea how this can be resolved? I've had great experiences asking technical questions here, and your code maintenance and responsiveness are appreciated!

FelixKrueger commented 1 year ago

This sounds a little odd. Would you be able to provide some test sequences for me to take a look at? Some 200K untrimmed, gzipped reads should fit in an email. Cheers
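One way to produce such a test subset is to copy the first 200K records from each gzipped FastQ file. The following is only a minimal sketch: the function name `head_fastq` is made up for illustration, and it assumes standard 4-line FastQ records. Taking the same number of leading records from both files keeps the R1/R2 pairing intact.

```python
import gzip
from itertools import islice


def head_fastq(src_path, dst_path, n_reads=200_000):
    """Copy the first n_reads FastQ records from src_path to dst_path.

    Both files are gzipped. Assumes each record is exactly 4 lines
    (header, sequence, '+', qualities). Returns the number of records written.
    """
    lines_written = 0
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in islice(src, 4 * n_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4
```

It would be called once per mate, for example `head_fastq("5b_R1.fastq.gz", "test_R1.fastq.gz")` and likewise for R2; the `test_*` output names are placeholders.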

abearab commented 1 year ago

Yeah, it is. I'll send the test fastq files shortly.

FelixKrueger commented 1 year ago

Hi Abe,

I really don't know what went wrong, but your data looks absolutely lovely!

The data appears to be Accel Swift data, with a hefty bias of Gs at the start of Read 2:

[Image: Screenshot 2023-09-20 at 10 19 44]

According to our trimming recommendations for this type of library (see here) I went ahead and trimmed the data like so:

trim_galore --clip_R1 10 --three_prime_clip_R1 10 --clip_R2 20 --three_prime_clip_R2 10 --paired 5b_R1.fastq.gz 5b_R2.fastq.gz
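To make the effect of the four clipping values concrete, here is a small sketch of fixed-length clipping on its own. It deliberately ignores the adapter and quality trimming that Trim Galore also performs, and the 150 bp read length is a hypothetical value, not something stated in this thread.

```python
def hard_clip(seq, five_prime=0, three_prime=0):
    """Remove a fixed number of bases from the 5' (start) and 3' (end) of a read.

    Returns an empty string if the clips would consume the whole read.
    """
    end = len(seq) - three_prime
    return seq[five_prime:end] if end > five_prime else ""


# 5' and 3' clip values for each mate, taken from the command above.
R1_CLIP = {"five_prime": 10, "three_prime": 10}
R2_CLIP = {"five_prime": 20, "three_prime": 10}

read = "N" * 150  # hypothetical 150 bp read, for illustration only
print(len(hard_clip(read, **R1_CLIP)))  # 130 bases remain in R1
print(len(hard_clip(read, **R2_CLIP)))  # 120 bases remain in R2
```

Compared with the original attempt, which clipped only the 3' end of R1 and 10 bp from the 5' end of R2, this removes bases from both ends of both reads and twice as many from the G-biased start of R2.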

Then, using default Bismark alignments to the human genome I achieved the following stats:

Sequence pairs analysed in total:   192138
Number of paired-end alignments with a unique best hit: 153633
Mapping efficiency: 80.0%

...

C methylated in CpG context:    80.2%
C methylated in CHG context:    0.7%
C methylated in CHH context:    0.7%

Which looks very good indeed to me. You might get away with trimming only 15 bp from R2, but this is really personal preference. I hope this helps?
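As a quick sanity check on the figures quoted above, the mapping efficiency can be recomputed from the two counts. This assumes it is simply the number of pairs with a unique best hit divided by the total number of pairs analysed, which is consistent with the reported 80.0%.

```python
def mapping_efficiency(unique_hits, total_pairs):
    """Percentage of analysed read pairs that aligned with a unique best hit."""
    return 100.0 * unique_hits / total_pairs


# Counts copied from the alignment report quoted above.
print(f"{mapping_efficiency(153633, 192138):.1f}%")  # 80.0%
```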

abearab commented 1 year ago

Cool, thanks for doing this. I can also get the same mapping efficiency!! I'm closing this issue for now, and I'll stay in touch if I have more questions :)