FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
472 stars 151 forks source link

Trim_galore directional vs. non-directional mode #177

Closed shaghayeghsoudi closed 1 year ago

shaghayeghsoudi commented 1 year ago

Hi Felix,

If you remember my email last week where I shared my fastqc and Trim_galore results with you. As I said we used Diagenode Premium RRBS kit V1 to process our Sarcoma tumour and normal samples and as you also mentioned our R1 and R2 reads seemed swapped. I was reading kit V2 manual (which is actually the same as V1 with some extra explanations) and I noticed they have some recommendations for bioinformatics part. They recommend to do non-directional trimming and then --pbat mode for the alignment.

I have done several different runs and here I am attaching some of my results:

Untitled

If you look at the attached image, It seems running Trim_galore with --non-directional mode gives the results that we expect and trims 2 bp from 5end of R1 and keeps the first 2 bp from the 5 end of R2 which are indeed true methylated sites. On the other hand by running directional mode I lose my true methylation sites on R2 reads. I am really confused to understand the logic of how directional vs. non-directional modes trims reads and why running --non-directional mode fix my reads? I appreciate if you could provide some explanations and why we see this pattern.

I have contacted Diagenode to know why R2 and R1 are swapped but it seems that's the way is designed and works. It seems trimming non-directional and aligning --pbat gives the results I want without any need for swapping the reads.

FelixKrueger commented 1 year ago

Ok, I'll give it a go. Please see the attached slides for illustration. Potential pitfalls with RRBS libraries.pptx

Directional libraries

For directional the situation is fairly straight forward (slide 3, OT and OB strands).

single-end: For SE libraries, you can have the situation that reads are shorter than the fragment size so you never sequence into the adapter on the 3' end: fine. If the read length is longer than the fragment, you will sequence into adapter and thus need to remove 2bp prior to adapter. So far so good.

paired-end: For paired-end reads, R1 is the same as in the single-end situation. Read 2 technically looks like a read from CTOT or CTOB, so the second base from the start (5' end) contains the base opposite the filled in C, and this base is almost certainly called unmethylated.

As a consequence, for directional paired-end reads, R1 needs to be trimmed by 2bp prior to adapter read-through contamination (if present), while R2 needs to be trimmed by 2bp at the 5' end invariably.

Non-directional libraries

True non-directional libraries are typically a mix where 50% of the reads have OT or OB as Read 1, which is the exact same situation as outlined above. In the other 50% of the time, R1 will be from the CTOT or CTOB strand, and thus, you would need to trim 2bp off the 5' end of R1, and remove 2 additional bp just before the start of the adapter contamination of R2.

To capture all potential fill in biases you would now have the choice to either remove 2bp at the start of both R1 and R2, as well as before read-through adapter contamination in both reads. This may be a little wasteful but is the conservative approach. The other option would be to scan R1 if it starts with CAA (fill-in with unmethylated cytosine) or CGA (fill-in with methylated cytosine), and only 5' clip then, and similarly only trim on the 3' end if necessary. The latter option is what Trim Galore does in --non_directional mode.

Your case

Now your case is different from both these scenarios, and in fact looks like a PBAT version of a directional RRBS experiment.

Conclusions

  1. Since the --rrbs trimming mode in Trim Galore does not expect a PBAT-style RRBS setup, I think you should consider a modified version of trimming, namely:
trim_galore --paired --clip_r1 2 --three_prime_clip_r2 2 sample_R1.fastq.gz sample2.fastq.gz

followed by doing --pbat alignments in Bismark.

  1. Regarding Diagenode. I am afraid their recommendations are currently misleading (at the time of writing 21 March 2024, Diagenode have agreed to change the recommendations for the V2 kit, see post below). Feel free to send them my way, and/or read this thread, but they should really change their recommendation and provide documentation for how you end up getting a PBAT version of a directional RRBS library. Who knows, maybe it is time for a --diagenode option in Trim Galore?

I hope this helps de-mystify the questions you had?

shaghayeghsoudi commented 1 year ago

Thanks so much Felix for the clarification. Sure I will share your explanations with them.

FelixKrueger commented 8 months ago

As a quick follow-up: I just hopped off a call with the team at Diagenode, who confirmed that the Premium RRBS kit V2 produces libraries where the read carrying CGG/TGG as the first three bases gets sequenced as R2, and the read starting with CAA is sequenced as R1. This behaviour is achieved by using the flowcell adapters in an inverse order, thereby mimicking an RRBS sample that was prepared in a PBAT fashion. Of note: the kit also introduces UMIs into the index read, which can be a *very useful feature for RRBS data!

So in conclusion, as I have already suggested above, it seems that the V2 kit requires specific trimming recommendations:

  1. trim_galore --paired --clip_r1 2 --three_prime_clip_r2 2 sample_R1.fastq.gz sample2.fastq.gz*, followed by
  2. Bismark alignments using --pbat

The team at Diagenode are going to change the recommendations on their website accordingly. I will now amend my comment above to tone it down a little. Thanks @Diagenode for taking action on this!

shaghayeghsoudi commented 8 months ago

Thanks a lot @FelixKrueger . Really appreciate your response