Closed shaghayeghsoudi closed 1 year ago
Ok, I'll give it a go. Please see the attached slides for illustration. Potential pitfalls with RRBS libraries.pptx
For directional the situation is fairly straight forward (slide 3, OT and OB strands).
single-end: For SE libraries, you can have the situation that reads are shorter than the fragment size so you never sequence into the adapter on the 3' end: fine. If the read length is longer than the fragment, you will sequence into adapter and thus need to remove 2bp prior to adapter. So far so good.
paired-end: For paired-end reads, R1 is the same as in the single-end situation. Read 2 technically looks like a read from CTOT or CTOB, so the second base from the start (5' end) contains the base opposite the filled in C, and this base is almost certainly called unmethylated.
As a consequence, for directional paired-end reads, R1 needs to be trimmed by 2bp prior to adapter read-through contamination (if present), while R2 needs to be trimmed by 2bp at the 5' end invariably.
True non-directional libraries are typically a mix where 50% of the reads have OT or OB as Read 1, which is the exact same situation as outlined above. In the other 50% of the time, R1 will be from the CTOT or CTOB strand, and thus, you would need to trim 2bp off the 5' end of R1, and remove 2 additional bp just before the start of the adapter contamination of R2.
To capture all potential fill in biases you would now have the choice to either remove 2bp at the start of both R1 and R2, as well as before read-through adapter contamination in both reads. This may be a little wasteful but is the conservative approach. The other option would be to scan R1 if it starts with CAA
(fill-in with unmethylated cytosine) or CGA
(fill-in with methylated cytosine), and only 5' clip then, and similarly only trim on the 3' end if necessary. The latter option is what Trim Galore does in --non_directional
mode.
Now your case is different from both these scenarios, and in fact looks like a PBAT version of a directional RRBS experiment.
CGG
or TGG
CGG
or TGG
and CAA
in R1. Which you do not see.CGG
or TGG
, and 100% or R1 starts with CAA
--rrbs
trimming mode in Trim Galore does not expect a PBAT-style RRBS setup, I think you should consider a modified version of trimming, namely:trim_galore --paired --clip_r1 2 --three_prime_clip_r2 2 sample_R1.fastq.gz sample2.fastq.gz
followed by doing --pbat
alignments in Bismark.
--diagenode
option in Trim Galore?I hope this helps de-mystify the questions you had?
Thanks so much Felix for the clarification. Sure I will share your explanations with them.
As a quick follow-up: I just hopped off a call with the team at Diagenode, who confirmed that the Premium RRBS kit V2 produces libraries where the read carrying CGG/TGG
as the first three bases gets sequenced as R2, and the read starting with CAA
is sequenced as R1. This behaviour is achieved by using the flowcell adapters in an inverse order, thereby mimicking an RRBS sample that was prepared in a PBAT fashion. Of note: the kit also introduces UMIs into the index read, which can be a *very useful feature for RRBS data!
So in conclusion, as I have already suggested above, it seems that the V2 kit requires specific trimming recommendations:
trim_galore --paired --clip_r1 2 --three_prime_clip_r2 2 sample_R1.fastq.gz sample2.fastq.gz
*, followed by--pbat
The team at Diagenode are going to change the recommendations on their website accordingly. I will now amend my comment above to tone it down a little. Thanks @Diagenode for taking action on this!
--rrbs
), but I think this would start to get quite confusing. Thanks a lot @FelixKrueger . Really appreciate your response
Hi Felix,
If you remember my email last week where I shared my fastqc and Trim_galore results with you. As I said we used Diagenode Premium RRBS kit V1 to process our Sarcoma tumour and normal samples and as you also mentioned our R1 and R2 reads seemed swapped. I was reading kit V2 manual (which is actually the same as V1 with some extra explanations) and I noticed they have some recommendations for bioinformatics part. They recommend to do non-directional trimming and then --pbat mode for the alignment.
I have done several different runs and here I am attaching some of my results:
If you look at the attached image, It seems running Trim_galore with --non-directional mode gives the results that we expect and trims 2 bp from 5
end of R1 and keeps the first 2 bp from the 5
end of R2 which are indeed true methylated sites. On the other hand by running directional mode I lose my true methylation sites on R2 reads. I am really confused to understand the logic of how directional vs. non-directional modes trims reads and why running --non-directional mode fix my reads? I appreciate if you could provide some explanations and why we see this pattern.I have contacted Diagenode to know why R2 and R1 are swapped but it seems that's the way is designed and works. It seems trimming non-directional and aligning --pbat gives the results I want without any need for swapping the reads.