FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0
459 stars 149 forks source link

RRBS data trimming-CGA problem #170

Closed MilosCRF closed 1 year ago

MilosCRF commented 1 year ago

Hi!

I have 150bp PE data from the non-directional rrbs library. When I trim the data with --rrbs and --non-directional, the CGA still remain at the beginning of reads.
Is it correct? Should not be the CG trimmed out?

my code is: trim_galore --paired --fastqc --gzip --nextseq 30 --rrbs --non_directional --cores 14 --length 20

Thanks a lot Milos

per_base_sequence_content

FelixKrueger commented 1 year ago

Hi @MilosCRF

Sorry for the late reply but I was out of the office. I think I haven't dealt with non-directional RRBS data in well over a decade, and I have to admit that I find it quite confusing. I have studied the attached powerpoint file (which is at least as old...), and I am beginning to wonder whether your data is non-directional al all? In the plot above, are you showing R1 or R2? Could you also send the profiles before trimming (both R1 and R2)? Is there are chance that these are standard (directional) RRBS libraries and you performed the fill-in reaction with methylated Cs?

Potential pitfalls with RRBS libraries.pptx

MilosCRF commented 1 year ago

Hi @FelixKrueger

many thanks for your reply. We used Zymo-Seq RRBS Library Kit so the data should be non-directional as stated in the kit's protocol on page 13. https://files.zymoresearch.com/protocols/d5460_d5461_zymo-seq_rrbs_library_kit.pdf The kit uses 5-methylcytosine dNTP Mix, so the fill-in reaction was performed with methylated Cs. I was considering hard trim these 3 bases to avoid any bias. The plots are below.

Many thanks once again Milos

The plot I sent before was R1 after trimming. Here is the plot from R2 after trimming: per_base_sequence_content

And here are plots (R1 followed by R2) for data before trimming: per_base_sequence_content per_base_sequence_content

FelixKrueger commented 1 year ago

thanks for these additional plots. It makes a lot of sense that you used methylated dNTPs, so you should definitely remove these bases as they introduce incorrect methylation states.

I am wondering however whether these libraries are really non-directional as Zymo suggest, because if they are were non-directional you would probably want to see a fair amount (say 30-50% of reads) starting with TGG or CGG (they might be, but it's difficult to see here). When you align your reads in --non_directional mode, what is the split of alignments between the OT/CTOT/OB/CTOB strands?

Given that your reads are 150bp long, I would probably hard-trim the reads by 2bp on both sides in addition to adapter trimming (--clip_r1 2 --clip_r2 2 --three_prime_clip_r1 2 --three_prime_clip_r2 2, and drop the --rrbs option. Back in the days when reads were still 36bp long one really didn't want to discard too much of the read length, but this no longer is an issue).

MilosCRF commented 1 year ago

@FelixKrueger thanks for your help. Well, I tried to run one sample in Bismark using both modes --non-directional and --directional.

Final Alignment report --non-directional

Sequence pairs analysed in total: 8019705 Number of paired-end alignments with a unique best hit: 4479072 Mapping efficiency: 55.9% Sequence pairs with no alignments under any condition: 1140823 Sequence pairs did not map uniquely: 2399810 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 976448 ((converted) top strand) GA/CT/CT: 1251854 (complementary to (converted) top strand) GA/CT/GA: 1264730 (complementary to (converted) bottom strand) CT/GA/GA: 986040 ((converted) bottom strand)

Final Alignment report --directional

Sequence pairs analysed in total: 8019705 Number of paired-end alignments with a unique best hit: 2722604 Mapping efficiency: 33.9% Sequence pairs with no alignments under any condition: 3784607 Sequence pairs did not map uniquely: 1512494 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 1342844 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 1379760 ((converted) bottom strand)

From this, I would say that the library is indeed non-directional. Am I correct? I will definitely stick with hard-trimming the reads by 2bp on both sides.

Thank you very much Milos

FelixKrueger commented 1 year ago

Yes, I agree. The library aligns non-directionally, with a slight preference for the CTOT and CTOB strands. Good luck!