FelixKrueger / TrimGalore

A wrapper around Cutadapt and FastQC to consistently apply adapter and quality trimming to FastQ files, with extra functionality for RRBS data
GNU General Public License v3.0

question about --rrbs of v0.4.3 #7

Closed hxlei closed 7 years ago

hxlei commented 7 years ago

Hi. I have read the link in the update, but I'm still confused about why the end of Read 2 is not affected by the artificial methylation states introduced by the [end-repair] fill-in reaction. According to the link, the mix of nucleotides fills in fragments with a 5′ overhang, so I think the end of Read 2 should be affected too. Could you please explain this in more detail?

Besides, I have another question. I have two batches of paired-end RRBS data. With version 0.4.3, the first batch seems fine, but the report for the second batch contains warnings like this:

Bases preceding removed adapters: A: 2.3% C: 0.6% G: 90.5% T: 4.2% none/other: 2.5%

WARNING: The adapter is preceded by "G" extremely often. The provided adapter sequence may be incomplete. To fix the problem, add "G" to the beginning of the adapter sequence.

All arguments were the same when processing the two batches: --rrbs --length 20 -s 3. I wonder what could lead to this warning. (I think that if the reads really do contain adapters, then most bases preceding the removed adapters should be G/A.)

Thank you very much!

FelixKrueger commented 7 years ago

Read 1 will start with [CT]GGNNNNNNNNNNNNNN irrespective of whether it came from the top or bottom strand, so let's just limit ourselves to the top strand for simplicity.

Read 1 will look like this (it may or may not read into the adapter on the 3' end; the [CT]GG at the start and the CCG before the adapter come from the MspI site, and A is adapter sequence):

R1: [CT]GGNNNNNNNNNNNNNNCCGAAAAAAA

You can see here that the adapter should always be preceded by a TG, with T being the filled-in, artificially converted residue (the second C of the CCG shown above, read as T after bisulfite conversion), and the G being part of the CCGG recognition sequence of MspI. The warning message about G being very frequent is issued by Cutadapt, but you don't need to add the G to the adapter sequence, because the option --rrbs will remove both the T and the G to avoid including the bias.
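To make the warning less mysterious, here is a minimal sketch of the kind of tally behind the "Bases preceding removed adapters" line. It is not Cutadapt's code: the function name, the toy reads, and the exact-match search are all assumptions for illustration (Cutadapt's real matching tolerates errors and partial adapters), and the adapter string is only an illustrative placeholder.

```python
from collections import Counter

# Illustrative adapter string (assumption); substitute whatever adapter is being trimmed.
ADAPTER = "AGATCGGAAGAGC"


def preceding_base_percentages(reads, adapter=ADAPTER):
    """Tally the base directly in front of the adapter, as a percentage per base.

    Hypothetical helper using exact substring matching only.
    Reads without the adapter are ignored.
    """
    counts = Counter()
    for read in reads:
        pos = read.find(adapter)
        if pos == -1:
            continue  # no adapter in this read
        # A read that starts with the adapter has no preceding base.
        base = read[pos - 1] if pos > 0 else "none/other"
        counts[base] += 1
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()} if total else {}


# Toy RRBS-like reads: inserts ending in TG are followed directly by the adapter.
toy_reads = [
    "CGGATTACAGTCTG" + ADAPTER,
    "TGGTTTAGGATTTG" + ADAPTER,
    "TGGAATTGTTAGCA" + ADAPTER,  # an odd one out, preceded by A
    "CGGTTAGGATTAGGTTAGTTTT",    # no adapter at all
]
percentages = preceding_base_percentages(toy_reads)
print(percentages)
```

In RRBS libraries nearly every adapter-containing read ends in the MspI-derived G, so a G share of around 90% is what you would expect rather than a sign of an incomplete adapter.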

Read 2, on the other hand, is the reverse complement of Read 1, so if Read 1 has the bias at its 3' end, Read 2 will have it right at the start:

R1: 5' ...NNNCTG AAAAAAA 3'
R2: 3' ...NNNGAC AAAAAAA 5'

Since the TG in Read 1 is artificially unmethylated, so is the CA in Read 2 (A being the residue that is called as unmethylated for Read 2).
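If it helps, the relationship can be checked with a few lines of Python. This is only a sketch: the insert sequence is made up, and the helper name `reverse_complement` is mine, not part of Trim Galore.

```python
# Complement table for the four bases plus N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(COMPLEMENT)[::-1]


# Made-up Read 1 insert: starts with CGG (MspI site, genomic C retained)
# and ends with the filled-in, converted TG that sits in front of the adapter.
r1_insert = "CGGATTACAGTCTG"
r2_insert = reverse_complement(r1_insert)

print(r1_insert)  # CGGATTACAGTCTG
print(r2_insert)  # CAGACTGTAATCCG
```

The TG at the 3' end of Read 1 shows up as CA in the first two positions of Read 2, while the unbiased CGG at the start of Read 1 ends up as CCG at the far end of Read 2.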

The end of Read 2 should look like this:

R1: 5' AAAAA [TC]GGNNNNN... 3'
R2: 3' AAAAA [AG]CCNNNNN... 5'

Since the T or C at the start of Read 1 is not filled in, it reflects the genomic methylation state. And since Read 2 is a carbon copy (the reverse complement) of Read 1, so does the A or G in Read 2.

This is why the --rrbs option removes the last 2 bases before hitting the adapter in Read 1, but removes the first 2 bases of Read 2. Let me know if I was unclear anywhere.
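As a rough illustration of that rule (not the actual Trim Galore implementation), the RRBS-specific part of the trimming could be sketched like this. The function names and toy reads are hypothetical, the adapter string is an illustrative placeholder, and the exact substring search stands in for Cutadapt's error-tolerant adapter matching.

```python
# Illustrative adapter string (assumption); the real run uses whichever adapter was detected or specified.
ADAPTER = "AGATCGGAAGAGC"


def rrbs_trim_read1(seq, adapter=ADAPTER):
    """Cut Read 1 at the adapter and drop the 2 biased bases in front of it.

    Hypothetical helper: a read without the adapter is returned unchanged.
    """
    pos = seq.find(adapter)
    if pos == -1:
        return seq
    return seq[:max(pos - 2, 0)]


def rrbs_trim_read2(seq):
    """Drop the first 2 bases of Read 2, where the filled-in positions end up."""
    return seq[2:]


# Toy pair: Read 1 runs into the adapter, Read 2 is the reverse complement of the insert.
r1 = "CGGATTACAGTCTG" + ADAPTER
r2 = "CAGACTGTAATCCG"

r1_trimmed = rrbs_trim_read1(r1)
r2_trimmed = rrbs_trim_read2(r2)
print(r1_trimmed)  # CGGATTACAGTC
print(r2_trimmed)  # GACTGTAATCCG
```

Both trimmed reads keep the MspI-derived bases at the unbiased end, and only the two filled-in positions are discarded from each.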