FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

Function to remove reads with incomplete bisulfite conversion #76

Closed Shicheng-Guo closed 7 years ago

Shicheng-Guo commented 7 years ago

Hi Felix,

Do you think we can introduce a "function to remove reads with incomplete bisulfite conversion"

Regards,

Shicheng-Guo commented 7 years ago

-x, --rm-SX Removed reads with tag 'XS:i:1', which would be considered as not fully converted by bisulfite treatment [Default: False]

FelixKrueger commented 7 years ago

Hi Shicheng,

Filtering reads for incomplete conversion is a bit of a tricky issue, mainly because of the very definition of what is considered non-conversion. For model organisms such as human or mouse it is generally accepted that methylation in non-CG context is very low, so would you say reads containing a certain number of non-converted Cs in non-CG context classify for non conversion? How many non converted Cs would you require, a single one (which might arguably also remove true methylation calls and thereby bias your analysis), or perhaps three or more? Would they have to be consecutive or just anywhere in the read? Would interspersed CpG calls reset the count? For species such as plants which happen to have appreciable levels of non-CG methylation, what would the thresholds be?

Since I don't think that there is a straight forward solution for all cases I would probably rather suggest to run a quick filtering script on the resulting BAM file rather than adding a new option into Bismark directly. We have done this for some projects in the past, and I could dig out the code for such a filtering script for you if that would help?

Shicheng-Guo commented 7 years ago

I couldn't agree with you more. It would be a good idea to build some function to filter out the incomplete bisulfite conversion reads to the bam file. I think 'filterring incomplete bisulfite conversion reads' is quite important feature for bismark. I am focusing on human genome, therefore, it is quite easy for me. As you known, for human, CHH methylation level should be less <5%, when I find this methyaltion level is higher than 10% or more. I know there is something wrong with the bisulfite conversion.

for human, if the reads are not converted successfully, I think, we would observe CC, CA, CT in the reads (simultaneously), I think such kinds of reads should be removed.

We can also try to use the proportion of Cs in CHH, if the ratio is large then 5% for human, then it definitely would be non-converted reads,

Regards,

Shicheng-Guo commented 7 years ago

Hi Felix,

This is what BSSeeker2 done in this processing to non-converted reads. Filtering reads with incomplete bisulfite conversion

High bisulfite conversion rate is a critical factor for accurately estimating the methylation levels. Incomplete bisulfite conversion may lead to an overestimation of the methylation level. BS-Seeker2 provides a computational solution to remove reads with incomplete bisulfite conversion (Additional file 1: Supplementary Methods). Unmethylated phage DNA is often spiked into the samples as a control, to measure bisulfite conversion rates. We analysed the methylation pattern of the phage reads and observed two groups of reads from the distribution of unconverted cytosine sites: sporadically distributed and densely distributed. The sporadic group could be due to random un-conversion failure, or from T-to-C sequencing errors. The dense group is a set of reads that are almost entirely un-converted, potentially caused by the formation of secondary structure. We found that 82% of un-converted non-CpG sites were in the dense group, and only 18% were in the sporadic group. The same pattern was also observed on the mouse DNA data (Additional file 1: Figure S1). BS-Seeker2 provides an optional function to remove reads with densely un-converted non-CpG sites. To validate the feasibility, we mapped RRBS reads of two technical replicates (from the same mouse sample but different libraries), denoted as data sets A and B. The calculated methylation levels of non-CpG contexts show about 5-fold difference between the two replicates (Figure 4C). After removing the potentially un-converted reads, the methylation level gaps of non-CpG contexts were narrowed (to about 2-fold). However, this option is not suggested for samples with highly methylated non-CpG context, as it might reject bona fide methylated reads.

avilella commented 7 years ago

My 2c here:

I agree it would be good to add this functionality in the bismark package. Best option: post-processing after alignment.

Looking forward to it!

On Tue, Nov 15, 2016 at 2:22 AM, Shicheng Guo notifications@github.com wrote:

Hi Felix,

This is what BSSeeker2 done in this processing to non-converted reads. Filtering reads with incomplete bisulfite conversion

High bisulfite conversion rate is a critical factor for accurately estimating the methylation levels. Incomplete bisulfite conversion may lead to an overestimation of the methylation level. BS-Seeker2 provides a computational solution to remove reads with incomplete bisulfite conversion (Additional file 1: Supplementary Methods). Unmethylated phage DNA is often spiked into the samples as a control, to measure bisulfite conversion rates. We analysed the methylation pattern of the phage reads and observed two groups of reads from the distribution of unconverted cytosine sites: sporadically distributed and densely distributed. The sporadic group could be due to random un-conversion failure, or from T-to-C sequencing errors. The dense group is a set of reads that are almost entirely un-converted, potentially caused by the formation of secondary structure. We found that 82% of un-converted non-CpG sites were in the dense group, and only 18% were in the sporadic group. The same pattern was also observed on the mouse DNA data (Additional file 1: Figure S1). BS-Seeker2 provides an optional function to remove reads with densely un-converted non-CpG sites. To validate the feasibility, we mapped RRBS reads of two technical replicates (from the same mouse sample but different libraries), denoted as data sets A and B. The calculated methylation levels of non-CpG contexts show about 5-fold difference between the two replicates (Figure 4C). After removing the potentially un-converted reads, the methylation level gaps of non-CpG contexts were narrowed (to about 2-fold). However, this option is not suggested for samples with highly methylated non-CpG context, as it might reject bona fide methylated reads.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/FelixKrueger/Bismark/issues/76#issuecomment-260528079, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJpN--XflbWaGwUlrBMjouhwNBZS0vDks5q-RdPgaJpZM4KwYqs .

FelixKrueger commented 7 years ago

Right, I'll see what I can do... Stay tuned.

FelixKrueger commented 7 years ago

I have now added a new script called filter_non_conversion to the Bismark repo that scans Bismark BAM files for too many methylation calls in non-CG context. The current features are:

Added here: 5d804cfb2b72583a53382aed3e8a0ed588589334.

Could you please clone the repo and give it a whirl? I am open for any further suggestions. Cheers, Felix

Shicheng-Guo commented 7 years ago

Hi Felix,

I have tried the function of _filter_nonconversion. it works perfect.

Shicheng

FelixKrueger commented 7 years ago

Excellent, thanks for testing.