BSSeeker / BSseeker2

A versatile aligning pipeline for bisulfite sequencing data
http://pellegrini.mcdb.ucla.edu/BS_Seeker2/
MIT License
60 stars 25 forks source link

Potential bias introduced by FilterReads.py #17

Closed yupenghe closed 6 years ago

yupenghe commented 6 years ago

I just found that doing read filter using FilterReads.py may introduce bias that is not easily noticed. This may has been reported previously but I feel it may be useful to put a warning on README such that people realize such potential bias.

The bias will lead to overestimation of methylation level at lowly methylated sites (<50%) and underestimation at highly methylated sites (>50%). Suppose there are 100 reads covering one isolated CpG whose actual methylation level is 10%. We would expect 90 reads contain T basecalls and 10 reads contain C basecalls, which give us 90% methylation level. If we do FilterReads.py, then only one read with C basecall and one read with T basecall (two in each case if we consider both strands) will be kept, resulting in 50% methylation level. Such bias is more prominent on regions with high coverage and drags methylation level towards 50%. On lowly covered region, the effect is moderate.

guoweilong commented 6 years ago

Hi @yupenghe Yupeng, thanks for your comment. Actrually, the FilterReads.py script is an optional one, which were added on the request of some users of BS-Seeker2. I have added warning message in the script, and README file.

For WGBS, it is still useful to remove reads from PCR effect; and "cgmaptools rmdup" also works in similiar way. For RRBS, it is not suggested to use this script; as more reads from the same CCGG----CCGG fragment would have similar starts and ends.

Thanks for your message. And looking forward for your further feedback!

Weilong Guo

yupenghe commented 6 years ago

Thanks very much, Weilong. That makes sense. Just want to share some further thoughts with you. My colleague and I found this issue because doing FilterReaD.py always gives very high bisulfite non-conversion rate (~3%) on unmethylated spike-in control (which frequently has >100x coverage without clonal removal). I tried to use picard for removing PCR duplicates and the relatively high non-conversion rate disappears (to 0.3%). I don't think it will affect much on autosomes and sex chromosomes due to much lower coverage compared to the spike-in.

Yupeng

guoweilong commented 6 years ago

Thanks Yupeng!

Recently, our new software package "CGmapTools : https://cgmaptools.github.io" is published in Bioinformatics ("CGmapTools improves the precision of hetero- zygous SNV calls and supports allele-specific methylation detection and visualization in bisulfite-sequencing data, Bioinformatics (2017), btx595"). This package is designed as direct downstream analysis package following BS-Seeker2. Hope you will find it be useful. Looking forward to feedback for both BS-Seeker2 and CGmapTools.

Best, Weilong

At 2017-12-08 11:08:02, "Yupeng He" notifications@github.com wrote:

Thanks very much, Weilong. That makes sense. Just want to share some further thoughts with you. My colleague and I found this issue because doing FilterReaD.py always gives very high bisulfite non-conversion rate (~3%) on unmethylated spike-in control (which frequently has >100x coverage without clonal removal). I tried to use picard for removing PCR duplicates and the relatively high non-conversion rate disappears (to 0.3%). I don't think it will affect much on autosomes and sex chromosomes due to much lower coverage compared to the spike-in.

Yupeng

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.