BSSeeker / BSseeker2

A versatile aligning pipeline for bisulfite sequencing data
http://pellegrini.mcdb.ucla.edu/BS_Seeker2/
MIT License
60 stars 25 forks source link

Number of Cs reported in ATCGmap #19

Closed akramdi closed 5 years ago

akramdi commented 5 years ago

Hi,

I am working with a reference genome that contains about 4310^6 Cs genome wide (tair10). I noticed that the reported number of cytosines in the ATCGmap output file does not match the number of reference Cs (I counted the number of Cs and Gs positions in ATCGmap file and found about 4210^6). Moreover, this total number of Cs reported is not constant between samples.

I assumed that the ATCGmap file would contain information for every reference C. Why this difference?

Best, Amira

guoweilong commented 5 years ago

Thanks for your question.

The ATCGmap file will only contain C site with at least 1 read covered.

Best, Weilong Guo

akramdi commented 5 years ago

Thanks for the quick reply!

Indeed, all reported Cs are covered by at least 1 read. I actually wondered about this possibility, so I checked the coverage information in the columns used to compute the methylation level (col $8 and 7$ if C in + strand ; col $14 and $11 if C in - strand), there were still Cs with 0 coverage ($8+$7=0 or $11+$14=0), so I ignored it.

Which brings me to another question about the methylation level (it might be trivial, please bear with me):

If we have a C on the + strand, why not consider $13 (# of reads from Crick strand mapped here, support C on Watson strand and G on Crick strand) as reads supporting a state of methylation?

Best, Amira

guoweilong commented 5 years ago

For a C on the Watson strand in reference genome, the $13 column would not contribute any information for the methylation status.

Actually, if you read the CGmap file, it would have fewer C/G sites, which only report positions those methylation-effective coverage (MEC) >=1.

Best, Weilong

akramdi commented 5 years ago

Got it, thanks!

Best, Amira