FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

what is the parameter of deciding a specific position methylated or unmethylated? #263

Closed AudryZhao closed 5 years ago

AudryZhao commented 5 years ago

Hi Felix,

  By using the amazing Bismark, we can get the methylation percentage of a specific position. And we also can get the number total methylated C from *_report.txt file. 
  I am curious to know how Bismark decide a specific position methylated or not, when counting total methylated C.

Thanks.

FelixKrueger commented 5 years ago

Bismark decides whether a cytosine is methylated or not on a per-read basis, purely based on whether there was a C (methylated) or T (unmethylated = bisulfite converted) in the read.

Further downstream steps such as the methylation extraction, and coverage file generation, will then collect all the evidence of all different reads covering a certain position. This will then give you a percentage of methylation for each covered position to estimate the average methylation.

Does this help?

AudryZhao commented 5 years ago

Maybe I did not explain my question clearly. An example may help. Like these two position in the coverage file:

FelixKrueger commented 5 years ago

All that this means is that there were 83 separate reads in total which covered position 4445, and out of these 2 were found methylated, and 81 unmethylated. Bismark doesn't perform any kind of overarching methylation call for a position as such, it really just sums up all individual reads that covered this position.

Assuming this position is a CpG dinucleotide, where 4445 is the top strand C, and 4446 is the bottom strand C, you are probably puzzled why the two Cs don't have a more similar methylation signature? I am afraid this isn't something I can answer easily. All the coverage file is telling you is that the top position was covered by 83 reads, the bottom position by 106 reads, and the methylation levels seem to be quite different.

One possibility is that at least one allele had a C-T SNP at the top strand C (which would be called as unmethylated), or maybe the read mis-mapped to that position?

You could try to use coverage2cytosine and there use the option --merge_CpG. This then has an option to filter out non-agreeing CpG positions and write them to a different file, so you could try and investigate those positions a bit further.

AudryZhao commented 5 years ago

so, when count the number of total methylated C, Bismark just summarize all the reads that are not bisulfite converted, like 2+91 in the above example. and 81+15 is the number of unmethylated C.

am i right?

FelixKrueger commented 5 years ago

In its default mode, Bismark works on single-cytosine resolution, and therefore doesn't do any summing up of top and bottom strands or make any other assumptions. It simply says that 2 (out of 83) were found methylated at pos 4445, and 95 (out of 106) were found methylated at pos 4446. Nothing more, nothing less.

If you were to use coverage2cytosine with the options --merge_CpG and --discordant, only THEN would it lump together together all methylated calls, and all unmethylated calls, and calculate a single percentage methylation for the CpG-dinucleotide. This would in this case give you a completely wrong impression though because one C is highly unmethylated, the other one highly methylated, so the average would be somewhere in the middle.

For this reason, the option --discordant will first calculate the percentages of top and bottom strand Cs separately, and see if they agree with other. Only if they do will coverage2cyosine continue to merge top and bottom strand into a single CpG dinucleotide entity. If they do not agree, this position will be written to a separate 'discordant' coverage file. Does that make sense?

AudryZhao commented 5 years ago

I guess I get your point, finally. Thank you for your time and patience. Have a nice day.

FelixKrueger commented 5 years ago

Excellent, glad this is is sorted.