Closed AudryZhao closed 5 years ago
Bismark decides whether a cytosine is methylated or not on a per-read basis, purely based on whether there was a C (methylated) or T (unmethylated = bisulfite converted) in the read.
Further downstream steps such as the methylation extraction, and coverage file generation, will then collect all the evidence of all different reads covering a certain position. This will then give you a percentage of methylation for each covered position to estimate the average methylation.
Does this help?
Maybe I did not explain my question clearly. An example may help. Like these two position in the coverage file:
Contig5386 4445 4445 2.4096 2 81
Contig5386 4446 4446 85.8490 91 15
the first position had 2 Cs, and 81 Ts. while the second position had 91 Cs and 15 Ts. when did methylation call, how does Bismark decide these two position to be "Z" or "z" (say they are in CG context)?
All that this means is that there were 83 separate reads in total which covered position 4445, and out of these 2 were found methylated, and 81 unmethylated. Bismark doesn't perform any kind of overarching methylation call for a position as such, it really just sums up all individual reads that covered this position.
Assuming this position is a CpG dinucleotide, where 4445 is the top strand C, and 4446 is the bottom strand C, you are probably puzzled why the two Cs don't have a more similar methylation signature? I am afraid this isn't something I can answer easily. All the coverage file is telling you is that the top position was covered by 83 reads, the bottom position by 106 reads, and the methylation levels seem to be quite different.
One possibility is that at least one allele had a C-T SNP at the top strand C (which would be called as unmethylated), or maybe the read mis-mapped to that position?
You could try to use coverage2cytosine
and there use the option --merge_CpG
. This then has an option to filter out non-agreeing CpG positions and write them to a different file, so you could try and investigate those positions a bit further.
so, when count the number of total methylated C, Bismark just summarize all the reads that are not bisulfite converted, like 2+91 in the above example. and 81+15 is the number of unmethylated C.
am i right?
In its default mode, Bismark works on single-cytosine resolution, and therefore doesn't do any summing up of top and bottom strands or make any other assumptions. It simply says that 2 (out of 83) were found methylated at pos 4445, and 95 (out of 106) were found methylated at pos 4446. Nothing more, nothing less.
If you were to use coverage2cytosine
with the options --merge_CpG
and --discordant
, only THEN would it lump together together all methylated calls, and all unmethylated calls, and calculate a single percentage methylation for the CpG-dinucleotide. This would in this case give you a completely wrong impression though because one C is highly unmethylated, the other one highly methylated, so the average would be somewhere in the middle.
For this reason, the option --discordant
will first calculate the percentages of top and bottom strand Cs separately, and see if they agree with other. Only if they do will coverage2cyosine
continue to merge top and bottom strand into a single CpG dinucleotide entity. If they do not agree, this position will be written to a separate 'discordant' coverage file. Does that make sense?
I guess I get your point, finally. Thank you for your time and patience. Have a nice day.
Excellent, glad this is is sorted.
Hi Felix,
Thanks.