the current main branch does not really consider the modification probability deeply, but it is actually important to do so to properly visualize it. this is because the data can report multiple different modifications at the same position, and it is best to only render the most probable one.
SAMtags.pdf shows an example where both 5mc and 5hmc are listed for each position, and the user should choose the modificaiton that has the highest probability (or even choose "no modification" if neither is a high probability. specifically, multiple modifications at the same position can't sum up to greater than 1.0 probability, and i believe, chemically, only one modification is even possible. otherwise it would get a chemical code. therefore, double counting modifications at a single position is misleading, but that's what we have in our UI on main
Rendering after this PR
IGV rendering
i modified the rendering to more closely follow IGV. it is a bit copy cat, but IGV does a lot of things right i believe.
text from SAMtags about the probabilities in the ML tag
"Note where several possible modifica-
tions are presented at the same site, the ML values represent the absolute probabilities of the modifi-
cation call being correct and not the relative likelihood between the alternatives. These probabilities
should not sum to above 1.0 (≈ 256 in integer encoding, allowing for some minor rounding errors),
but may sum to a lower total with the remainder representing the probability that none of the listed
modification types are present. In the example used above, the 6th C has 80% chance of being 5mC,
10% chance of being 5hmC and 10% chance of being an unmodified C"
coverage calculation
Now in coverage: now in modification mode the snpcoverage does not draw the raw number of modifications at the position, but the proportion of "modifiable" bases at that position. this aligns better with user expectations and is what igv does. therefore, for a CpG, on the forward strand, only half the reads will be a C at the CpG position, and the other half will be G on the reverse strand, but only the C can be methylated. but if all the C's there are methylated, then basically we can draw that as that this position is "fully modified", therefore maxing out the y-axis of the coverage track
modification colors
I adjusted the color scheme of modifications to match IGV
retained the "methylation" mode
i was hoping to maybe get rid of this, but the data files for BAM simply do not indicate unmethylated positions in some cases
igv is not to my knowledge, in this case, is not able to draw the unmethylated cpg's
Rendering before this PR
the current main branch does not really consider the modification probability deeply, but it is actually important to do so to properly visualize it. this is because the data can report multiple different modifications at the same position, and it is best to only render the most probable one.
SAMtags.pdf shows an example where both 5mc and 5hmc are listed for each position, and the user should choose the modificaiton that has the highest probability (or even choose "no modification" if neither is a high probability. specifically, multiple modifications at the same position can't sum up to greater than 1.0 probability, and i believe, chemically, only one modification is even possible. otherwise it would get a chemical code. therefore, double counting modifications at a single position is misleading, but that's what we have in our UI on main
Rendering after this PR
IGV rendering
i modified the rendering to more closely follow IGV. it is a bit copy cat, but IGV does a lot of things right i believe.
text from SAMtags about the probabilities in the ML tag
https://samtools.github.io/hts-specs/SAMtags.pdf
"Note where several possible modifica- tions are presented at the same site, the ML values represent the absolute probabilities of the modifi- cation call being correct and not the relative likelihood between the alternatives. These probabilities should not sum to above 1.0 (≈ 256 in integer encoding, allowing for some minor rounding errors), but may sum to a lower total with the remainder representing the probability that none of the listed modification types are present. In the example used above, the 6th C has 80% chance of being 5mC, 10% chance of being 5hmC and 10% chance of being an unmodified C"
coverage calculation
Now in coverage: now in modification mode the snpcoverage does not draw the raw number of modifications at the position, but the proportion of "modifiable" bases at that position. this aligns better with user expectations and is what igv does. therefore, for a CpG, on the forward strand, only half the reads will be a C at the CpG position, and the other half will be G on the reverse strand, but only the C can be methylated. but if all the C's there are methylated, then basically we can draw that as that this position is "fully modified", therefore maxing out the y-axis of the coverage track
modification colors
I adjusted the color scheme of modifications to match IGV
retained the "methylation" mode
i was hoping to maybe get rid of this, but the data files for BAM simply do not indicate unmethylated positions in some cases
igv is not to my knowledge, in this case, is not able to draw the unmethylated cpg's